Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Incident Response, Threat Hunting, Log Analysis, Digital Forensics

Cyber Defense Benchmark

Agentic Threat Hunting Evaluation for LLMs in SecOps: measures how well LLM agents perform threat hunting, a core SOC analyst task, on raw Windows event logs.

View Paper
Quick Stats

Top Score

3.8%

Models Evaluated

5

Dataset Size

106 samples

Last Updated

April 28, 2026

Availability

Dataset ✓, Code ✓
Metrics Tracked
flag recall, tactics passed
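
This page does not define how flag recall is computed. As a rough sketch only, assuming it follows the standard recall definition (the share of ground-truth malicious events that the agent actually flagged), it could be calculated as below. The function name and the use of integer record IDs are hypothetical and are not taken from the benchmark code.

```python
def flag_recall(flagged_ids, malicious_ids):
    """Standard recall: flagged ground-truth events / all ground-truth events.

    Assumption: events are identified by hashable record IDs. The benchmark's
    real scoring code may differ from this sketch.
    """
    malicious = set(malicious_ids)
    if not malicious:
        # No ground truth to recover, so report 0.0 rather than divide by zero.
        return 0.0
    true_positives = malicious & set(flagged_ids)
    return len(true_positives) / len(malicious)


# Hypothetical episode: 4 malicious records, the agent flags 3 records,
# only one of which (ID 2) is actually malicious.
score = flag_recall(flagged_ids=[2, 7, 9], malicious_ids=[2, 3, 4, 5])
```

Under this definition, flagging benign records does not lower the score, but it does not raise it either; only recovered malicious records count.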
Dataset Information

106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics, wrapped in a Gymnasium RL environment. Each episode presents 75,000-135,000 Windows event log records in an in-memory SQLite database.
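
To make the episode setup more concrete, here is a minimal sketch of hunting over event records held in an in-memory SQLite database. The table name, column names, sample rows, and hunting heuristic are all hypothetical; the benchmark's actual schema and Gymnasium interface are not described on this page.

```python
import sqlite3


def build_episode_db(records):
    """Load event records into an in-memory SQLite database.

    Assumption: a single 'events' table with these five columns. The real
    benchmark schema is not given here.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "record_id INTEGER PRIMARY KEY, event_id INTEGER, "
        "channel TEXT, image TEXT, command_line TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", records)
    conn.commit()
    return conn


def hunt_encoded_powershell(conn):
    """Return record IDs of Sysmon process-creation events (Event ID 1)
    where PowerShell was launched with an encoded command."""
    rows = conn.execute(
        "SELECT record_id FROM events "
        "WHERE event_id = 1 "
        "AND image LIKE '%powershell.exe' "
        "AND command_line LIKE '%-enc%' "
        "ORDER BY record_id"
    ).fetchall()
    return [row[0] for row in rows]


# Three illustrative records; a real episode would hold 75,000-135,000.
SYSMON = "Microsoft-Windows-Sysmon/Operational"
sample = [
    (1, 1, SYSMON, r"C:\Windows\System32\notepad.exe", "notepad.exe"),
    (2, 1, SYSMON,
     r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
     "powershell.exe -enc SQBFAFgA"),
    (3, 4624, "Security", None, None),
]

conn = build_episode_db(sample)
flagged = hunt_encoded_powershell(conn)
```

In this sketch the agent's work splits into formulating a SQL query and flagging the record IDs it returns, which loosely mirrors the SQL query formulation and malicious event flagging tasks listed for the benchmark.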

Number of Tasks

3

Threat Hunting, SQL Query Formulation, Malicious Event Flagging
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | Model ID | Provider | Flag recall | Tactics passed | Evaluated by | Date | Source
1 | Claude Opus 4.6 | claude-opus-4-6 | Anthropic | 3.8% | 38.5% | Cyber Defense Benchmark authors | April 28, 2026 | Link
2 | GPT-5 | gpt-5 | OpenAI | 2.9% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link
3 | Gemini 3.1 Pro | gemini-3-1-pro | Google | 2.1% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link
4 | Kimi K2.5 | kimi-k2-5 | Moonshot AI | 1.7% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link
5 | Gemini 3 Flash | gemini-3-flash | Google | 1.3% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub