
Agentic Threat Hunting Evaluation for LLMs in SecOps: measures how well LLM agents perform the core SOC analyst task of threat hunting on raw Windows event logs.
Top Score: 3.8%
Models Evaluated: 5
Dataset Size: 106 samples
Last Updated: April 28, 2026
Availability: 106 real attack procedures from the OTRF Security-Datasets corpus, spanning 86 MITRE ATT&CK sub-techniques across 12 tactics, wrapped in a Gymnasium RL environment. Each episode presents 75,000 to 135,000 Windows event log records in an in-memory SQLite database.
Number of Tasks: 3
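
To make the episode format concrete, here is a minimal sketch of how an agent might query Windows event records held in an in-memory SQLite database. The table name `events`, its columns, the sample rows, and the `hunt` helper are all hypothetical. The benchmark's actual schema and tooling are not described on this page, so treat this as an illustration of the general approach rather than the environment's real interface.

```python
import sqlite3

# Hypothetical schema and rows: the benchmark's real table and column
# names are not given on this page.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "record_id INTEGER PRIMARY KEY, event_id INTEGER, "
    "channel TEXT, time_created TEXT, message TEXT)"
)
rows = [
    (1, 4624, "Security", "2020-09-01T10:00:00Z",
     "An account was successfully logged on."),
    (2, 4688, "Security", "2020-09-01T10:00:05Z",
     "New process: powershell.exe -enc SQBFAFgA"),
    (3, 4688, "Security", "2020-09-01T10:00:09Z",
     "New process: notepad.exe"),
    (4, 1, "Microsoft-Windows-Sysmon/Operational", "2020-09-01T10:00:12Z",
     "Process Create: rundll32.exe"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)


def hunt(conn, event_id, keyword):
    """Return record IDs for one event ID whose message contains keyword."""
    cur = conn.execute(
        "SELECT record_id FROM events "
        "WHERE event_id = ? AND message LIKE ? ORDER BY record_id",
        (event_id, f"%{keyword}%"),
    )
    return [r[0] for r in cur.fetchall()]


# Process-creation events (4688) that launched an encoded PowerShell command.
suspicious = hunt(conn, 4688, "-enc")
```
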
| Rank | Model | Flag Recall | Tactics Passed | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|
| 1st | Claude Opus 4.6 (`claude-opus-4-6`, Anthropic) | 3.8% | 38.5% | Cyber Defense Benchmark authors | April 28, 2026 | Link |
| 2nd | GPT-5 (`gpt-5`, OpenAI) | 2.9% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link |
| 3rd | Gemini 3.1 Pro (`gemini-3-1-pro`, Google) | 2.1% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link |
| 4th | Kimi K2.5 (`kimi-k2-5`, Moonshot AI) | 1.7% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link |
| 5th | Gemini 3 Flash (`gemini-3-flash`, Google) | 1.3% | 0.0% | Cyber Defense Benchmark authors | April 28, 2026 | Link |
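
The table reports flag recall without defining it on this page. One plausible reading, stated here as an assumption and not as the benchmark's confirmed scoring rule, is the fraction of ground-truth malicious records ("flags") that the agent submitted. The function name `flag_recall` and the record IDs below are hypothetical.

```python
def flag_recall(submitted, ground_truth):
    """Fraction of ground-truth flags found in the submitted set.

    Assumed definition; the benchmark's exact scoring may differ.
    """
    truth = set(ground_truth)
    if not truth:
        return 0.0
    return len(truth & set(submitted)) / len(truth)


# Hypothetical record IDs, for illustration only.
score = flag_recall(submitted=[2, 7, 9], ground_truth=[2, 4, 9, 11])
```

Under this reading, submitting extra records that are not flags (such as record 7 above) does not lower recall, so a separate precision or pass/fail criterion would be needed to penalize over-reporting.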