
First benchmark to evaluate LLM agents on cyber threat investigation using security question-answering derived from real-world investigation graphs.
Top Score: 60.6%
Models Evaluated: 7
Dataset Size: 7,542 samples
Last Updated: July 14, 2025
Availability
The benchmark comprises 8 distinct multi-stage attack chains in a fictional Microsoft Azure tenant ('Alpine Ski House'), covering 57 log tables from Microsoft Sentinel and related services. From bipartite incident graphs, 7,542 questions are generated, of which 589 form the test set. Agents investigate by querying a MySQL database and are scored with evaluator rewards (settings: temperature=0, max_step=25, GPT-4o as judge).
Number of Tasks: 4
| Rank | Model (ID), Provider | Avg. Reward | Evaluated By | Date | Source |
|---|---|---|---|---|---|
| 1 | Claude 4.5 Opus (`claude-opus-4-5`), Anthropic | 60.6% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link |
| 2 | GPT-5.1 (`gpt-5.1-reasoning-high`), OpenAI | 58.2% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link |
| 3 | Claude 4.5 Sonnet (`claude-sonnet-4-5`), Anthropic | 48.7% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link |
| 4 | OpenAI o3 (`o3`), OpenAI | 45.6% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link |
| 5 | GPT o4-mini (`gpt-o4-mini`), OpenAI | 36.8% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link |
| 6 | Grok 4 (`grok-4`), xAI | 34.4% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link |
| 7 | GPT-4.1 (`gpt-4.1`), OpenAI | 33.8% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link |
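
The evaluation setup summarized above (an agent that queries a SQL database under a 25-step budget and is then scored with a reward) can be pictured with a minimal sketch. This is not the benchmark's actual harness: `sqlite3` stands in for MySQL so the example runs without a server, the `signin_logs` table, the `Policy` interface, and the scripted policy are hypothetical, and the GPT-4o judge is not reproduced. `average_reward` assumes the "avg reward" column is a plain mean of per-question rewards, which the page does not state explicitly.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

MAX_STEP = 25  # step budget quoted in the benchmark summary

# One past interaction: (SQL the agent sent, observation it got back).
History = List[Tuple[str, str]]
# Hypothetical policy interface: returns ("query", sql) or ("answer", text).
Policy = Callable[[History], Tuple[str, str]]


def run_episode(conn: sqlite3.Connection, policy: Policy,
                max_step: int = MAX_STEP) -> Tuple[Optional[str], int]:
    """Let the policy query the database until it answers or the budget runs out."""
    history: History = []
    for step in range(1, max_step + 1):
        kind, text = policy(history)
        if kind == "answer":
            return text, step
        try:
            observation = repr(conn.execute(text).fetchall())
        except sqlite3.Error as exc:
            # Feed SQL errors back to the agent instead of aborting the episode.
            observation = f"error: {exc}"
        history.append((text, observation))
    return None, max_step  # budget exhausted without a final answer


def average_reward(rewards: List[float]) -> float:
    """Mean of per-question rewards (assumed to lie in [0, 1])."""
    return sum(rewards) / len(rewards) if rewards else 0.0


def demo() -> Tuple[Optional[str], int]:
    """Toy run: one illustrative log table and a two-step scripted policy."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signin_logs (account TEXT, ip TEXT)")
    conn.execute("INSERT INTO signin_logs VALUES ('alice', '10.0.0.5')")

    def scripted_policy(history: History) -> Tuple[str, str]:
        if not history:
            return "query", "SELECT ip FROM signin_logs WHERE account = 'alice'"
        return "answer", history[-1][1]  # answer with the last observation

    return run_episode(conn, scripted_policy)


if __name__ == "__main__":
    answer, steps = demo()
    print(f"answer={answer!r} after {steps} step(s)")
```

In a real harness the scripted policy would be replaced by an LLM call and the returned answer would be passed to a judge model to obtain the per-question reward; both are left out here because the page does not describe their interfaces.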