Incident Response · Threat Hunting · Log Analysis · Incident Triage

ExCyTIn-Bench

First benchmark to evaluate LLM agents on cyber threat investigation using security question-answering derived from real-world investigation graphs.

Quick Stats

Top Score

60.6%

Models Evaluated

Dataset Size

7,542 samples

Last Updated

July 14, 2025

Availability

Dataset ✓ · Code ✓

Metrics Tracked

avg reward

Sources

Project · Dataset

Dataset Information

The benchmark covers 8 distinct multi-stage attack chains in a fictional Microsoft Azure tenant ('Alpine Ski House'), with logs spread across 57 tables from Microsoft Sentinel and related services. 7,542 questions are generated from bipartite incident graphs, of which 589 form the test set. Agents investigate by querying a MySQL database, and their answers receive evaluator rewards. Evaluation settings: temperature=0, max_step=25, GPT-4o as the judge.
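The query-and-observe loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: it uses Python's built-in `sqlite3` as a stand-in for the MySQL database, and the `investigate` function, the `Policy` callable, the scripted policy, and the demo table columns are all hypothetical names introduced here. Only the step budget of 25 comes from the description.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

MAX_STEP = 25  # step budget quoted in the dataset description

# One past turn: (sql_query, observation_text)
Turn = Tuple[str, str]
# Hypothetical policy interface: returns ("query", sql) or ("answer", text)
Policy = Callable[[str, List[Turn]], Tuple[str, str]]


def investigate(conn: sqlite3.Connection, question: str, policy: Policy,
                max_step: int = MAX_STEP) -> Optional[str]:
    """Let the policy query the log database until it answers or runs out of steps."""
    history: List[Turn] = []
    for _ in range(max_step):
        kind, payload = policy(question, history)
        if kind == "answer":
            return payload
        try:
            observation = repr(conn.execute(payload).fetchall())
        except sqlite3.Error as exc:
            # Feed SQL errors back to the policy instead of aborting the episode
            observation = f"error: {exc}"
        history.append((payload, observation))
    return None  # budget exhausted without a submitted answer


def make_demo_db() -> sqlite3.Connection:
    """Tiny in-memory stand-in for the log database (columns are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE SecurityAlert (AlertName TEXT, CompromisedEntity TEXT)")
    conn.execute("INSERT INTO SecurityAlert VALUES ('Suspicious sign-in', 'host-01')")
    return conn


def scripted_policy(question: str, history: List[Turn]) -> Tuple[str, str]:
    """Placeholder for an LLM: issue one query, then answer with what came back."""
    if not history:
        return ("query", "SELECT CompromisedEntity FROM SecurityAlert")
    return ("answer", history[-1][1])


if __name__ == "__main__":
    print(investigate(make_demo_db(), "Which host was compromised?", scripted_policy))
```

In a real run the scripted policy would be replaced by a call to the model under test, and the returned answer would be passed to the judge for a reward.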

Number of Tasks

Threat Investigation · SQL Query Generation · Multi-Hop Reasoning · Log Analysis

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model (ID, Provider)	Avg Reward	Evaluated By	Date
1	Claude 4.5 Opus (claude-opus-4-5, Anthropic)	60.6%	ExCyTIn-Bench authors (Microsoft Security AI Research)	November 4, 2025
2	GPT-5.1 (gpt-5.1-reasoning-high, OpenAI)	58.2%	ExCyTIn-Bench authors (Microsoft Security AI Research)	November 4, 2025
3	Claude 4.5 Sonnet (claude-sonnet-4-5, Anthropic)	48.7%	ExCyTIn-Bench authors (Microsoft Security AI Research)	November 4, 2025
4	OpenAI o3 (o3, OpenAI)	45.6%	ExCyTIn-Bench authors (Microsoft Security AI Research)	November 4, 2025
5	GPT o4-mini (gpt-o4-mini, OpenAI)	36.8%	ExCyTIn-Bench authors (Microsoft Security AI Research)	November 4, 2025
6	Grok 4 (grok-4, xAI)	34.4%	ExCyTIn-Bench authors (Microsoft Security AI Research)	November 4, 2025
7	GPT-4.1 (gpt-4.1, OpenAI)	33.8%	ExCyTIn-Bench authors (Microsoft Security AI Research)	November 4, 2025
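
As a rough guide to reading the avg reward column, the sketch below shows one plausible way such a score could be computed. It assumes each test question receives a judge reward between 0 and 1 and that the reported figure is the mean expressed as a percentage; the page does not spell out the formula, and `avg_reward_percent` is a hypothetical helper name.

```python
from typing import Sequence


def avg_reward_percent(rewards: Sequence[float]) -> float:
    """Mean per-question reward as a percentage, rounded to one decimal place.

    Assumption: every reward lies in [0, 1], with partial credit allowed.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    for r in rewards:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reward out of range: {r}")
    return round(100.0 * sum(rewards) / len(rewards), 1)


if __name__ == "__main__":
    # Four illustrative (made-up) per-question rewards
    print(avg_reward_percent([1.0, 0.5, 0.0, 1.0]))  # prints 62.5
```

Under this reading, a score of 60.6% would mean the agent earned, on average, about 0.606 reward per test question.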