Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Incident Response, Threat Hunting, Log Analysis, Incident Triage

ExCyTIn-Bench

First benchmark to evaluate LLM agents on cyber threat investigation using security question-answering derived from real-world investigation graphs.

Quick Stats

Top Score: 60.6%
Models Evaluated: 7
Dataset Size: 7,542 samples
Last Updated: July 14, 2025
Availability: Dataset ✓, Code ✓
Metrics Tracked: avg reward
Sources: Project, Dataset
Dataset Information

The dataset covers 8 distinct multi-stage attack chains in a fictional Microsoft Azure tenant ('Alpine Ski House'), spanning 57 log tables from Microsoft Sentinel and related services. 7,542 questions are generated from bipartite incident graphs, of which 589 form the test set. Agents query a MySQL database and receive evaluator rewards. Evaluation settings: temperature=0, max_step=25, GPT-4o as judge.
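The query-and-answer loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `run_episode` function, the `("query", sql)` / `("answer", text)` action interface, the `signin_logs` table, and the scripted `demo_agent` are all hypothetical, and an in-memory SQLite database stands in for the MySQL database so the example runs without a server. Only the 25-step budget comes from the description; the GPT-4o judging step is not modeled.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

MAX_STEP = 25  # step budget quoted in the dataset description

# Hypothetical interface: an agent sees the question and the history of
# (sql, observation) pairs and returns ("query", sql) or ("answer", text).
Action = Tuple[str, str]
History = List[Tuple[str, str]]
Agent = Callable[[str, History], Action]


def run_episode(conn: sqlite3.Connection, question: str, agent: Agent,
                max_step: int = MAX_STEP) -> Optional[str]:
    """Let the agent issue SQL queries until it answers or the budget runs out."""
    history: History = []
    for _ in range(max_step):
        kind, payload = agent(question, history)
        if kind == "answer":
            return payload
        try:
            observation = repr(conn.execute(payload).fetchall())
        except sqlite3.Error as exc:
            # Feed SQL errors back to the agent instead of aborting the episode.
            observation = f"error: {exc}"
        history.append((payload, observation))
    return None  # budget exhausted without a final answer


def demo_agent(question: str, history: History) -> Action:
    """Scripted stand-in for an LLM agent: one query, then answer with its result."""
    if not history:
        return ("query", "SELECT account FROM signin_logs WHERE result = 'failure'")
    return ("answer", history[-1][1])


# Toy log table (assumption: not one of the benchmark's 57 tables).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signin_logs (account TEXT, result TEXT)")
conn.executemany("INSERT INTO signin_logs VALUES (?, ?)",
                 [("alice", "success"), ("bob", "failure")])

answer = run_episode(conn, "Which account had a failed sign-in?", demo_agent)
print(answer)
```

In a real run the returned answer would then be passed to the judge model to obtain a reward; an episode that returns `None` would presumably receive no credit, though the page does not state how exhausted budgets are scored.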

Number of Tasks: 4

Threat Investigation, SQL Query Generation, Multi-Hop Reasoning, Log Analysis
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | avg reward | Evaluated By | Date | Source
1 | Claude 4.5 Opus (claude-opus-4-5, Anthropic) | 60.6% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link
2 | GPT-5.1 (gpt-5.1-reasoning-high, OpenAI) | 58.2% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link
3 | Claude 4.5 Sonnet (claude-sonnet-4-5, Anthropic) | 48.7% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link
4 | OpenAI o3 (o3, OpenAI) | 45.6% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link
5 | GPT o4-mini (gpt-o4-mini, OpenAI) | 36.8% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link
6 | Grok 4 (grok-4, xAI) | 34.4% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link
7 | GPT-4.1 (gpt-4.1, OpenAI) | 33.8% | ExCyTIn-Bench authors (Microsoft Security AI Research) | November 4, 2025 | Link
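For readers who want to work with these numbers, the following sketch loads the scores listed above and derives each model's gap to the leader, plus the share of questions held out as the test set. The scores and counts are copied from this page; the variable names and the choice to report gaps in percentage points are illustrative assumptions.

```python
# avg reward (%) per model, copied from the results listed above.
leaderboard = [
    ("Claude 4.5 Opus", 60.6),
    ("GPT-5.1", 58.2),
    ("Claude 4.5 Sonnet", 48.7),
    ("OpenAI o3", 45.6),
    ("GPT o4-mini", 36.8),
    ("Grok 4", 34.4),
    ("GPT-4.1", 33.8),
]

# Sort defensively so the code does not depend on the input order.
ranked = sorted(leaderboard, key=lambda row: row[1], reverse=True)
top_name, top_score = ranked[0]

# Gap to the leader in percentage points, rounded to one decimal place.
gaps = {name: round(top_score - score, 1) for name, score in ranked}

# Share of the 7,542 generated questions that form the 589-question test set.
TOTAL_QUESTIONS = 7542
TEST_QUESTIONS = 589
test_share_pct = round(100 * TEST_QUESTIONS / TOTAL_QUESTIONS, 1)

for rank, (name, score) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {score}% (gap to leader: {gaps[name]} pts)")
print(f"Test set share: {test_share_pct}% of generated questions")
```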
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub