Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Incident Response, Log Analysis, Threat Hunting, Forensics

SIR-Bench

Benchmark evaluating investigation depth in autonomous Security Incident Response agents, distinguishing genuine forensic investigation from alert parroting.

Quick Stats

Top Score: 94.2%
Models Evaluated: 1
Dataset Size: 794 samples
Last Updated: April 13, 2026
Availability: Dataset ✗, Code ✗
Metrics Tracked
triage f3-score, novel finding-coverage
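
The page does not define "f3-score". A minimal sketch, assuming it is the standard F-beta measure with beta = 3, which weights recall more heavily than precision (a missed real incident costs more than a false alarm). The function name `f_beta` and the confusion counts below are hypothetical illustrations, not SIR-Bench results.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """F-beta from confusion counts; beta > 1 favors recall over precision.

    Assumption: the benchmark's "f3-score" is F-beta with beta = 3.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives and no predictions at all.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical triage outcome: 90 real incidents caught, 20 false alarms
# escalated, 10 real incidents missed.
f3 = f_beta(tp=90, fp=20, fn=10)
f1 = f_beta(tp=90, fp=20, fn=10, beta=1.0)
print(f"F3 = {f3:.3f}, F1 = {f1:.3f}")
```

With these counts recall (0.90) is higher than precision (about 0.82), so F3 comes out above F1; the gap shows how much the beta = 3 weighting favors recall.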
Dataset Information

794 test cases derived from 129 anonymized incident patterns across four categories: Brute Force, Unauthorized Access, Misconfiguration, and Malicious File Execution. The set includes 475 True Positives and 319 False Positives. It uses the Once Upon A Threat (OUAT) framework to replay patterns in controlled cloud environments.
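
As a quick consistency check on the figures above, the following sketch recomputes the dataset size and class balance from the stated counts. The variable names are illustrative; only the three input numbers come from the page.

```python
# Counts as stated in the dataset description.
TRUE_POSITIVES = 475
FALSE_POSITIVES = 319
INCIDENT_PATTERNS = 129

total_cases = TRUE_POSITIVES + FALSE_POSITIVES
tp_share = TRUE_POSITIVES / total_cases
fp_share = FALSE_POSITIVES / total_cases
cases_per_pattern = total_cases / INCIDENT_PATTERNS

print(f"total cases: {total_cases}")
print(f"true-positive share: {tp_share:.1%}")
print(f"false-positive share: {fp_share:.1%}")
print(f"average cases per pattern: {cases_per_pattern:.2f}")
```

The two classes sum to the reported 794 samples, with roughly three true positives for every two false positives.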

Number of Tasks: 3 (Triage, Evidence Gathering, Finding Discovery)
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | triage f3-score | novel finding-coverage | Evaluated By | Date | Source
1st | SIR Agent (sir-agent-frontier • Amazon Web Services) | 94.2% | 41.9% | SIR-Bench authors (AWS) | April 13, 2026 | Link

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub