
Benchmark evaluating investigation depth in autonomous Security Incident Response agents, distinguishing genuine forensic investigation from alert parroting.
Top Score: 94.2%
Models Evaluated: 1
Dataset Size: 794 samples
Last Updated: April 13, 2026
Availability: 794 test cases derived from 129 anonymized incident patterns across Brute Force, Unauthorized Access, Misconfiguration, and Malicious File Execution. The set includes 475 True Positives and 319 False Positives. The benchmark uses the Once Upon A Threat (OUAT) framework to replay the patterns in controlled cloud environments.
Number of Tasks: 3
| Rank | Model | Triage F3-score | Novel finding coverage | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|
| 1st | SIR Agent (sir-agent-frontier), Amazon Web Services | 94.2% | 41.9% | SIR-Bench authors (AWS) | April 13, 2026 | Link |
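
The table does not say how the triage F3-score is computed. The sketch below assumes it is the standard F-beta measure with beta = 3, taking a real incident (True Positive case) as the positive class, so missed incidents cost far more than false alarms. The function name `f_beta` and the confusion counts are hypothetical illustrations, not results from the benchmark; only the class totals (475 and 319) come from the dataset description.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """F-beta from confusion counts; beta > 1 weights recall above precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positive labels and no positive predictions.
    return (1 + b2) * tp / denom if denom else 0.0


# Class totals taken from the dataset description.
TRUE_POSITIVE_CASES = 475
FALSE_POSITIVE_CASES = 319

# Hypothetical agent verdicts (assumed numbers, chosen only to match the totals).
tp, fn = 450, 25   # real incidents: escalated vs. wrongly dismissed
fp, tn = 60, 259   # benign alerts: wrongly escalated vs. dismissed
assert tp + fn == TRUE_POSITIVE_CASES
assert fp + tn == FALSE_POSITIVE_CASES

print(f"precision = {tp / (tp + fp):.3f}")
print(f"recall    = {tp / (tp + fn):.3f}")
print(f"F3        = {f_beta(tp, fp, fn):.3f}")
```

With beta = 3, each missed incident counts nine times as heavily as a false alarm in the denominator, so under this assumed definition the score tracks recall much more closely than precision.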