Incident Response, Log Analysis, Threat Hunting, Forensics

SIR-Bench

Benchmark evaluating investigation depth in autonomous Security Incident Response agents, distinguishing genuine forensic investigation from alert parroting.

Quick Stats

Top Score

94.2%

Models Evaluated

Dataset Size

794 samples

Last Updated

April 13, 2026

Availability

Dataset ✗, Code ✗

Metrics Tracked

triage f3-score, novel finding-coverage
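The page does not define "triage f3-score". The name suggests an F-beta score with β = 3, which weights recall more heavily than precision, but that reading is an assumption. Under it, the score can be computed from confusion counts as in this minimal sketch; the function name and the example counts are hypothetical and not taken from the benchmark.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """F-beta score from confusion counts.

    tp: incidents correctly triaged as real
    fp: benign alerts wrongly triaged as real
    fn: real incidents wrongly dismissed
    Assumption: "f3-score" means F-beta with beta = 3.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all (score undefined).
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts, for illustration only.
example = f_beta(tp=90, fp=30, fn=10)
print(f"F3 = {example:.3f}")
```

With β = 3, a missed incident (false negative) costs nine times as much in the denominator as a false alarm, which fits a triage setting where dismissing a real incident is the more expensive error.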

Dataset Information

794 test cases derived from 129 anonymized incident patterns across Brute Force, Unauthorized Access, Misconfiguration, and Malicious File Execution. The set includes 475 True Positives and 319 False Positives. The Once Upon A Threat (OUAT) framework is used to replay the patterns in controlled cloud environments.
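The counts above can be cross-checked with a few lines of arithmetic. This sketch uses only the figures stated on this page; the variable names are illustrative.

```python
# Figures as stated in the dataset description.
TRUE_POSITIVES = 475
FALSE_POSITIVES = 319
INCIDENT_PATTERNS = 129

total_cases = TRUE_POSITIVES + FALSE_POSITIVES
tp_share = TRUE_POSITIVES / total_cases
cases_per_pattern = total_cases / INCIDENT_PATTERNS

print(f"total cases: {total_cases}")                   # 794
print(f"true-positive share: {tp_share:.1%}")          # 59.8%
print(f"cases per pattern: {cases_per_pattern:.1f}")   # 6.2
```

The two classes sum to the stated 794 cases, roughly 60% of which are true positives, and each incident pattern yields about six test cases on average.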

Number of Tasks

Triage, Evidence Gathering, Finding Discovery

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	triage f3-score	novel finding-coverage	Evaluated By	Date	Source
1st	SIR Agent (sir-agent-frontier) • Amazon Web Services	94.2%	41.9%	SIR-Bench authors (AWS)	April 13, 2026	Link