Incident Response, Log Analysis, Threat Hunting, Incident Triage

SOCBench

An open benchmark for AI in cybersecurity operations. SOCBench evaluates frontier reasoning LLMs as SOC agents on raw NetFlow data, using a shared evaluation corpus, fixed budgets, and strict final-answer contracts.

Quick Stats

Top Score

84.3%

Models Evaluated

3

Dataset Size

17,371 samples

Last Updated

June 4, 2026

Availability

Dataset ✓, Code ✓

Metrics Tracked

verdict accuracy, verdict F1, flow F1, pair F1, host F1
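The page does not define how the flow, pair, and host scores are computed, so the following is only a minimal sketch of one plausible reading: the same set of flagged flows is scored with set-based F1 at three granularities. The `Flow` tuple layout, the helper names, and the rule that every endpoint of a flagged flow counts as a flagged host are all assumptions made for illustration, not SOCBench's actual scoring code.

```python
from typing import Dict, Hashable, Set, Tuple

# Hypothetical flow record: (src_ip, dst_ip, dst_port, protocol).
Flow = Tuple[str, str, int, str]


def f1(predicted: Set[Hashable], actual: Set[Hashable]) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    if not predicted and not actual:
        # Assumption: nothing to find and nothing flagged counts as perfect.
        return 1.0
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


def pairs(flows: Set[Flow]) -> Set[Tuple[str, str]]:
    """Collapse flows to (source, destination) pairs."""
    return {(src, dst) for src, dst, _, _ in flows}


def hosts(flows: Set[Flow]) -> Set[str]:
    """Collapse flows to the set of endpoints they touch (assumed rule)."""
    return {ip for src, dst, _, _ in flows for ip in (src, dst)}


def granular_f1(predicted: Set[Flow], actual: Set[Flow]) -> Dict[str, float]:
    """Score one prediction at flow, pair, and host granularity."""
    return {
        "flow": f1(predicted, actual),
        "pair": f1(pairs(predicted), pairs(actual)),
        "host": f1(hosts(predicted), hosts(actual)),
    }


# Illustrative data only (documentation IP ranges), not taken from the corpus.
truth = {
    ("10.0.0.5", "203.0.113.9", 443, "tcp"),
    ("10.0.0.5", "203.0.113.9", 8443, "tcp"),
}
flagged = {
    ("10.0.0.5", "203.0.113.9", 443, "tcp"),
    ("10.0.0.7", "198.51.100.2", 53, "udp"),
}
scores = granular_f1(flagged, truth)
```

Under these assumptions, a model that misses individual flows can still find the right pair or host, which would explain why coarser scores can exceed flow F1 for the same prediction.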

Sources

Code Dataset

Dataset Information

A labeled NetFlow corpus with 17,371 evaluation units, of which 1,205 shared units are used in the published detection comparison across four analyst personas and three providers.

Number of Tasks

NetFlow Investigation, SOC Analysis, Threat Hunting

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model (ID, provider)	Verdict accuracy	Verdict F1	Flow F1	Pair F1	Host F1	Evaluated by	Date
1st	Claude Opus 4.7 (claude-opus-4-7, Anthropic)	84.3%	88.1%	53.6%	50.8%	66.6%	SOCBench authors	June 4, 2026
2nd	Gemini 2.5 Pro (gemini-2.5-pro, Google)	78.2%	84.3%	40.6%	38.4%	58.2%	SOCBench authors	June 4, 2026
3rd	GPT-5.4 (gpt-5-4, OpenAI)	72.7%	80.3%	31.7%	29.8%	42.8%	SOCBench authors	June 4, 2026