
An open benchmark for AI in cybersecurity operations. SOCBench evaluates frontier reasoning LLMs as SOC agents on raw NetFlow data, using a shared evaluation corpus, fixed budgets, and strict final-answer contracts.
- Top Score: 84.3%
- Models Evaluated: 3
- Dataset Size: 17,371 samples
- Last Updated: June 4, 2026
- Availability: A labeled NetFlow corpus with 17,371 evaluation units, of which 1,205 shared units are used in the published detection comparison across four analyst personas and three providers.
- Number of Tasks: 3

| Rank | Model | Verdict Accuracy | Verdict F1 | Flow F1 | Pair F1 | Host F1 | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|---|---|---|
| 1st | Claude Opus 4.7 (`claude-opus-4-7`, Anthropic) | 84.3% | 88.1% | 53.6% | 50.8% | 66.6% | SOCBench authors | June 4, 2026 | Link |
| 2nd | Gemini 2.5 Pro (`gemini-2.5-pro`, Google) | 78.2% | 84.3% | 40.6% | 38.4% | 58.2% | SOCBench authors | June 4, 2026 | Link |
| 3rd | GPT-5.4 (`gpt-5-4`, OpenAI) | 72.7% | 80.3% | 31.7% | 29.8% | 42.8% | SOCBench authors | June 4, 2026 | Link |
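
The table reports F1 at three granularities (flow, pair, host) without defining them on this page. The sketch below shows one plausible reading, assuming each score is a set-based F1 between the items a model flags as malicious and the labeled ground truth, projected onto flow IDs, (source, destination) pairs, or individual hosts. The `Flow` type, its field names, and both functions are hypothetical illustrations, not SOCBench's actual scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical minimal flow record; field names are assumptions."""
    flow_id: str
    src: str
    dst: str


def set_f1(predicted: set, actual: set) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    if not predicted and not actual:
        return 1.0  # nothing to find and nothing flagged
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


def granular_f1(predicted: list[Flow], actual: list[Flow]) -> dict[str, float]:
    """Score the same prediction at flow, pair, and host granularity."""
    def project(flows: list[Flow]) -> dict[str, set]:
        return {
            "flow": {f.flow_id for f in flows},
            "pair": {(f.src, f.dst) for f in flows},
            "host": {h for f in flows for h in (f.src, f.dst)},
        }

    p, a = project(predicted), project(actual)
    return {level: set_f1(p[level], a[level]) for level in p}


if __name__ == "__main__":
    actual = [
        Flow("f1", "10.0.0.1", "10.0.0.9"),
        Flow("f2", "10.0.0.1", "10.0.0.9"),
        Flow("f3", "10.0.0.2", "10.0.0.9"),
    ]
    predicted = [
        Flow("f1", "10.0.0.1", "10.0.0.9"),
        Flow("f4", "10.0.0.2", "10.0.0.9"),
    ]
    print(granular_f1(predicted, actual))
```

Under this reading, a prediction that names the right endpoints but the wrong individual flows scores well at pair and host level and poorly at flow level, which is why the three columns can diverge for the same model.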