Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Incident Response, Log Analysis, Threat Hunting, Incident Triage

SOCBench

An open benchmark for AI in cybersecurity operations. SOCBench evaluates frontier reasoning LLMs as SOC agents on raw NetFlow data, using a shared evaluation corpus, fixed budgets, and strict final-answer contracts.

Quick Stats

Top Score

84.3%

Models Evaluated

3

Dataset Size

17,371 samples

Last Updated

June 4, 2026

Availability

Dataset ✓, Code ✓
Metrics Tracked
verdict accuracy, verdict F1, flow F1, pair F1, host F1
Sources
Code, Dataset
Dataset Information

A labeled NetFlow corpus with 17,371 evaluation units, of which 1,205 shared units are used in the published detection comparison across four analyst personas and three providers.

Number of Tasks

3

NetFlow Investigation, SOC Analysis, Threat Hunting
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
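| Rank | Model | Verdict accuracy | Verdict F1 | Flow F1 | Pair F1 | Host F1 | Evaluated By | Date | Source |
|------|-------|------------------|------------|---------|---------|---------|--------------|------|--------|
| 1st | Claude Opus 4.7 (claude-opus-4-7, Anthropic) | 84.3% | 88.1% | 53.6% | 50.8% | 66.6% | SOCBench authors | June 4, 2026 | Link |
| 2nd | Gemini 2.5 Pro (gemini-2.5-pro, Google) | 78.2% | 84.3% | 40.6% | 38.4% | 58.2% | SOCBench authors | June 4, 2026 | Link |
| 3rd | GPT-5.4 (gpt-5-4, OpenAI) | 72.7% | 80.3% | 31.7% | 29.8% | 42.8% | SOCBench authors | June 4, 2026 | Link |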
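
The flow, pair, and host F1 columns suggest scoring at different granularities (individual flows, source/destination pairs, and hosts). This page does not define how SOCBench computes them, so the following is only a minimal sketch of one common approach: treating the model's answer and the ground truth as sets of identifiers and taking the F1 of their overlap. The function name `set_f1`, the empty-set convention, and the example IP addresses are all hypothetical and are not taken from the SOCBench code.

```python
from typing import Hashable, Iterable


def set_f1(predicted: Iterable[Hashable], actual: Iterable[Hashable]) -> float:
    """F1 between a predicted set and a ground-truth set of identifiers."""
    pred, truth = set(predicted), set(actual)
    if not pred and not truth:
        # Assumption: nothing to find and nothing flagged counts as a perfect score.
        return 1.0
    true_pos = len(pred & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)   # share of flagged items that are correct
    recall = true_pos / len(truth)     # share of true items that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical host-level example for a single evaluation unit.
predicted_hosts = {"10.0.0.5", "10.0.0.9", "10.0.0.12"}
malicious_hosts = {"10.0.0.5", "10.0.0.9", "10.0.0.20", "10.0.0.21"}
host_f1 = set_f1(predicted_hosts, malicious_hosts)
```

Under this sketch, the same function would apply at any granularity by changing what the set elements are, for example host addresses, (source, destination) tuples, or flow identifiers. How per-unit scores are aggregated into the percentages reported above is likewise not stated here.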
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

Benchmarks, Contact

© 2026 Cyber LLM Benchmark Hub