
The definitive source for cybersecurity LLM performance.
Compare models across
Comprehensive evaluation across 10 cybersecurity domains
1 benchmark
3 benchmarks
5 benchmarks
2 benchmarks
2 benchmarks
9 benchmarks
6 benchmarks
4 benchmarks
1 benchmark
2 benchmarks
Latest cybersecurity LLM evaluation datasets
An open benchmark for AI in cybersecurity operations. SOCBench benchmarks frontier reasoning LLMs as SOC agents on raw NetFlow data with a shared evaluation corpus, fixed budgets, and strict final-answer contracts.
Agentic Threat Hunting Evaluation for LLMs in SecOps: measures how well LLM agents perform the core SOC analyst task of threat hunting on raw Windows event logs.
Measures how far AI agents climb the exploitation ladder on the production V8 JavaScript engine, from reaching vulnerable code to achieving arbitrary code execution.
Large-scale benchmark measuring whether AI agents can turn real-world security vulnerabilities into working exploits across userspace programs, the V8 JavaScript engine, and the Linux kernel.
Benchmark evaluating investigation depth in autonomous Security Incident Response agents, distinguishing genuine forensic investigation from alert parroting.
A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
Help build the most comprehensive cybersecurity LLM benchmark database. Submit your evaluation results or support the project.