
The definitive source for cybersecurity LLM performance.
Compare models across
Comprehensive evaluation across 10 cybersecurity domains
1 benchmark
3 benchmarks
1 benchmark
3 benchmarks
2 benchmarks
7 benchmarks
6 benchmarks
4 benchmarks
1 benchmark
2 benchmarks
Latest cybersecurity LLM evaluation datasets
Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts
Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.
Benchmarking LLM offensive security capabilities in an open-environment attacking simulation with realistic reconnaissance, target selection, and exploitation.
A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.
An exploit-based framework for assessing whether LLM-generated patches truly remediate real vulnerabilities rather than only looking plausible in text.
A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems
A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Evaluating Large Language Models for Offensive Cyber Operation Capabilities
Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
A Comprehensive Evaluation Framework and Benchmarks for LLMs in Security Vulnerability Identification and Reasoning
A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Help build the most comprehensive cybersecurity LLM benchmark database. Submit your evaluation results or support the project.