
The definitive source for cybersecurity LLM performance.
Compare models across cybersecurity benchmarks and domains
Comprehensive evaluation across 10 cybersecurity domains
Benchmarks per domain: 1 · 2 · 1 · 3 · 2 · 5 · 6 · 4 · 1 · 2
Latest cybersecurity LLM evaluation datasets
Benchmarking LLM Capabilities for Malware Analysis and Threat Intelligence Reasoning
Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts
Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.
A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems
A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Evaluating Large Language Models for Offensive Cyber Operation Capabilities
Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
A Comprehensive Evaluation Framework and Benchmarks for LLMs in Security Vulnerability Identification and Reasoning
A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Help build the most comprehensive database of cybersecurity LLM benchmarks. Submit your evaluation results or support the project.