
Discover 35 LLM benchmarks across cybersecurity domains
An open benchmark for AI in cybersecurity operations. SOCBench benchmarks frontier reasoning LLMs as SOC agents on raw NetFlow data with a shared evaluation corpus, fixed budgets, and strict final-answer contracts.
Large-scale benchmark measuring whether AI agents can turn real-world security vulnerabilities into working exploits across userspace programs, the V8 JavaScript engine, and the Linux kernel.
Measures how far AI agents climb the exploitation ladder on the production V8 JavaScript engine, from reaching vulnerable code to achieving arbitrary code execution.
Agentic Threat Hunting Evaluation for LLMs in SecOps — measures how well LLM agents perform the core SOC analyst task of threat hunting on raw Windows event logs
Benchmark evaluating investigation depth in autonomous Security Incident Response agents, distinguishing genuine forensic investigation from alert parroting.
A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.
Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.
Benchmarking LLM offensive security capabilities in an open-environment attacking simulation with realistic reconnaissance, target selection, and exploitation.
A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence with six specialized CTI tasks: Knowledge Testing (CKT), Technique Extraction (ATE), Report Matching (RCM), Report Summarization (RMS), Threat Attribution (TAA), and Vulnerability Prediction (VSP)
Cybersecurity AI Benchmark - A Meta-Benchmark for Evaluating Cybersecurity AI Agents
Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence
Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
An exploit-based framework for assessing whether LLM-generated patches truly remediate real vulnerabilities rather than only looking plausible in text.
First benchmark to evaluate LLM agents on cyber threat investigation using security question-answering derived from real-world investigation graphs.
Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Evaluating Large Language Models for Offensive Cyber Operation Capabilities
Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models
A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity
A Comprehensive Large Language Model Benchmark for CyberSecurity
Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems
A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
Security Extraction, Understanding & Reasoning Evaluation - Benchmarking LLMs for Cybersecurity
Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence
A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge
A Multi-Task Benchmark for Evaluating Large Language Models in Cybersecurity
A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security
A Comprehensive Evaluation Framework and Benchmarks for LLMs in Security Vulnerability Identification and Reasoning
A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models