Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support

Cybersecurity Benchmarks

Discover 35 LLM benchmarks across cybersecurity domains


SOCBench
Incident Response

An open benchmark for AI in cybersecurity operations. SOCBench benchmarks frontier reasoning LLMs as SOC agents on raw NetFlow data with a shared evaluation corpus, fixed budgets, and strict final-answer contracts.

3 models · 17,371 samples
Top Performer: Claude Opus 4.7 (84.3%)
June 4, 2026 · Website

ExploitGym
Vulnerability Analysis

Large-scale benchmark measuring whether AI agents can turn real-world security vulnerabilities into working exploits across userspace programs, the V8 browser engine, and the Linux kernel.

7 models · 898 samples
Top Performer: Claude Mythos Preview (17.5%)
May 19, 2026 · Paper

ExploitBench
Vulnerability Analysis

Measures how far AI agents climb the exploitation ladder on the production V8 JavaScript engine, from reaching vulnerable code to achieving arbitrary code execution.

7 models · 41 samples
Top Performer: Claude Mythos Preview (69.0%)
May 1, 2026 · Website

Cyber Defense Benchmark
Incident Response

Agentic Threat Hunting Evaluation for LLMs in SecOps — measures how well LLM agents perform the core SOC analyst task of threat hunting on raw Windows event logs

5 models · 106 samples
Top Performer: Claude Opus 4.6 (3.8%)
April 28, 2026 · Paper

SIR-Bench
Incident Response

Benchmark evaluating investigation depth in autonomous Security Incident Response agents, distinguishing genuine forensic investigation from alert parroting.

1 model · 794 samples
Top Performer: SIR Agent (94.2%)
April 13, 2026 · Paper

ZeroDayBench
Vulnerability Analysis

A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.

3 models · 22 samples
Top Performer: Claude 4.5 Sonnet (56.0%)
March 2, 2026 · Paper

SANDBOXESCAPEBENCH
LLM Safety

Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.

9 models · 18 samples
Top Performer: GPT-5 (49.7%)
March 1, 2026 · Paper

CyberExplorer
Penetration Testing

Benchmarking LLM offensive security capabilities in an open-environment attacking simulation with realistic reconnaissance, target selection, and exploitation.

5 models · 40 samples
Top Performer: Qwen 3 (42.5%)
February 8, 2026 · Paper

AthenaBench
Threat Intelligence

A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence with six specialized CTI tasks: Knowledge Testing (CKT), Technique Extraction (ATE), Report Matching (RCM), Report Summarization (RMS), Threat Attribution (TAA), and Vulnerability Prediction (VSP)

5 models · 3,000 samples
Top Performer: GPT-5 (92.0%)
November 3, 2025 · Paper

CAIBench
Comprehensive Security

Cybersecurity AI Benchmark - A Meta-Benchmark for Evaluating Cybersecurity AI Agents

0 models · 10,000 samples
October 28, 2025 · Paper

CTIArena
Threat Intelligence

Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence

5 models · 691 samples
Top Performer: Qwen 3 (71.2%)
October 13, 2025 · Paper

CyberSOCEval
Malware Analysis

Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

6 models · 1,197 samples
Top Performer: Claude 3.7 Sonnet (33.5%)
September 24, 2025 · Paper

VulnRepairEval
Vulnerability Analysis

An exploit-based framework for assessing whether LLM-generated patches truly remediate real vulnerabilities rather than only looking plausible in text.

12 models · 23 samples
Top Performer: Gemini 2.5 Pro (21.7%)
September 3, 2025 · Paper

ExCyTIn-Bench
Incident Response

First benchmark to evaluate LLM agents on cyber threat investigation using security question-answering derived from real-world investigation graphs.

7 models · 7,542 samples
Top Performer: Claude 4.5 Opus (60.6%)
July 14, 2025 · Paper

SEC-bench
Vulnerability Analysis

Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

3 models · 200 samples
Top Performer: OpenHands + Claude 3.7 Sonnet (18.0%)
June 13, 2025 · Paper

CyberGym
Vulnerability Analysis

Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

9 models · 1,507 samples
Top Performer: OpenHands + Claude Sonnet 4 (17.8%)
June 3, 2025 · Paper

DFIR-Metric
Incident Response

A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

6 models · 1,350 samples
Top Performer: GPT-4o (93.0%)
May 26, 2025 · Paper

BountyBench
Vulnerability Analysis

Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

5 models · 40 samples
Top Performer: Claude Code (5.0%)
May 21, 2025 · Paper

CVE-Bench
Vulnerability Analysis

A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

3 models · 40 samples
Top Performer: T-Agent + GPT-4o (8.0%)
March 21, 2025 · Paper

OCCULT
Penetration Testing

Evaluating Large Language Models for Offensive Cyber Operation Capabilities

5 models · 180 samples
Top Performer: DeepSeek-R1 (91.8%)
February 18, 2025 · Paper

CySecBench
LLM Safety

Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models

0 models · 12,662 samples
January 2, 2025 · Paper

TM-Bench
Threat Modeling

A Benchmark for LLM-Based Threat Modeling

0 models
January 1, 2025 · Website

SecBench
Security Knowledge

A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

5 models · 47,910 samples
Top Performer: Hunyuan Turbo (94.3%)
December 30, 2024 · Paper

CS-Eval
Security Knowledge

A Comprehensive Large Language Model Benchmark for CyberSecurity

5 models · 4,369 samples
Top Performer: GPT-4 (87.6%)
November 25, 2024 · Paper

AutoPenBench
Penetration Testing

Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems

2 models · 33 samples
Top Performer: AutoPenBench Assisted Agent (64.0%)
October 4, 2024 · Paper

Cybench
CTF Challenges

A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

8 models · 40 samples
Top Performer: Claude 3.5 (17.5%)
August 15, 2024 · Paper

CTIBench
Threat Intelligence

A Benchmark for Evaluating LLMs in Cyber Threat Intelligence

5 models · 2,500 samples
Top Performer: GPT-4 (71.0%)
June 11, 2024 · Paper

NYU CTF Bench
CTF Challenges

A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

3 models · 200 samples
Top Performer: Claude 3.5 (13.5%)
June 8, 2024 · Paper

SECURE
Security Knowledge

Security Extraction, Understanding & Reasoning Evaluation - Benchmarking LLMs for Cybersecurity

6 models · 6 samples
Top Performer: GPT-4 (88.6%)
May 30, 2024 · Paper

SEvenLLM
Threat Intelligence

Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence

0 models · 1,300 samples
May 6, 2024 · Paper

CyberMetric
Security Knowledge

A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

6 models · 10,000 samples
Top Performer: GPT-4o (91.3%)
February 12, 2024 · Paper

CyberBench (JPMorgan)
Comprehensive Security

A Multi-Task Benchmark for Evaluating Large Language Models in Cybersecurity

3 models
Top Performer: GPT-4 (69.9%)
January 1, 2024 · Paper

SecQA
Security Knowledge

A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security

8 models · 242 samples
Top Performer: GPT-3.5 (99.1%)
December 26, 2023 · Paper

SecLLMHolmes
Vulnerability Analysis

A Comprehensive Evaluation Framework and Benchmarks for LLMs in Security Vulnerability Identification and Reasoning

0 models · 228 samples
December 19, 2023 · Paper

SecEval
Security Knowledge

A Comprehensive Benchmark for Evaluating Cybersecurity Knowledge of Foundation Models

0 models · 2,126 samples
December 1, 2023 · Website