30 Benchmarks • 10 Categories

Cyber LLM Benchmark Hub

The definitive source for cybersecurity LLM performance.
Compare models across

Benchmark Categories

Comprehensive evaluation across 10 cybersecurity domains

Malware Analysis: 1 benchmark
Penetration Testing: 3 benchmarks
Incident Response: 1 benchmark
Comprehensive Security: 3 benchmarks
CTF Challenges: 2 benchmarks
Vulnerability Analysis: 7 benchmarks
Security Knowledge: 6 benchmarks
Threat Intelligence: 4 benchmarks
Threat Modeling: 1 benchmark
LLM Safety & Jailbreaking: 2 benchmarks

Featured Benchmarks

Latest cybersecurity LLM evaluation datasets

CyberSOCEval

Malware Analysis

7 models

Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

Top Performer: GPT-4 (84.7%)

ExCyTIn-Bench

Comprehensive Security

7 models

Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts

Top Performer: GPT-4 (89.2%)

SANDBOXESCAPEBENCH

LLM Safety

9 models

Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.

Top Performer: GPT-5 (49.7%)

CyberExplorer

Penetration Testing

5 models

Benchmarking LLM offensive security capabilities in an open-environment attacking simulation with realistic reconnaissance, target selection, and exploitation.

Top Performer: Qwen 3 (42.5%)

ZeroDayBench

Vulnerability Analysis

3 models

A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.

Top Performer: Claude 4.5 Sonnet (56.0%)

VulnRepairEval

Vulnerability Analysis

12 models

An exploit-based framework for assessing whether LLM-generated patches truly remediate real vulnerabilities rather than only looking plausible in text.

Top Performer: Gemini 2.5 Pro (21.7%)

Recently Added

DFIR-Metric

A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

AutoPenBench

Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems

CVE-Bench

A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

OCCULT

Evaluating Large Language Models for Offensive Cyber Operation Capabilities

BountyBench

Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

SEC-bench

Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

SecLLMHolmes

A Comprehensive Evaluation Framework and Benchmarks for LLMs in Security Vulnerability Identification and Reasoning

Cybench

A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Contribute to the Community

Help build the most comprehensive cybersecurity LLM benchmark database. Submit your evaluation results or support the project.


