
The definitive source for cybersecurity LLM performance.
Compare models across cybersecurity benchmarks and domains
Comprehensive evaluation across 10 cybersecurity domains
Benchmarks per domain: 1 · 2 · 1 · 3 · 2 · 5 · 6 · 4 · 1 · 2
Latest cybersecurity LLM evaluation datasets
Benchmarking LLM Capabilities for Malware Analysis and Threat Intelligence Reasoning
Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts
Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.
A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems
A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Evaluating Large Language Models for Offensive Cyber Operation Capabilities
Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
A Comprehensive Evaluation Framework and Benchmarks for LLMs in Security Vulnerability Identification and Reasoning
A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Help build the most comprehensive database of cybersecurity LLM benchmarks. Submit your evaluation results or support the project.