Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
30 Benchmarks • 10 Categories

Cyber LLM Benchmark Hub

The definitive source for cybersecurity LLM performance. Compare models across 30 benchmarks in 10 categories.

Explore Benchmarks • View Leaderboards
30 Benchmarks • 25 Models • 10 Categories • 43 Results

Benchmark Categories

Comprehensive evaluation across 10 cybersecurity domains; per-category counts are listed below, with a quick consistency check after the list.

  • Malware Analysis (1 benchmark)
  • Penetration Testing (3 benchmarks)
  • Incident Response (1 benchmark)
  • Comprehensive Security (3 benchmarks)
  • CTF Challenges (2 benchmarks)
  • Vulnerability Analysis (7 benchmarks)
  • Security Knowledge (6 benchmarks)
  • Threat Intelligence (4 benchmarks)
  • Threat Modeling (1 benchmark)
  • LLM Safety & Jailbreaking (2 benchmarks)
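
As a quick arithmetic check, the per-category counts above add up to the hub's headline stats. A minimal sketch in Python, with category names and counts copied verbatim from the list above:

```python
# Per-category benchmark counts, copied from the list above.
category_counts = {
    "Malware Analysis": 1,
    "Penetration Testing": 3,
    "Incident Response": 1,
    "Comprehensive Security": 3,
    "CTF Challenges": 2,
    "Vulnerability Analysis": 7,
    "Security Knowledge": 6,
    "Threat Intelligence": 4,
    "Threat Modeling": 1,
    "LLM Safety & Jailbreaking": 2,
}

# The counts should match the hub's headline stats: 30 benchmarks, 10 categories.
assert sum(category_counts.values()) == 30
assert len(category_counts) == 10
print(f"{sum(category_counts.values())} benchmarks across {len(category_counts)} categories")
```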

Featured Benchmarks

Latest cybersecurity LLM evaluation datasets; a small comparison sketch follows the featured cards.

View All
CyberSOCEval (Malware Analysis • 7 models)
Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
Top Performer: GPT-4 (84.7%)

ExCyTIn-Bench (Comprehensive Security • 7 models)
Microsoft's benchmark for measuring AI capabilities in cybersecurity contexts
Top Performer: GPT-4 (89.2%)

SANDBOXESCAPEBENCH (LLM Safety & Jailbreaking • 9 models)
Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.
Top Performer: GPT-5 (49.7%)

CyberExplorer (Penetration Testing • 5 models)
Benchmarking LLM offensive security capabilities in an open-environment attack simulation with realistic reconnaissance, target selection, and exploitation.
Top Performer: Qwen 3 (42.5%)

ZeroDayBench (Vulnerability Analysis • 3 models)
A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.
Top Performer: Claude 4.5 Sonnet (56.0%)

VulnRepairEval (Vulnerability Analysis • 12 models)
An exploit-based framework for assessing whether LLM-generated patches truly remediate real vulnerabilities rather than only looking plausible in text.
Top Performer: Gemini 2.5 Pro (21.7%)
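
To illustrate what the Leaderboards and Compare views surface, here is a minimal sketch that picks the top performer per benchmark from a flat result list. The scores are copied from the featured cards above; the grouping logic is an illustration, not the hub's actual implementation:

```python
# (benchmark, model, score-in-percent) tuples, taken from the featured cards above.
results = [
    ("CyberSOCEval", "GPT-4", 84.7),
    ("ExCyTIn-Bench", "GPT-4", 89.2),
    ("SANDBOXESCAPEBENCH", "GPT-5", 49.7),
    ("CyberExplorer", "Qwen 3", 42.5),
    ("ZeroDayBench", "Claude 4.5 Sonnet", 56.0),
    ("VulnRepairEval", "Gemini 2.5 Pro", 21.7),
]

# Keep the highest-scoring model per benchmark, mirroring the "Top Performer" badge.
top_performers: dict[str, tuple[str, float]] = {}
for benchmark, model, score in results:
    if benchmark not in top_performers or score > top_performers[benchmark][1]:
        top_performers[benchmark] = (model, score)

for benchmark, (model, score) in sorted(top_performers.items()):
    print(f"{benchmark}: {model} ({score}%)")
```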

Recently Added

  • DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
  • AutoPenBench: Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems
  • CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
  • OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities
  • BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
  • SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
  • SecLLMHolmes: A Comprehensive Evaluation Framework and Benchmarks for LLMs in Security Vulnerability Identification and Reasoning
  • Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Contribute to the Community

Help build the most comprehensive cybersecurity LLM benchmark database. Submit your evaluation results or support the project.

Submit Results • Support the Project
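
For contributors, a hypothetical sketch of what a single result submission might contain. Every field name below is an assumption for illustration only; the Submit page defines the actual format:

```python
import json

# Hypothetical submission record; the field names are assumptions,
# not the hub's confirmed schema.
submission = {
    "benchmark": "CyberSOCEval",       # benchmark name as listed on the hub
    "model": "GPT-4",                  # model identifier as shown on leaderboards
    "category": "Malware Analysis",    # one of the hub's 10 categories
    "score": 84.7,                     # top-line score, in percent
    "source": "https://example.com/eval-writeup",  # placeholder link to the write-up
}

print(json.dumps(submission, indent=2))
```
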
Cyber LLM Hub

© 2026 Cyber LLM Benchmark Hub. Built with ❤️ for the cybersecurity community.