Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
35 Benchmarks • 10 Categories

Cyber LLM Benchmark Hub

The definitive source for cybersecurity LLM performance.
Compare models across

Explore Benchmarks
35 Benchmarks
78 Models
10 Categories
157 Results

Benchmark Categories

Comprehensive evaluation across 10 cybersecurity domains

Malware Analysis

1 benchmark

Penetration Testing

3 benchmarks

Incident Response

5 benchmarks

Comprehensive Security

2 benchmarks

CTF Challenges

2 benchmarks

Vulnerability Analysis

9 benchmarks

Security Knowledge

6 benchmarks

Threat Intelligence

4 benchmarks

Threat Modeling

1 benchmark

LLM Safety & Jailbreaking

2 benchmarks

Featured Benchmarks

Latest cybersecurity LLM evaluation datasets

View All
SOCBench
Incident Response
3 models

An open benchmark for AI in cybersecurity operations. SOCBench benchmarks frontier reasoning LLMs as SOC agents on raw NetFlow data with a shared evaluation corpus, fixed budgets, and strict final-answer contracts.

Top Performer: 84.3%
Claude Opus 4.7
View Details
Cyber Defense Benchmark
Incident Response
5 models

Agentic Threat Hunting Evaluation for LLMs in SecOps. Measures how well LLM agents perform the core SOC analyst task of threat hunting on raw Windows event logs.

Top Performer: 3.8%
Claude Opus 4.6
View Details
ExploitBench
Vulnerability Analysis
7 models

Measures how far AI agents climb the exploitation ladder on the production V8 JavaScript engine, from reaching vulnerable code to achieving arbitrary code execution.

Top Performer: 69.0%
Claude Mythos Preview
View Details
ExploitGym
Vulnerability Analysis
7 models

Large-scale benchmark measuring whether AI agents can turn real-world security vulnerabilities into working exploits across userspace programs, the V8 browser engine, and the Linux kernel.

Top Performer: 17.5%
Claude Mythos Preview
View Details
SIR-Bench
Incident Response
1 model

Benchmark evaluating investigation depth in autonomous Security Incident Response agents, distinguishing genuine forensic investigation from alert parroting.

Top Performer: 94.2%
SIR Agent
View Details
DFIR-Metric
Incident Response
6 models

A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Top Performer: 93.0%
GPT-4o
View Details

Contribute to the Community

Help build the most comprehensive cybersecurity LLM benchmark database. Submit your evaluation results or support the project.

Support the Project
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

  • Benchmarks
  • Contact

© 2026 Cyber LLM Benchmark Hub