Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis • CVE Exploitation • PoC Generation • Web Exploitation

CVE-Bench

A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Quick Stats

Top Score

8.0%

Models Evaluated

3

Dataset Size

40 samples

Last Updated

March 21, 2025

Availability

Dataset ✓ • Code ✓
Metrics Tracked
zero day-pass-at-1 • zero day-pass-at-5 • one day-pass-at-1 • one day-pass-at-5
Sources
Project • Leaderboard
Dataset Information

40 critical-severity CVEs from the National Vulnerability Database (CVSS base score ≥ 9.0) covering 10 web-application types. Each task is delivered via a sandbox of containers with a reference exploit and 8 standardized attack vectors (DoS, file access/creation, database modification/access, unauthorized admin login, privilege escalation, outbound service).

Number of Tasks

3

CVE Exploitation • Web App Attacks • Sandbox Exploitation
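
The tracked metrics are pass-at-1 and pass-at-5 rates in the zero-day and one-day settings. This page does not say how they are computed, so the following is only a minimal sketch under one common convention: a task counts as passed at k if at least one of its first k attempts succeeds. The function name `pass_at_k`, the task identifiers, and the attempt outcomes are all hypothetical and are not benchmark data.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts.

    Assumed convention for illustration; the benchmark's own definition may differ.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("no tasks given")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes: 40 tasks, five attempts each, all failing by default.
outcomes = {f"task-{i:02d}": [False] * 5 for i in range(40)}
outcomes["task-00"] = [True, False, False, False, False]   # solved on attempt 1
outcomes["task-01"] = [False, False, True, False, False]   # solved on attempt 3

print(f"pass@1 = {pass_at_k(outcomes, 1):.1%}")
print(f"pass@5 = {pass_at_k(outcomes, 5):.1%}")
```

With 40 tasks, each solved task contributes 2.5 percentage points under this convention, so a reported 12.5% would correspond to 5 of 40 tasks.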
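Model Results
Detailed scores for each model evaluated on this benchmark

| Rank | Model | Model ID • Provider | zero day-pass-at-1 | zero day-pass-at-5 | one day-pass-at-1 | one day-pass-at-5 | Evaluated By | Date | Source |
|------|-------|---------------------|--------------------|--------------------|-------------------|-------------------|--------------|------|--------|
| 1st | T-Agent + GPT-4o | t-agent-gpt-4o-2024-11-20 • OpenAI | 8.0% | 10.0% | 7.0% | 12.5% | Frontier AI Cybersecurity Observatory | June 18, 2025 | Link |
| 2nd | AutoGPT + GPT-4o | autogpt-gpt-4o-2024-11-20 • OpenAI | 3.0% | 10.0% | 4.5% | 5.0% | Frontier AI Cybersecurity Observatory | June 18, 2025 | Link |
| 3rd | Cy-Agent + GPT-4o | cy-agent-gpt-4o-2024-11-20 • OpenAI | 1.0% | 2.5% | 2.5% | 2.5% | Frontier AI Cybersecurity Observatory | June 18, 2025 | Link |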
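
For readers who want to re-sort the results by a different metric, here is a small sketch that stores the scores listed above and ranks them by any of the four columns. The `Result` class, its field names, and the `rank_by` helper are illustrative assumptions, not part of the benchmark's code; only the model names and percentages come from this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    model: str
    zero_day_pass_at_1: float  # percent
    zero_day_pass_at_5: float  # percent
    one_day_pass_at_1: float   # percent
    one_day_pass_at_5: float   # percent


# Scores copied from the results listed on this page.
RESULTS = [
    Result("T-Agent + GPT-4o", 8.0, 10.0, 7.0, 12.5),
    Result("AutoGPT + GPT-4o", 3.0, 10.0, 4.5, 5.0),
    Result("Cy-Agent + GPT-4o", 1.0, 2.5, 2.5, 2.5),
]


def rank_by(results: List[Result], metric: str) -> List[Result]:
    """Return results sorted from highest to lowest on the named metric."""
    return sorted(results, key=lambda r: getattr(r, metric), reverse=True)


for position, row in enumerate(rank_by(RESULTS, "one_day_pass_at_5"), start=1):
    print(position, row.model, f"{row.one_day_pass_at_5}%")
```

Python's `sorted` is stable, so models tied on a metric (for example the two 10.0% entries on zero day-pass-at-5) keep their listed order.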
Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub