Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis • Patch Validation • Vulnerability Reasoning • Zero Day Discovery

ZeroDayBench

A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.

View Paper
Quick Stats

Top Score

56.0%

Models Evaluated

3

Dataset Size

22 samples

Last Updated

March 2, 2026

Availability

Dataset ✗ • Code ✗
Metrics Tracked
overall pass-rate • zero-day pass-rate • CWE pass-rate • post-exploit pass-rate • one-day pass-rate • full-info pass-rate
Dataset Information

22 novel critical vulnerabilities ported from real CVEs into different production open-source repositories, evaluated across five information levels from zero-day discovery to fully guided remediation.

Number of Tasks

5

Zero Day • CWE • Post Exploit • One Day • Full Info
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model              Model ID           Vendor     Overall  Zero Day  CWE    Post Exploit  One Day  Full Info
1st   Claude 4.5 Sonnet  claude-sonnet-4-5  Anthropic  56.0%    12.8%     32.9%  60.7%         78.0%    95.7%
2nd   GPT-5.2            gpt-5.2            OpenAI     48.2%    14.4%     32.9%  43.0%         74.6%    76.2%
3rd   Grok 4.1 Fast      grok-4.1-fast      xAI        34.0%    12.1%     18.0%  36.6%         44.7%    58.8%

All values are pass-rates. Evaluated by: ZeroDayBench authors (all models) • Date: February 1, 2026 • Source: Link
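
The page does not say how the overall pass-rate is aggregated. The listed figures are consistent with an unweighted mean of the five per-level pass-rates, and the sketch below checks that reading. The equal weighting is an assumption inferred from the numbers, not a documented method, and the names `LEVEL_SCORES` and `mean_pass_rate` are made up for this illustration.

```python
from statistics import mean

# Per-level pass-rates (%) copied from the results listed above, in the order:
# zero-day, CWE, post-exploit, one-day, full-info.
LEVEL_SCORES = {
    "claude-sonnet-4-5": [12.8, 32.9, 60.7, 78.0, 95.7],
    "gpt-5.2": [14.4, 32.9, 43.0, 74.6, 76.2],
    "grok-4.1-fast": [12.1, 18.0, 36.6, 44.7, 58.8],
}


def mean_pass_rate(levels):
    """Unweighted mean across information levels, rounded to one decimal.

    Assumption: every level carries equal weight in the overall score.
    """
    return round(mean(levels), 1)


for model, levels in LEVEL_SCORES.items():
    print(f"{model}: {mean_pass_rate(levels):.1f}%")
# Prints:
# claude-sonnet-4-5: 56.0%
# gpt-5.2: 48.2%
# grok-4.1-fast: 34.0%
```

The three results match the overall column, so the ranking appears to weight the zero-day setting no more heavily than the fully guided one, even though the models are closest together at the zero-day level (12.1% to 14.4%).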
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub