Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
vulnerability analysis, patch-validation, vulnerability-reasoning, zero-day-discovery

ZeroDayBench

A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.

Quick Stats

Top Score

56.0%

Models Evaluated

3

Dataset Size

22 samples

Last Updated

March 2, 2026

Paper Details

Title

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Authors

Nancy Lau, Louis Sloot, Jyoutir Raj

+6 more

Published

March 2, 2026

arXiv ID

2603.02297
Metrics Tracked
overall pass-rate, zero day-pass-rate, cwe pass-rate, post exploit-pass-rate, one day-pass-rate, full info-pass-rate
Availability
Dataset Available: No
Code Available: No
Dataset Information

22 novel critical vulnerabilities ported from real CVEs into different production open-source repositories, evaluated across five information levels from zero-day discovery to fully guided remediation.

Number of Tasks

zero-day, cwe, post-exploit, one-day, full-info

Dataset Size

22 samples

Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model              Model ID • Provider            overall pass-rate  zero day-pass-rate  cwe pass-rate  post exploit-pass-rate  one day-pass-rate  full info-pass-rate  Evaluated By          Date
1st   Claude 4.5 Sonnet  claude-sonnet-4-5 • Anthropic  56.0%              12.8%               32.9%          60.7%                   78.0%              95.7%                ZeroDayBench authors  February 1, 2026
2nd   GPT-5.2            gpt-5.2 • OpenAI               48.2%              14.4%               32.9%          43.0%                   74.6%              76.2%                ZeroDayBench authors  February 1, 2026
3rd   Grok 4.1 Fast      grok-4.1-fast • xAI            34.0%              12.1%               18.0%          36.6%                   44.7%              58.8%                ZeroDayBench authors  February 1, 2026
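
The page does not state how the overall pass-rate is derived. The reported figures are, however, consistent with an unweighted mean of the five per-level pass rates. The Python sketch below checks that reading against the table values; the averaging rule is an assumption, and the names `RESULTS`, `REPORTED_OVERALL`, and `mean_pass_rate` are illustrative only.

```python
# Per-level pass rates (percent), copied from the Model Results table.
RESULTS = {
    "Claude 4.5 Sonnet": {"zero-day": 12.8, "cwe": 32.9, "post-exploit": 60.7,
                          "one-day": 78.0, "full-info": 95.7},
    "GPT-5.2":           {"zero-day": 14.4, "cwe": 32.9, "post-exploit": 43.0,
                          "one-day": 74.6, "full-info": 76.2},
    "Grok 4.1 Fast":     {"zero-day": 12.1, "cwe": 18.0, "post-exploit": 36.6,
                          "one-day": 44.7, "full-info": 58.8},
}

# Overall pass rates (percent) as reported in the same table.
REPORTED_OVERALL = {
    "Claude 4.5 Sonnet": 56.0,
    "GPT-5.2": 48.2,
    "Grok 4.1 Fast": 34.0,
}


def mean_pass_rate(levels):
    """Unweighted mean across information levels (assumed aggregation rule)."""
    return sum(levels.values()) / len(levels)


for model, levels in RESULTS.items():
    computed = mean_pass_rate(levels)
    reported = REPORTED_OVERALL[model]
    # A gap below 0.05 means the two values agree at one decimal place.
    print(f"{model}: computed {computed:.2f}%, reported {reported:.1f}%, "
          f"gap {abs(computed - reported):.2f}")
```

If the benchmark instead weights levels by task count or attempts, the computed values could differ from the reported ones; the page's numbers alone cannot settle this.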