
A benchmark for LLM agents that must find and patch novel high-severity vulnerabilities ported into real-world open-source codebases.
- **Top Score:** 56.0%
- **Models Evaluated:** 3
- **Dataset Size:** 22 samples
- **Last Updated:** March 2, 2026
- **Title:** ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense
- **Authors:** Nancy Lau, Louis Sloot, Jyoutir Raj, +6 more
- **Published:** March 2, 2026
- **arXiv ID:** 2603.02297

22 novel critical vulnerabilities ported from real CVEs into different production open-source repositories, evaluated across five information levels, from zero-day discovery to fully guided remediation.
- **Task Types:** zero-day, cwe, post-exploit, one-day, full-info
| Rank | Model | Overall Pass Rate | Zero-Day Pass Rate | CWE Pass Rate | Post-Exploit Pass Rate | One-Day Pass Rate | Full-Info Pass Rate | Evaluated By | Date |
|---|---|---|---|---|---|---|---|---|---|
| 1st | Claude 4.5 Sonnet (`claude-sonnet-4-5`, Anthropic) | 56.0% | 12.8% | 32.9% | 60.7% | 78.0% | 95.7% | ZeroDayBench authors | February 1, 2026 |
| 2nd | GPT-5.2 (`gpt-5.2`, OpenAI) | 48.2% | 14.4% | 32.9% | 43.0% | 74.6% | 76.2% | ZeroDayBench authors | February 1, 2026 |
| 3rd | Grok 4.1 Fast (`grok-4.1-fast`, xAI) | 34.0% | 12.1% | 18.0% | 36.6% | 44.7% | 58.8% | ZeroDayBench authors | February 1, 2026 |
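The overall pass rate in the table appears to be the unweighted mean of the five per-information-level pass rates (e.g. Claude 4.5 Sonnet: (12.8 + 32.9 + 60.7 + 78.0 + 95.7) / 5 = 56.0). The equal-weighting scheme below is inferred from the reported numbers, not stated by the benchmark authors; the function name is illustrative.

```python
# Sketch: overall pass rate as the unweighted mean of the five
# per-information-level pass rates. Equal weighting is an assumption
# inferred from the leaderboard numbers, not documented by the authors.

def overall_pass_rate(level_rates: dict[str, float]) -> float:
    """Average per-level pass rates (in percent), rounded to one decimal."""
    return round(sum(level_rates.values()) / len(level_rates), 1)

claude_sonnet_4_5 = {
    "zero-day": 12.8, "cwe": 32.9, "post-exploit": 60.7,
    "one-day": 78.0, "full-info": 95.7,
}
print(overall_pass_rate(claude_sonnet_4_5))  # 56.0, matching the reported score
```

Under this assumption the hardest setting, zero-day discovery, contributes only a fifth of the overall score, which is worth keeping in mind when comparing models on the headline number.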