Vulnerability Analysis, CVE Exploitation, PoC Generation

ExploitBench

Measures how far AI agents climb the exploitation ladder on the production V8 JavaScript engine, from reaching vulnerable code to achieving arbitrary code execution.

Quick Stats

Top Score

69.0%

Models Evaluated

Dataset Size

41 samples

Last Updated

May 1, 2026

Availability

Dataset ✓, Code ✓

Metrics Tracked

ace rate, mean score

Sources

GitHub

Dataset Information

41 real V8 CVEs graded across 5 tiers (T5 coverage → T1 full arbitrary code execution) with 16 capabilities measured per run. Runs are graded against production V8 with the security sandbox enabled. All grading is deterministic (no LLM-as-judge).
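As a rough illustration of how tiered, deterministic grading can be summarized, the sketch below counts the best tier reached per run and reports the share of runs that reach T1. It assumes that "ace rate" means the fraction of runs reaching T1 (full arbitrary code execution); the page does not define the metric, and the `ace_rate` and `tier_histogram` helpers and the sample data are hypothetical, not taken from the benchmark's code.

```python
from collections import Counter

# Tier labels from the description: T5 (coverage) is the lowest rung,
# T1 (full arbitrary code execution) the highest.
TIERS = ["T5", "T4", "T3", "T2", "T1"]


def ace_rate(best_tiers):
    """Share of runs whose best tier is T1.

    ASSUMPTION: this is only a guess at what "ace rate" means;
    the benchmark page does not spell out the definition.
    """
    if not best_tiers:
        raise ValueError("no runs to grade")
    return sum(tier == "T1" for tier in best_tiers) / len(best_tiers)


def tier_histogram(best_tiers):
    """Count how many runs topped out at each tier."""
    counts = Counter(best_tiers)
    return {tier: counts.get(tier, 0) for tier in TIERS}


# Hypothetical best-tier results for four runs (not real benchmark data).
runs = ["T1", "T3", "T5", "T1"]
print(ace_rate(runs))
print(tier_histogram(runs))
```

Because the summary is a pure function of the recorded tiers, rerunning it on the same results always gives the same numbers, which is the property that deterministic grading is meant to provide.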

Number of Tasks

V8 Exploit Synthesis, Arbitrary Code Execution, Sandbox Escape

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	Model ID	Provider	ace rate	mean score	Evaluated By	Date	Source
1	Claude Mythos Preview	claude-mythos-preview-autonudge	Anthropic	69.0%	61.9%	ExploitBench authors (CMU)	May 1, 2026	Link
2	Claude Mythos Preview	claude-mythos-preview	Anthropic	68.0%	59.7%	ExploitBench authors (CMU)	May 1, 2026	Link
3	GPT-5.5	gpt-5-5-codex-autonudge	OpenAI	41.0%	34.4%	ExploitBench authors (CMU)	May 1, 2026	Link
4	GPT-5.5	gpt-5-5-autonudge	OpenAI	34.0%	27.8%	ExploitBench authors (CMU)	May 1, 2026	Link
5	GPT-5.5	gpt-5-5-codex	OpenAI	33.0%	26.9%	ExploitBench authors (CMU)	May 1, 2026	Link
6	GPT-5.5	gpt-5-5	OpenAI	29.0%	23.5%	ExploitBench authors (CMU)	May 1, 2026	Link
7	Claude Opus 4.7	claude-opus-4-7-autonudge	Anthropic	27.0%	22.9%	ExploitBench authors (CMU)	May 1, 2026	Link
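Several rows above differ only by an `-autonudge` suffix on the model ID, so the difference between paired rows can be computed directly. The sketch below copies the scores from the table and subtracts each base row from its `-autonudge` counterpart. It assumes the suffix marks the same model run under a different harness setting, which the page does not state; the `autonudge_deltas` helper is hypothetical.

```python
# (model id, ace rate %, mean score %) copied from the results table.
RESULTS = [
    ("claude-mythos-preview-autonudge", 69.0, 61.9),
    ("claude-mythos-preview", 68.0, 59.7),
    ("gpt-5-5-codex-autonudge", 41.0, 34.4),
    ("gpt-5-5-autonudge", 34.0, 27.8),
    ("gpt-5-5-codex", 33.0, 26.9),
    ("gpt-5-5", 29.0, 23.5),
    ("claude-opus-4-7-autonudge", 27.0, 22.9),
]

SUFFIX = "-autonudge"


def autonudge_deltas(results):
    """Return {base id: (ace rate delta, mean score delta)} in percentage points.

    ASSUMPTION: an id ending in "-autonudge" is the same model as the id
    without the suffix; rows with no matching base row are skipped.
    """
    by_id = {model_id: (ace, mean) for model_id, ace, mean in results}
    deltas = {}
    for model_id, (ace, mean) in by_id.items():
        if not model_id.endswith(SUFFIX):
            continue
        base = model_id[: -len(SUFFIX)]
        if base in by_id:
            base_ace, base_mean = by_id[base]
            deltas[base] = (round(ace - base_ace, 1), round(mean - base_mean, 1))
    return deltas


for base, (ace_delta, mean_delta) in autonudge_deltas(RESULTS).items():
    print(f"{base}: ace rate {ace_delta:+.1f} pts, mean score {mean_delta:+.1f} pts")
```

Claude Opus 4.7 appears only with the suffix, so it has no pair and is left out of the comparison.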