
Measures how far AI agents climb the exploitation ladder on the production V8 JavaScript engine, from reaching vulnerable code to achieving arbitrary code execution.
Top Score: 69.0%
Models Evaluated: 7
Dataset Size: 41 samples
Last Updated: May 1, 2026
Availability: 41 real V8 CVEs graded across 5 tiers (T5 coverage → T1 full arbitrary code execution) with 16 capabilities measured per run. Runs are graded against production V8 with the security sandbox enabled. All grading is deterministic (no LLM-as-judge).
Number of Tasks: 3

| Rank | Model | Variant | Developer | ACE Rate | Mean Score | Evaluated By | Date |
|---|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | `claude-mythos-preview-autonudge` | Anthropic | 69.0% | 61.9% | ExploitBench authors (CMU) | May 1, 2026 |
| 2 | Claude Mythos Preview | `claude-mythos-preview` | Anthropic | 68.0% | 59.7% | ExploitBench authors (CMU) | May 1, 2026 |
| 3 | GPT-5.5 | `gpt-5-5-codex-autonudge` | OpenAI | 41.0% | 34.4% | ExploitBench authors (CMU) | May 1, 2026 |
| 4 | GPT-5.5 | `gpt-5-5-autonudge` | OpenAI | 34.0% | 27.8% | ExploitBench authors (CMU) | May 1, 2026 |
| 5 | GPT-5.5 | `gpt-5-5-codex` | OpenAI | 33.0% | 26.9% | ExploitBench authors (CMU) | May 1, 2026 |
| 6 | GPT-5.5 | `gpt-5-5` | OpenAI | 29.0% | 23.5% | ExploitBench authors (CMU) | May 1, 2026 |
| 7 | Claude Opus 4.7 | `claude-opus-4-7-autonudge` | Anthropic | 27.0% | 22.9% | ExploitBench authors (CMU) | May 1, 2026 |
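
To make the ACE rate column concrete, here is a minimal sketch of how such a rate could be derived from per-sample tier results. It assumes the ACE rate is the fraction of graded samples whose best tier is T1; the page does not state the exact formula, how repeated runs are aggregated, or how the mean score is computed, so none of that is modeled. The names `Tier` and `ace_rate` are hypothetical, and only T5 (coverage) and T1 (arbitrary code execution) are named in the description above, so T2 through T4 are left generic.

```python
from enum import IntEnum
from typing import Optional, Sequence


class Tier(IntEnum):
    """Exploitation ladder; a lower number means a stronger result.

    Only T5 and T1 are described on the page. T2-T4 are placeholders.
    """
    T5_COVERAGE = 5
    T4 = 4
    T3 = 3
    T2 = 2
    T1_ACE = 1


def ace_rate(best_tiers: Sequence[Optional[Tier]]) -> float:
    """Return the fraction of samples whose best tier is T1.

    `None` marks a sample where no tier was reached. This definition is an
    assumption for illustration, not the benchmark's documented formula.
    """
    if not best_tiers:
        raise ValueError("at least one sample is required")
    hits = sum(1 for tier in best_tiers if tier == Tier.T1_ACE)
    return hits / len(best_tiers)


# Toy data: four samples, two of which reach full arbitrary code execution.
example = [Tier.T1_ACE, Tier.T5_COVERAGE, None, Tier.T1_ACE]
print(f"{ace_rate(example):.1%}")
```

Because grading is described as deterministic, a function of this shape would return the same value for the same set of graded runs.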