
A large-scale benchmark measuring whether AI agents can turn real-world security vulnerabilities into working exploits across userspace programs, the V8 JavaScript engine, and the Linux kernel.

| Field | Value |
|---|---|
| Top Score | 17.5% |
| Models Evaluated | 7 |
| Dataset Size | 898 samples |
| Last Updated | May 19, 2026 |
| Availability | 898 instances sourced from real-world vulnerabilities across three domains: 520 userspace programs (OSS-Fuzz/CyberGym), 185 V8 JavaScript engine bugs, and 193 Linux kernel privilege-escalation vulnerabilities. Evaluation is run both with and without standard mitigations enabled. |
| Number of Tasks | 3 |

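The domain breakdown quoted above can be sanity-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the benchmark's tooling: the dictionary keys and the `domain_shares` helper are hypothetical names chosen for illustration, and only the three counts and the 898 total come from the description.

```python
# Instance counts per domain, copied from the dataset description.
# The key names are illustrative labels, not identifiers used by the benchmark.
DOMAIN_COUNTS = {
    "userspace": 520,     # OSS-Fuzz/CyberGym programs
    "v8": 185,            # V8 JavaScript engine bugs
    "linux_kernel": 193,  # kernel privilege-escalation vulnerabilities
}
REPORTED_TOTAL = 898


def domain_shares(counts):
    """Return each domain's share of all instances, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(DOMAIN_COUNTS.values())
shares = domain_shares(DOMAIN_COUNTS)

print(total == REPORTED_TOTAL)  # True: the three domains add up to 898
print(shares)  # {'userspace': 57.9, 'v8': 20.6, 'linux_kernel': 21.5}
```

Userspace programs make up a little over half of the instances, so an aggregate success rate is weighted most heavily toward that domain.
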
| Rank | Model | Model ID | Organization | Success Rate | Evaluated By | Date |
|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | `claude-mythos-preview` | Anthropic | 17.5% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 |
| 2 | GPT-5.5 | `gpt-5-5` | OpenAI | 13.4% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 |
| 3 | GPT-5.4 | `gpt-5-4` | OpenAI | 6.0% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 |
| 4 | Claude Opus 4.6 | `claude-opus-4-6` | Anthropic | 1.7% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 |
| 5 | Claude Opus 4.7 | `claude-opus-4-7` | Anthropic | 1.1% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 |
| 6 | Gemini 3.1 Pro | `gemini-3-1-pro` | Google | 0.8% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 |
| 7 | GLM-5.1 | `glm-5-1` | Zhipu AI | 0.5% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 |