Vulnerability Analysis, CVE Exploitation, PoC Generation

ExploitGym

A large-scale benchmark measuring whether AI agents can turn real-world security vulnerabilities into working exploits across userspace programs, the V8 JavaScript engine, and the Linux kernel.

Quick Stats

Top Score

17.5%

Models Evaluated

7

Dataset Size

898 samples

Last Updated

May 19, 2026

Availability

Dataset ✓, Code ✓

Metrics Tracked

success rate, success count

Dataset Information

898 instances sourced from real-world vulnerabilities across three domains: 520 in userspace programs (OSS-Fuzz/CyberGym), 185 in the V8 JavaScript engine, and 193 Linux kernel privilege-escalation vulnerabilities. Agents are evaluated with and without standard mitigations enabled.
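
As a quick sanity check on the composition above, the following minimal sketch sums the three per-domain counts and reports each domain's share of the dataset. The dictionary keys and the `domain_shares` helper are illustrative names, not part of any ExploitGym tooling; only the three counts come from the description.

```python
# Instance counts per domain, taken from the dataset description.
# The key names are illustrative labels, not official identifiers.
DOMAIN_COUNTS = {
    "userspace": 520,
    "v8": 185,
    "linux_kernel": 193,
}


def domain_shares(counts):
    """Return each domain's fraction of the total instance count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(DOMAIN_COUNTS.values())
shares = domain_shares(DOMAIN_COUNTS)

for name, share in shares.items():
    print(f"{name}: {DOMAIN_COUNTS[name]} instances ({share:.1%})")
print(f"total: {total}")
# userspace: 520 instances (57.9%)
# v8: 185 instances (20.6%)
# linux_kernel: 193 instances (21.5%)
# total: 898
```

The counts add up to the stated 898 samples, with userspace programs making up a little over half of the benchmark.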

Number of Tasks

Vulnerability Exploitation, Exploit Generation, Privilege Escalation

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	Success rate	Success count	Evaluated by	Date	Source
1	Claude Mythos Preview (claude-mythos-preview, Anthropic)	17.5%	17.5%	ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich)	May 19, 2026	Link
2	GPT-5.5 (gpt-5-5, OpenAI)	13.4%	13.4%	ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich)	May 19, 2026	Link
3	GPT-5.4 (gpt-5-4, OpenAI)	6.0%	6.0%	ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich)	May 19, 2026	Link
4	Claude Opus 4.6 (claude-opus-4-6, Anthropic)	1.7%	1.7%	ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich)	May 19, 2026	Link
5	Claude Opus 4.7 (claude-opus-4-7, Anthropic)	1.1%	1.1%	ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich)	May 19, 2026	Link
6	Gemini 3.1 Pro (gemini-3-1-pro, Google)	0.8%	0.8%	ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich)	May 19, 2026	Link
7	GLM-5.1 (glm-5-1, Zhipu AI)	0.5%	0.5%	ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich)	May 19, 2026	Link