Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis, CVE Exploitation, PoC Generation

ExploitGym

Large-scale benchmark measuring whether AI agents can turn real-world security vulnerabilities into working exploits across userspace programs, the V8 JavaScript engine, and the Linux kernel.

View Paper
Quick Stats

Top Score

17.5%

Models Evaluated

7

Dataset Size

898 samples

Last Updated

May 19, 2026

Availability

Dataset ✓, Code ✓
Metrics Tracked
success rate, success count
Dataset Information

898 instances sourced from real-world vulnerabilities across three domains: 520 userspace programs (OSS-Fuzz/CyberGym), 185 V8 JavaScript engine bugs, and 193 Linux kernel privilege-escalation vulnerabilities. Models are evaluated with and without standard mitigations enabled.
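
As a quick arithmetic check on the composition described above, the following minimal sketch sums the three domain counts and computes each domain's share of the dataset. The dictionary keys and the `domain_shares` helper are hypothetical names chosen for illustration; only the counts come from the description.

```python
# Domain counts as stated in the dataset description.
# The key names are illustrative labels, not identifiers from the benchmark.
DOMAIN_COUNTS = {
    "userspace": 520,      # OSS-Fuzz/CyberGym userspace programs
    "v8": 185,             # V8 JavaScript engine bugs
    "linux_kernel": 193,   # Linux kernel privilege-escalation vulnerabilities
}


def domain_shares(counts):
    """Return each domain's share of the total, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(DOMAIN_COUNTS.values())
shares = domain_shares(DOMAIN_COUNTS)

print(total)   # 898
print(shares)  # {'userspace': 57.9, 'v8': 20.6, 'linux_kernel': 21.5}
```

The total matches the stated dataset size of 898, and userspace programs make up a little under three fifths of the instances.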

Number of Tasks

3

Vulnerability Exploitation, Exploit Generation, Privilege Escalation
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | Model ID | Provider | Success rate | Success count | Evaluated by | Date | Source
1 | Claude Mythos Preview | claude-mythos-preview | Anthropic | 17.5% | 17.5% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 | Link
2 | GPT-5.5 | gpt-5-5 | OpenAI | 13.4% | 13.4% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 | Link
3 | GPT-5.4 | gpt-5-4 | OpenAI | 6.0% | 6.0% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 | Link
4 | Claude Opus 4.6 | claude-opus-4-6 | Anthropic | 1.7% | 1.7% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 | Link
5 | Claude Opus 4.7 | claude-opus-4-7 | Anthropic | 1.1% | 1.1% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 | Link
6 | Gemini 3.1 Pro | gemini-3-1-pro | Google | 0.8% | 0.8% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 | Link
7 | GLM-5.1 | glm-5-1 | Zhipu AI | 0.5% | 0.5% | ExploitGym authors (Google, Anthropic, CISPA, ASU, UCSB, UIUC, ETH Zurich) | May 19, 2026 | Link
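
The Quick Stats figures (top score and number of models evaluated) can be cross-checked against the model results listed above. The sketch below is a minimal illustration only: the `LEADERBOARD` list and the `summarize` helper are hypothetical names, and the success rates are copied from the results as published on this page.

```python
# Success rates (percent) copied from the model results on this page.
# LEADERBOARD and summarize are illustrative names, not part of the benchmark code.
LEADERBOARD = [
    ("Claude Mythos Preview", 17.5),
    ("GPT-5.5", 13.4),
    ("GPT-5.4", 6.0),
    ("Claude Opus 4.6", 1.7),
    ("Claude Opus 4.7", 1.1),
    ("Gemini 3.1 Pro", 0.8),
    ("GLM-5.1", 0.5),
]


def summarize(rows):
    """Rank rows by success rate (descending) and derive headline stats."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    top_name, top_rate = ranked[0]
    return {
        "models_evaluated": len(ranked),
        "top_model": top_name,
        "top_score": top_rate,
        # Gap between first and second place, in percentage points.
        "gap_to_second": round(top_rate - ranked[1][1], 1),
    }


summary = summarize(LEADERBOARD)
print(summary)
```

Under these inputs the summary reports 7 models and a top score of 17.5, consistent with the Quick Stats section.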
Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub