Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis · CVE Exploitation · PoC Generation

ExploitBench

Measures how far AI agents climb the exploitation ladder on the production V8 JavaScript engine, from reaching vulnerable code to achieving arbitrary code execution.

Quick Stats

Top Score

69.0%

Models Evaluated

7

Dataset Size

41 samples

Last Updated

May 1, 2026

Availability

Dataset ✓ · Code ✓
Metrics Tracked
ace rate, mean score
Sources
GitHub
Dataset Information

41 real V8 CVEs graded across 5 tiers (T5 coverage → T1 full arbitrary code execution) with 16 capabilities measured per run. Runs are graded against production V8 with the security sandbox enabled. All grading is deterministic (no LLM-as-judge).
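
This page does not define how "ace rate" and "mean score" are computed, so the following is only a minimal sketch under stated assumptions: that ace rate is the share of runs reaching T1 (full arbitrary code execution), and that mean score is the average fraction of the 16 capabilities demonstrated per run. The `RunResult` record, its field names, and the CVE identifiers are hypothetical and are not taken from the ExploitBench code.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_CAPABILITIES = 16  # from the dataset description: 16 capabilities per run
ACE_TIER = 1           # T1 = full arbitrary code execution; T5 = coverage


@dataclass(frozen=True)
class RunResult:
    """Hypothetical per-run record; not the benchmark's actual schema."""
    cve_id: str                # placeholder identifier for the targeted CVE
    best_tier: Optional[int]   # 5 (coverage) down to 1 (ACE); None if no tier reached
    capabilities: int          # number of the 16 capabilities demonstrated


def ace_rate(runs: List[RunResult]) -> float:
    """ASSUMPTION: fraction of runs whose best tier is T1."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r.best_tier == ACE_TIER) / len(runs)


def mean_score(runs: List[RunResult]) -> float:
    """ASSUMPTION: mean fraction of capabilities demonstrated per run."""
    if not runs:
        return 0.0
    return sum(r.capabilities / NUM_CAPABILITIES for r in runs) / len(runs)


# Made-up example data, for illustration only.
example_runs = [
    RunResult("CVE-A", 1, 16),
    RunResult("CVE-B", 3, 8),
    RunResult("CVE-C", None, 0),
    RunResult("CVE-D", 1, 12),
]
print(f"ace rate:   {ace_rate(example_runs):.1%}")
print(f"mean score: {mean_score(example_runs):.1%}")
```

Both functions are pure arithmetic over recorded outcomes, which matches the page's statement that grading is deterministic; the actual formulas used by the benchmark may differ.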

Number of Tasks

3

V8 Exploit Synthesis · Arbitrary Code Execution · Sandbox Escape
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model                 | Variant                          | Provider  | ace rate | mean score | Evaluated By              | Date
1    | Claude Mythos Preview | claude-mythos-preview-autonudge  | Anthropic | 69.0%    | 61.9%      | ExploitBench authors (CMU) | May 1, 2026
2    | Claude Mythos Preview | claude-mythos-preview            | Anthropic | 68.0%    | 59.7%      | ExploitBench authors (CMU) | May 1, 2026
3    | GPT-5.5               | gpt-5-5-codex-autonudge          | OpenAI    | 41.0%    | 34.4%      | ExploitBench authors (CMU) | May 1, 2026
4    | GPT-5.5               | gpt-5-5-autonudge                | OpenAI    | 34.0%    | 27.8%      | ExploitBench authors (CMU) | May 1, 2026
5    | GPT-5.5               | gpt-5-5-codex                    | OpenAI    | 33.0%    | 26.9%      | ExploitBench authors (CMU) | May 1, 2026
6    | GPT-5.5               | gpt-5-5                          | OpenAI    | 29.0%    | 23.5%      | ExploitBench authors (CMU) | May 1, 2026
7    | Claude Opus 4.7       | claude-opus-4-7-autonudge        | Anthropic | 27.0%    | 22.9%      | ExploitBench authors (CMU) | May 1, 2026
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub