Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis, PoC Generation, Vulnerability Reasoning, CVE Exploitation

CyberGym

Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

View Paper
Quick Stats

Top Score: 17.8%
Models Evaluated: 9
Dataset Size: 1,507 samples
Last Updated: June 3, 2025
Availability: Dataset ✓, Code ✓
Metrics Tracked
vulnerability reproduction rate, post-patch vulnerability rate
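As a rough illustration of how the two tracked metrics could be computed from per-task outcomes, here is a minimal Python sketch. It rests on assumptions this page does not state: that a proof of concept (PoC) counts as a reproduction when it crashes the pre-patch build but not the patched build, and counts toward the post-patch rate when it still crashes the patched build. The `PocOutcome` record and both function names are hypothetical and are not taken from the CyberGym codebase.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PocOutcome:
    """Hypothetical record of running one generated PoC against both builds."""
    crashes_pre_patch: bool   # PoC triggers a crash on the vulnerable build
    crashes_post_patch: bool  # PoC still triggers a crash on the patched build


def reproduction_rate(outcomes: Sequence[PocOutcome]) -> float:
    """Percent of tasks where the PoC crashes only the pre-patch build (assumed definition)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    hits = sum(o.crashes_pre_patch and not o.crashes_post_patch for o in outcomes)
    return 100.0 * hits / len(outcomes)


def post_patch_rate(outcomes: Sequence[PocOutcome]) -> float:
    """Percent of tasks where the PoC crashes the patched build (assumed definition)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    hits = sum(o.crashes_post_patch for o in outcomes)
    return 100.0 * hits / len(outcomes)


# Toy data: four tasks, one clean reproduction and one post-patch crash.
outcomes = [
    PocOutcome(crashes_pre_patch=True, crashes_post_patch=False),
    PocOutcome(crashes_pre_patch=True, crashes_post_patch=True),
    PocOutcome(crashes_pre_patch=False, crashes_post_patch=False),
    PocOutcome(crashes_pre_patch=False, crashes_post_patch=False),
]
print(reproduction_rate(outcomes))  # 25.0
print(post_patch_rate(outcomes))    # 25.0
```

Under these assumed definitions the two rates never count the same task twice, since any PoC that crashes the patched build is excluded from the reproduction count.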
Sources
Project, Code, Dataset, Leaderboard
Dataset Information

1,507 historical vulnerabilities from 188 large software projects, sourced from the OSS-Fuzz continuous fuzzing campaign

Number of Tasks: 3

  • Vulnerability Reproduction
  • PoC Generation
  • Open-Ended Discovery
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
All results were evaluated by the Frontier AI Cybersecurity Observatory and are dated June 11, 2025.

Rank  Model                          Model ID                      Provider   Vulnerability reproduction rate  Post-patch vulnerability rate
1     OpenHands + Claude Sonnet 4    openhands-claude-sonnet-4     Anthropic  17.8%                            2.0%
2     OpenHands + Claude 3.7 Sonnet  openhands-claude-3.7-sonnet   Anthropic  11.9%                            2.2%
3     OpenHands + GPT-4.1            openhands-gpt-4.1             OpenAI     9.4%                             1.3%
4     Cybench + GPT-4.1              cybench-gpt-4.1               OpenAI     9.0%                             2.3%
5     Codex + GPT-4.1                codex-gpt-4.1                 OpenAI     7.4%                             1.2%
6     ENiGMA + GPT-4.1               enigma-gpt-4.1                OpenAI     7.2%                             1.9%
7     OpenHands + Gemini 2.5 Flash   openhands-gemini-2.5-flash    Google     4.8%                             0.8%
8     OpenHands + DeepSeek V3        openhands-deepseek-v3         DeepSeek   3.6%                             0.7%
9     OpenHands + GPT o4-mini        openhands-gpt-o4-mini         OpenAI     2.5%                             0.1%
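To give the percentages a sense of scale, the sketch below converts the reported vulnerability reproduction rates into approximate instance counts. It assumes each rate is computed over all 1,507 samples, which this page does not state explicitly. Because the published rates are rounded to one decimal place, the counts are estimates only. The `approx_count` helper is hypothetical.

```python
DATASET_SIZE = 1507  # samples, as listed under Dataset Size

# Vulnerability reproduction rates (percent) as reported in the results table.
reproduction_rates = {
    "openhands-claude-sonnet-4": 17.8,
    "openhands-claude-3.7-sonnet": 11.9,
    "openhands-gpt-4.1": 9.4,
    "cybench-gpt-4.1": 9.0,
    "codex-gpt-4.1": 7.4,
    "enigma-gpt-4.1": 7.2,
    "openhands-gemini-2.5-flash": 4.8,
    "openhands-deepseek-v3": 3.6,
    "openhands-gpt-o4-mini": 2.5,
}


def approx_count(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Estimate how many samples a rounded percentage corresponds to."""
    return round(rate_percent / 100 * total)


for model_id, rate in reproduction_rates.items():
    print(f"{model_id}: {rate}% is roughly {approx_count(rate)} of {DATASET_SIZE}")
```

On this assumption, the top score of 17.8% corresponds to roughly 268 reproduced vulnerabilities, and the lowest score of 2.5% to roughly 38.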
Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub