Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis, Bug Bounty, PoC Generation, Patch Validation

BountyBench

Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

View Paper
Quick Stats

Top Score

5.0% (detect success-rate)

Models Evaluated

5

Dataset Size

40 samples

Last Updated

May 21, 2025

Availability

Dataset ✓, Code ✓
Metrics Tracked
detect success-rate, exploit success-rate, patch success-rate
Sources
Project, Code, Leaderboard
Dataset Information

25 systems with complex real-world codebases and 40 bug bounties covering 9 of OWASP Top 10 Risks

Number of Tasks

4

Vulnerability Detection, Exploit Generation, Patch Generation, Defense Evaluation
[Figure: Performance Comparison. Visual comparison of model performance on this benchmark.]
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | detect success-rate | exploit success-rate | patch success-rate | Evaluated By | Date | Source
1 | Claude Code (claude-code • Anthropic) | 5.0% | 57.5% | 87.5% | Stanford CRFM | May 21, 2025 | Link
2 | OpenAI Codex CLI (openai-codex-cli • OpenAI) | 5.0% | 32.5% | 90.0% | Stanford CRFM | May 21, 2025 | Link
3 | C-Agent: Claude 3.7 (c-agent-claude-3.7 • Anthropic) | 5.0% | 67.5% | 60.0% | Stanford CRFM | May 21, 2025 | Link
4 | C-Agent: Gemini 2.5 (c-agent-gemini-2.5 • Google) | 2.5% | 40.0% | 45.0% | Stanford CRFM | May 21, 2025 | Link
5 | C-Agent: GPT-4.1 (c-agent-gpt-4.1 • OpenAI) | 0.0% | 55.0% | 50.0% | Stanford CRFM | May 21, 2025 | Link
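
The success rates in the model results can be read as counts of bounties. The sketch below is illustrative only: it assumes every rate is measured over the full 40-bounty dataset listed under Dataset Size, which this page does not state explicitly. The names `RESULTS`, `DATASET_SIZE`, `rate_to_count`, and `counts_by_model` are hypothetical and not part of any BountyBench tooling.

```python
# Success rates (percent) copied from the Model Results listing:
# (detect, exploit, patch) for each agent.
RESULTS = {
    "Claude Code": (5.0, 57.5, 87.5),
    "OpenAI Codex CLI": (5.0, 32.5, 90.0),
    "C-Agent: Claude 3.7": (5.0, 67.5, 60.0),
    "C-Agent: Gemini 2.5": (2.5, 40.0, 45.0),
    "C-Agent: GPT-4.1": (0.0, 55.0, 50.0),
}

# Assumption: each rate uses all 40 bounties as its denominator.
DATASET_SIZE = 40


def rate_to_count(rate_percent, total=DATASET_SIZE):
    """Convert a percentage success rate into a whole number of bounties."""
    return round(rate_percent * total / 100)


def counts_by_model(results=RESULTS):
    """Map each model to its (detect, exploit, patch) bounty counts."""
    return {
        model: tuple(rate_to_count(rate) for rate in rates)
        for model, rates in results.items()
    }


if __name__ == "__main__":
    for model, (detect, exploit, patch) in counts_by_model().items():
        print(f"{model}: detect={detect}, exploit={exploit}, patch={patch}")
```

Under that assumption, the top detect success-rate of 5.0% corresponds to 2 of 40 bounties, and Claude Code's row corresponds to 2 detections, 23 exploits, and 35 patches.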
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub