Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
CTF Challenges, Jeopardy CTF, Pwn, Reverse Engineering, Web Exploitation, Cryptography

Cybench

A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

View Paper
Quick Stats

Top Score

17.5%

Models Evaluated

8

Dataset Size

40 samples

Last Updated

August 15, 2024

Availability

Dataset ✓, Code ✓
Metrics Tracked
task completion-rate, subtask guided-completion-rate, subtask completion
Sources
Project, Leaderboard
Dataset Information

40 professional-level CTF tasks from 4 distinct competitions with subtask breakdowns for detailed evaluation

Number of Tasks

40

Professional CTF, Vulnerability Exploitation, Reverse Engineering, Cryptography
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                    Model ID                    Provider    task completion-rate  subtask guided-completion-rate  subtask completion
1     Claude 3.5               claude-3-5-sonnet-20241022  Anthropic   17.5%                 15.0%                           43.9%
2     GPT-4o                   gpt-4o                      OpenAI      12.5%                 17.5%                           28.7%
3     Claude 3                 claude-3-opus-20240229      Anthropic   10.0%                 12.5%                           36.8%
4     OpenAI o1-preview        o1-preview                  OpenAI      10.0%                 10.0%                           46.8%
5     Llama 3.1 405B Instruct  llama-3.1-405b-instruct     Meta        7.5%                  15.0%                           20.5%
6     Mixtral 8x22B Instruct   mixtral-8x22b-instruct      Mistral AI  7.5%                  5.0%                            15.2%
7     Gemini 1.5 Pro           gemini-1.5-pro              Google      7.5%                  5.0%                            11.7%
8     Llama 3 70B Chat         llama-3-70b-chat            Meta        5.0%                  7.5%                            8.2%

All results evaluated by Stanford CRFM, August 19, 2024.
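
The task completion-rate values in Model Results are all multiples of 2.5%, which is what one would expect if each rate is a count of solved tasks divided by the 40 tasks in the dataset. A minimal sketch of that arithmetic follows; the assumption that the rate equals solved tasks over 40 is ours, and the names `TOTAL_TASKS` and `implied_solved` are hypothetical helpers, not part of any Cybench tooling.

```python
# Assumption: task completion-rate = (tasks solved) / (total tasks),
# with 40 tasks in total as stated under Dataset Size.
TOTAL_TASKS = 40

# task completion-rate values (percent) copied from Model Results.
task_completion_rate = {
    "Claude 3.5": 17.5,
    "GPT-4o": 12.5,
    "Claude 3": 10.0,
    "OpenAI o1-preview": 10.0,
    "Llama 3.1 405B Instruct": 7.5,
    "Mixtral 8x22B Instruct": 7.5,
    "Gemini 1.5 Pro": 7.5,
    "Llama 3 70B Chat": 5.0,
}


def implied_solved(rate_percent, total=TOTAL_TASKS):
    """Return the whole number of tasks implied by a percentage rate."""
    return round(rate_percent / 100 * total)


# Map each model to the solved-task count its rate implies.
solved = {model: implied_solved(rate) for model, rate in task_completion_rate.items()}

for model, count in solved.items():
    print(f"{model}: {count}/{TOTAL_TASKS}")
```

Under this assumption the top score of 17.5% corresponds to 7 of 40 tasks. The subtask completion column is not covered by this sketch, since it is not reported in steps of 2.5%.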
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub