CTF Challenges, Jeopardy CTF, Pwn, Reverse Engineering, Web Exploitation, Cryptography

Cybench

A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Quick Stats

Top Score

17.5%

Models Evaluated

8

Dataset Size

40 samples

Last Updated

August 15, 2024

Availability

Dataset ✓, Code ✓

Metrics Tracked

task completion-rate, subtask guided-completion-rate, subtask completion

Sources

Project, Leaderboard

Dataset Information

40 professional-level CTF tasks from 4 distinct competitions with subtask breakdowns for detailed evaluation

Number of Tasks

40

Professional CTF, Vulnerability Exploitation, Reverse Engineering, Cryptography

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	Model ID	Provider	task completion-rate	subtask guided-completion-rate	subtask completion	Evaluated By	Date
1	Claude 3.5	claude-3-5-sonnet-20241022	Anthropic	17.5%	15.0%	43.9%	Stanford CRFM	August 19, 2024
2	GPT-4o	gpt-4o	OpenAI	12.5%	17.5%	28.7%	Stanford CRFM	August 19, 2024
3	Claude 3	claude-3-opus-20240229	Anthropic	10.0%	12.5%	36.8%	Stanford CRFM	August 19, 2024
4	OpenAI o1-preview	o1-preview	OpenAI	10.0%	10.0%	46.8%	Stanford CRFM	August 19, 2024
5	Llama 3.1 405B Instruct	llama-3.1-405b-instruct	Meta	7.5%	15.0%	20.5%	Stanford CRFM	August 19, 2024
6	Mixtral 8x22B Instruct	mixtral-8x22b-instruct	Mistral AI	7.5%	5.0%	15.2%	Stanford CRFM	August 19, 2024
7	Gemini 1.5 Pro	gemini-1.5-pro	Google	7.5%	5.0%	11.7%	Stanford CRFM	August 19, 2024
8	Llama 3 70B Chat	llama-3-70b-chat	Meta	5.0%	7.5%	8.2%	Stanford CRFM	August 19, 2024
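
The task completion-rate values above can be read as counts out of the 40 tasks in the dataset. The following is a minimal sketch of that conversion, assuming each rate is simply the number of solved tasks divided by 40, so that one task corresponds to 2.5 percentage points. The names `DATASET_SIZE`, `task_completion_rate`, and `solved_tasks` are illustrative and are not part of the benchmark's own code.

```python
# Sketch: convert task completion-rate percentages into solved-task counts.
# Assumption: each rate equals solved_tasks / 40, as the dataset size suggests.
DATASET_SIZE = 40

# Values copied from the task completion-rate column of the results table.
task_completion_rate = {
    "claude-3-5-sonnet-20241022": 17.5,
    "gpt-4o": 12.5,
    "claude-3-opus-20240229": 10.0,
    "o1-preview": 10.0,
    "llama-3.1-405b-instruct": 7.5,
    "mixtral-8x22b-instruct": 7.5,
    "gemini-1.5-pro": 7.5,
    "llama-3-70b-chat": 5.0,
}


def solved_tasks(rate_percent, total=DATASET_SIZE):
    """Return the task count implied by a percentage rate.

    Raises ValueError if the rate does not map to a whole number of tasks,
    which would mean the rate was not computed over `total` tasks.
    """
    count = rate_percent * total / 100
    if abs(count - round(count)) > 1e-9:
        raise ValueError(f"{rate_percent}% is not a whole number of {total} tasks")
    return round(count)


solved = {model: solved_tasks(rate) for model, rate in task_completion_rate.items()}

for model, count in solved.items():
    print(f"{model}: {count}/{DATASET_SIZE} tasks")
```

Under this assumption the top score of 17.5% corresponds to 7 of 40 tasks. The same conversion does not apply to the subtask completion column, whose values (for example 43.9%) are not multiples of 2.5% and so must be computed over a different denominator.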