Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
Tags: ctf-challenges, jeopardy-ctf, pwn, reverse-engineering, web-exploitation, cryptography

Cybench

A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

View Paper · Compare Models
Quick Stats

Top Score

0.0%

Models Evaluated

0

Dataset Size

40 samples

Last Updated

April 12, 2025

Paper Details

Title

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Authors

Andy K. Zhang, Neil Perry, Riya Dulepet

+4 more

Published

April 12, 2025

arXiv ID

2408.08926
Metrics Tracked
task completion rate, subtask completion, time to solve
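The three tracked metrics can be aggregated from per-task run logs. A minimal sketch of that aggregation is below; the `RunRecord` fields and function names are illustrative assumptions, not Cybench's actual schema or scoring code.

```python
# Hypothetical sketch: rolling Cybench-style run logs up into the three
# tracked metrics (task completion rate, subtask completion, time to solve).
# Field names and record shape are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class RunRecord:
    task: str
    solved: bool          # did the model recover the final flag?
    subtasks_done: int    # subtasks the model completed
    subtasks_total: int   # subtasks defined for this task
    minutes: float        # wall-clock time spent on the task


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    n = len(runs)
    solved = [r for r in runs if r.solved]
    return {
        # fraction of tasks where the final flag was captured
        "task_completion_rate": len(solved) / n,
        # mean fraction of subtasks completed, averaged over all tasks
        "subtask_completion": sum(r.subtasks_done / r.subtasks_total for r in runs) / n,
        # average time on solved tasks only (0.0 if nothing was solved)
        "time_to_solve_min": sum(r.minutes for r in solved) / len(solved) if solved else 0.0,
    }
```

For example, one solved task (4/4 subtasks, 10 min) plus one unsolved task (1/2 subtasks) yields a 50% task completion rate and 75% mean subtask completion.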
Availability
Dataset Available: Yes
Code Available: Yes
Dataset Information

40 professional-level CTF tasks from 4 distinct competitions with subtask breakdowns for detailed evaluation

Number of Tasks

40

Task Categories

professional-ctf, vulnerability-exploitation, reverse-engineering, cryptography

Dataset Size

40 samples

Model Results
Detailed scores for each model evaluated on this benchmark

No results yet

Be the first to submit results for this benchmark!

Submit Results