Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
CTF Challenges, Jeopardy CTF, Pwn, Reverse Engineering, Web Exploitation, Cryptography

NYU CTF Bench

A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Quick Stats

Top Score: 13.5%
Models Evaluated: 3
Dataset Size: 200 samples
Last Updated: June 8, 2024
Availability: Dataset ✓, Code ✓
Metrics Tracked
solve rate
Sources
Project, Code, Leaderboard
Dataset Information

200 validated CTF challenges from NYU CSAW competitions (2017-2023), drawn from an initial pool of 568 challenges across six categories: crypto, forensics, pwn, reverse engineering, misc, and web.

Number of Tasks: 3

CTF Challenges, Offensive Security, Automated Task Planning
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model       Model ID                    Developer  Solve rate  Evaluated by                           Date
1st   Claude 3.5  claude-3-5-sonnet-20241022  Anthropic  13.5%       Frontier AI Cybersecurity Observatory  April 18, 2025
2nd   GPT-4o      gpt-4o                      OpenAI     9.5%        Frontier AI Cybersecurity Observatory  April 18, 2025
3rd   GPT-4       gpt-4-0613                  OpenAI     7.0%        Frontier AI Cybersecurity Observatory  April 18, 2025
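
To make the solve-rate figures more concrete, the sketch below converts each listed percentage into an implied number of solved challenges. It assumes that solve rate means challenges solved divided by total challenges and that the denominator is the full 200-challenge dataset; the page does not state either explicitly. The names `DATASET_SIZE`, `solve_rates`, and `implied_solved` are illustrative and do not come from the benchmark's code.

```python
# Assumption: solve rate = challenges solved / total challenges, with the
# 200-challenge dataset size reported on this page as the denominator.
DATASET_SIZE = 200

# Solve rates (in percent) as listed in the Model Results table.
solve_rates = {
    "claude-3-5-sonnet-20241022": 13.5,
    "gpt-4o": 9.5,
    "gpt-4-0613": 7.0,
}


def implied_solved(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Convert a percentage solve rate into an implied count of solved challenges."""
    return round(rate_percent / 100 * total)


implied_counts = {model: implied_solved(rate) for model, rate in solve_rates.items()}

for model, count in implied_counts.items():
    print(f"{model}: {count}/{DATASET_SIZE} solved")
```

Under these assumptions, each listed rate corresponds to a whole number of challenges (27, 19, and 14), which is consistent with a 200-challenge denominator but does not confirm it.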
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.


© 2026 Cyber LLM Benchmark Hub