
A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
Top Score: 13.5%
Models Evaluated: 3
Dataset Size: 200 samples
Last Updated: June 8, 2024
Availability: 200 validated CTF challenges from NYU CSAW competitions (2017-2023), drawn from an initial pool of 568 challenges across six categories: crypto, forensics, pwn, reverse engineering, misc, and web.
Number of Tasks: 3
| Rank | Model | Solve Rate | Evaluated By | Date | Source |
|---|---|---|---|---|---|
| 1st | Claude 3.5 (`claude-3-5-sonnet-20241022`) • Anthropic | 13.5% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 2nd | GPT-4o (`gpt-4o`) • OpenAI | 9.5% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
| 3rd | GPT-4 (`gpt-4-0613`) • OpenAI | 7.0% | Frontier AI Cybersecurity Observatory | April 18, 2025 | Link |
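
To make the percentages concrete, the sketch below converts each reported solve rate into an approximate count of solved challenges. It assumes the solve rate is the number of solved challenges divided by the full set of 200 validated challenges; the page does not state the denominator explicitly, so treat that as an assumption. The names `solved_count`, `DATASET_SIZE`, and `INITIAL_POOL` are illustrative and not part of the benchmark's tooling.

```python
# Figures copied from the leaderboard and dataset description above.
DATASET_SIZE = 200   # validated challenges
INITIAL_POOL = 568   # challenges before validation

# Solve rates in percent, keyed by model identifier.
solve_rates = {
    "claude-3-5-sonnet-20241022": 13.5,
    "gpt-4o": 9.5,
    "gpt-4-0613": 7.0,
}


def solved_count(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Approximate number of solved challenges implied by a solve rate.

    Assumption: the rate is measured over all `total` challenges.
    """
    return round(rate_percent / 100 * total)


# Implied solved-challenge counts per model.
counts = {model: solved_count(rate) for model, rate in solve_rates.items()}

# Share of the initial pool that survived validation, in percent.
retained = round(DATASET_SIZE / INITIAL_POOL * 100, 1)

for model, n in counts.items():
    print(f"{model}: about {n} of {DATASET_SIZE} challenges")
print(f"Validated share of initial pool: {retained}%")
```

Under that assumption, the top score of 13.5% corresponds to roughly 27 of 200 challenges, and the gap between first and third place is about 13 challenges.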