
A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Top Score: 17.5%
Models Evaluated: 8
Dataset Size: 40 samples
Last Updated: August 15, 2024
Availability: 40 professional-level CTF tasks from 4 distinct competitions, with subtask breakdowns for detailed evaluation
Number of Tasks: 4
| Rank | Model | Model ID | Provider | Task completion rate | Subtask-guided completion rate | Subtask completion | Evaluated by | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude 3.5 | claude-3-5-sonnet-20241022 | Anthropic | 17.5% | 15.0% | 43.9% | Stanford CRFM | August 19, 2024 |
| 2 | GPT-4o | gpt-4o | OpenAI | 12.5% | 17.5% | 28.7% | Stanford CRFM | August 19, 2024 |
| 3 | Claude 3 | claude-3-opus-20240229 | Anthropic | 10.0% | 12.5% | 36.8% | Stanford CRFM | August 19, 2024 |
| 4 | OpenAI o1-preview | o1-preview | OpenAI | 10.0% | 10.0% | 46.8% | Stanford CRFM | August 19, 2024 |
| 5 | Llama 3.1 405B Instruct | llama-3.1-405b-instruct | Meta | 7.5% | 15.0% | 20.5% | Stanford CRFM | August 19, 2024 |
| 6 | Mixtral 8x22B Instruct | mixtral-8x22b-instruct | Mistral AI | 7.5% | 5.0% | 15.2% | Stanford CRFM | August 19, 2024 |
| 7 | Gemini 1.5 Pro | gemini-1.5-pro | Google | 7.5% | 5.0% | 11.7% | Stanford CRFM | August 19, 2024 |
| 8 | Llama 3 70B Chat | llama-3-70b-chat | Meta | 5.0% | 7.5% | 8.2% | Stanford CRFM | August 19, 2024 |
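
The task completion rates above can be read as whole-task counts. The sketch below assumes each rate is the fraction of the 40 tasks a model solved, which the page does not state explicitly but which is consistent with every listed rate mapping to a whole number. The helper name `solved_count` is hypothetical and used only for illustration.

```python
# Assumption: "task completion rate" = solved tasks / 40 total tasks.
TOTAL_TASKS = 40


def solved_count(rate_percent: float, total_tasks: int = TOTAL_TASKS) -> int:
    """Convert a completion rate in percent into a whole number of tasks."""
    return round(rate_percent / 100 * total_tasks)


# Task completion rates copied from the leaderboard table.
task_completion = {
    "claude-3-5-sonnet-20241022": 17.5,
    "gpt-4o": 12.5,
    "claude-3-opus-20240229": 10.0,
    "o1-preview": 10.0,
    "llama-3.1-405b-instruct": 7.5,
    "mixtral-8x22b-instruct": 7.5,
    "gemini-1.5-pro": 7.5,
    "llama-3-70b-chat": 5.0,
}

for model, rate in task_completion.items():
    print(f"{model}: {solved_count(rate)} of {TOTAL_TASKS} tasks")
```

Under this assumption, the top score of 17.5% corresponds to 7 of 40 tasks. The subtask completion column would not convert the same way, since subtasks are presumably counted over a different total.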