
Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Top Score: 17.8%
Models Evaluated: 9
Dataset Size: 1,507 samples
Last Updated: June 3, 2025
Availability: 1,507 historical vulnerabilities from 188 large software projects, sourced from the OSS-Fuzz continuous fuzzing campaign
Number of Tasks: 3

| Rank | Model | Vulnerability Reproduction Rate | Post-Patch Vulnerability Rate | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|
| 1 | OpenHands + Claude Sonnet 4 (`openhands-claude-sonnet-4`, Anthropic) | 17.8% | 2.0% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 2 | OpenHands + Claude 3.7 Sonnet (`openhands-claude-3.7-sonnet`, Anthropic) | 11.9% | 2.2% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 3 | OpenHands + GPT-4.1 (`openhands-gpt-4.1`, OpenAI) | 9.4% | 1.3% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 4 | Cybench + GPT-4.1 (`cybench-gpt-4.1`, OpenAI) | 9.0% | 2.3% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 5 | Codex + GPT-4.1 (`codex-gpt-4.1`, OpenAI) | 7.4% | 1.2% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 6 | ENiGMA + GPT-4.1 (`enigma-gpt-4.1`, OpenAI) | 7.2% | 1.9% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 7 | OpenHands + Gemini 2.5 Flash (`openhands-gemini-2.5-flash`, Google) | 4.8% | 0.8% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 8 | OpenHands + DeepSeek V3 (`openhands-deepseek-v3`, DeepSeek) | 3.6% | 0.7% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
| 9 | OpenHands + GPT o4-mini (`openhands-gpt-o4-mini`, OpenAI) | 2.5% | 0.1% | Frontier AI Cybersecurity Observatory | June 11, 2025 | Link |
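
To give a feel for what the percentages mean in absolute terms, the sketch below converts each reported rate into an approximate sample count. It assumes that both rates are computed over the full 1,507-sample dataset, which the page does not state explicitly. The helper names `approx_count` and `summarize` are hypothetical, and because the published rates are rounded to one decimal place, the resulting counts are estimates rather than official figures.

```python
# Assumption: both rates are fractions of the full dataset.
DATASET_SIZE = 1507  # "Dataset Size" as listed on the page

# (agent + model, reproduction rate %, post-patch vulnerability rate %),
# copied from the leaderboard table.
ROWS = [
    ("OpenHands + Claude Sonnet 4", 17.8, 2.0),
    ("OpenHands + Claude 3.7 Sonnet", 11.9, 2.2),
    ("OpenHands + GPT-4.1", 9.4, 1.3),
    ("Cybench + GPT-4.1", 9.0, 2.3),
    ("Codex + GPT-4.1", 7.4, 1.2),
    ("ENiGMA + GPT-4.1", 7.2, 1.9),
    ("OpenHands + Gemini 2.5 Flash", 4.8, 0.8),
    ("OpenHands + DeepSeek V3", 3.6, 0.7),
    ("OpenHands + GPT o4-mini", 2.5, 0.1),
]


def approx_count(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Hypothetical helper: nearest whole number of samples for a given rate."""
    return round(rate_percent / 100 * total)


def summarize(rows):
    """Return (name, ~reproduced samples, ~post-patch findings) per entry."""
    return [(name, approx_count(repro), approx_count(post))
            for name, repro, post in rows]


if __name__ == "__main__":
    for name, repro, post in summarize(ROWS):
        print(f"{name}: ~{repro} reproduced, ~{post} post-patch")
```

Under that assumption, even the top entry reproduces only a few hundred of the 1,507 vulnerabilities, which is a useful sanity check when comparing entries whose rates differ by a fraction of a percentage point.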