
Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

| Metric | Value |
|---|---|
| Top score | 5.0% |
| Models evaluated | 5 |
| Dataset size | 40 samples |
| Last updated | May 21, 2025 |
| Availability | 25 systems with complex real-world codebases and 40 bug bounties covering 9 of the OWASP Top 10 risks |
| Number of tasks | 4 |

| Rank | Model | Organization | Detect success rate | Exploit success rate | Patch success rate | Evaluated by | Date | Source |
|---|---|---|---|---|---|---|---|---|
| 1st | Claude Code (`claude-code`) | Anthropic | 5.0% | 57.5% | 87.5% | Stanford CRFM | May 21, 2025 | Link |
| 2nd | OpenAI Codex CLI (`openai-codex-cli`) | OpenAI | 5.0% | 32.5% | 90.0% | Stanford CRFM | May 21, 2025 | Link |
| 3rd | C-Agent: Claude 3.7 (`c-agent-claude-3.7`) | Anthropic | 5.0% | 67.5% | 60.0% | Stanford CRFM | May 21, 2025 | Link |
| 4th | C-Agent: Gemini 2.5 (`c-agent-gemini-2.5`) | Google | 2.5% | 40.0% | 45.0% | Stanford CRFM | May 21, 2025 | Link |
| 5th | C-Agent: GPT-4.1 (`c-agent-gpt-4.1`) | OpenAI | 0.0% | 55.0% | 50.0% | Stanford CRFM | May 21, 2025 | Link |
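
As a quick consistency check, the success rates in the table can be converted back into bounty counts. The sketch below assumes (this is not stated on the leaderboard) that every rate is computed over all 40 bounties in the dataset. The dictionary keys reuse the model IDs from the table, while `RATES`, `DATASET_SIZE`, and `to_count` are hypothetical names introduced only for illustration. Every listed rate is a multiple of 2.5%, i.e. 1/40, so each one should convert to a whole number of bounties.

```python
# Hypothetical sanity check: ASSUMES each success rate is
# (number of successful bounties) / 40, matching the stated dataset size.
DATASET_SIZE = 40

# Rates copied from the leaderboard, in percent: (detect, exploit, patch).
RATES = {
    "claude-code": (5.0, 57.5, 87.5),
    "openai-codex-cli": (5.0, 32.5, 90.0),
    "c-agent-claude-3.7": (5.0, 67.5, 60.0),
    "c-agent-gemini-2.5": (2.5, 40.0, 45.0),
    "c-agent-gpt-4.1": (0.0, 55.0, 50.0),
}


def to_count(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Convert a percentage of `total` to an integer count; raise if it is not whole."""
    count = rate_percent * total / 100
    rounded = round(count)
    if abs(count - rounded) > 1e-9:
        raise ValueError(f"{rate_percent}% of {total} is not a whole number")
    return rounded


# Map each model ID to its (detect, exploit, patch) counts out of 40.
counts = {model: tuple(to_count(r) for r in rates) for model, rates in RATES.items()}

for model, (detect, exploit, patch) in counts.items():
    print(f"{model}: detect {detect}/40, exploit {exploit}/40, patch {patch}/40")
```

Under this assumption, the top detect score of 5.0% corresponds to 2 of 40 bounties, and the first line printed is `claude-code: detect 2/40, exploit 23/40, patch 35/40`.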