
A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Top Score
8.0%
Models Evaluated
3
Dataset Size
40 samples
Last Updated
March 21, 2025
Availability
40 critical-severity CVEs from the National Vulnerability Database (CVSS base score ≥ 9.0), covering 10 types of web application. Each task is delivered as a sandbox of containers with a reference exploit and 8 standardized attack vectors: denial of service, file access, file creation, database modification, database access, unauthorized administrator login, privilege escalation, and outbound service.
Number of Tasks
3
| Rank | Model | Zero-day pass@1 | Zero-day pass@5 | One-day pass@1 | One-day pass@5 | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|---|---|
| 1st | T-Agent + GPT-4o (t-agent-gpt-4o-2024-11-20, OpenAI) | 8.0% | 10.0% | 7.0% | 12.5% | Frontier AI Cybersecurity Observatory | June 18, 2025 | Link |
| 2nd | AutoGPT + GPT-4o (autogpt-gpt-4o-2024-11-20, OpenAI) | 3.0% | 10.0% | 4.5% | 5.0% | Frontier AI Cybersecurity Observatory | June 18, 2025 | Link |
| 3rd | Cy-Agent + GPT-4o (cy-agent-gpt-4o-2024-11-20, OpenAI) | 1.0% | 2.5% | 2.5% | 2.5% | Frontier AI Cybersecurity Observatory | June 18, 2025 | Link |
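
The page does not define the pass-at-1 and pass-at-5 columns. One reading that fits the reported figures assumes five attempts per task: pass-at-1 is the success rate averaged over all attempts, and pass-at-5 is the share of tasks solved in at least one of the five attempts. Under that assumption, every pass-at-5 value in the table is a multiple of 2.5% (1 task in 40) and every pass-at-1 value is a multiple of 0.5% (1 attempt in 200). The sketch below illustrates that reading only. The function names and the per-task outcomes are hypothetical, not the benchmark's actual scoring code or data.

```python
from typing import Sequence

# Assumption: results[i][j] is True when attempt j on task i succeeded.
Results = Sequence[Sequence[bool]]


def pass_at_1(results: Results) -> float:
    """Success rate averaged over every attempt of every task."""
    total_attempts = sum(len(task) for task in results)
    total_successes = sum(sum(task) for task in results)
    return total_successes / total_attempts


def pass_at_k(results: Results, k: int) -> float:
    """Share of tasks with at least one success in the first k attempts."""
    solved = sum(any(task[:k]) for task in results)
    return solved / len(results)


# Synthetic outcomes for 40 tasks with 5 attempts each (NOT real results):
# 4 tasks succeed on 4 of 5 attempts, the other 36 never succeed.
synthetic = [[True, True, True, True, False]] * 4 + [[False] * 5] * 36

print(f"pass-at-1: {pass_at_1(synthetic):.1%}")     # 16 of 200 attempts
print(f"pass-at-5: {pass_at_k(synthetic, 5):.1%}")  # 4 of 40 tasks
```

The synthetic data was chosen so that 16 successful attempts out of 200 and 4 solved tasks out of 40 give 8.0% and 10.0%. This shows that a row such as the top zero-day entry is arithmetically possible under this reading. It does not show how those scores were actually obtained.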