
Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
- Top Score: 18.0% (PoC success rate)
- Models Evaluated: 3
- Dataset Size: 200 samples
- Last Updated: June 13, 2025
- Availability: 200 verified real-world CVE instances in open-source C/C++ projects, each with a reproducible PoC and a gold patch. Instances are generated automatically by a multi-agent scaffold (Preprocessor → Verifier → Evaluator) at a cost of $0.87 per instance.
- Number of Tasks: 3

| Rank | Model | PoC success rate | Patching success rate | Evaluated by | Date | Source |
|---|---|---|---|---|---|---|
| 1st | OpenHands + Claude 3.7 Sonnet (`openhands-claude-3.7-sonnet`, Anthropic) | 18.0% | 34.0% | SEC-bench team | May 12, 2026 | Link |
| 2nd | SWE-agent + Claude 3.7 Sonnet (`swe-agent-claude-3.7-sonnet`, Anthropic) | 12.5% | 31.5% | SEC-bench team | May 12, 2026 | Link |
| 3rd | Aider + Claude 3.7 Sonnet (`aider-claude-3.7-sonnet`, Anthropic) | 3.0% | 23.5% | SEC-bench team | May 12, 2026 | Link |
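
To make the percentages concrete, the sketch below converts each success rate into an approximate number of solved instances and estimates the total cost of generating the dataset. It assumes every rate was computed over all 200 instances, which the leaderboard does not state explicitly. The names `solved_count` and `summarize` are illustrative helpers, not part of any SEC-bench tooling.

```python
# Values copied from the leaderboard; everything else is illustrative.
DATASET_SIZE = 200        # verified CVE instances
COST_PER_INSTANCE = 0.87  # USD to generate one instance

# (PoC success rate %, patching success rate %) per agent scaffold.
RESULTS = {
    "OpenHands + Claude 3.7 Sonnet": (18.0, 34.0),
    "SWE-agent + Claude 3.7 Sonnet": (12.5, 31.5),
    "Aider + Claude 3.7 Sonnet": (3.0, 23.5),
}


def solved_count(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Convert a percentage into a whole number of instances.

    Assumption: the rate was measured over all `total` instances.
    """
    return round(rate_percent * total / 100)


def summarize() -> dict[str, tuple[int, int]]:
    """Map each scaffold to (PoC instances solved, patches solved)."""
    return {
        name: (solved_count(poc), solved_count(patch))
        for name, (poc, patch) in RESULTS.items()
    }


if __name__ == "__main__":
    for name, (poc, patch) in summarize().items():
        print(f"{name}: {poc} PoCs, {patch} patches out of {DATASET_SIZE}")
    total_cost = DATASET_SIZE * COST_PER_INSTANCE
    print(f"Estimated dataset generation cost: ${total_cost:.2f}")
```

Under that assumption, the top PoC score of 18.0% corresponds to 36 of 200 instances, and generating all 200 instances at $0.87 each comes to about $174.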