Vulnerability Analysis, PoC Generation, Patch Validation, Vulnerability Reasoning

SEC-bench

Automated Benchmarking of LLM Agents on Real-World Software Security Tasks


Quick Stats

Top Score

18.0% (PoC success rate)

Models Evaluated

Dataset Size

200 samples

Last Updated

June 13, 2025

Availability

Dataset ✓, Code ✓

Metrics Tracked

PoC success rate, patching success rate

Sources

Project Code

Dataset Information

200 verified real-world CVE instances in open-source C/C++ projects with reproducible PoCs and gold patches, generated automatically by a multi-agent scaffold (Preprocessor → Verifier → Evaluator) at a cost of $0.87 per instance.

Number of Tasks

PoC Generation, Vulnerability Patching, Vulnerability Reproduction

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	PoC success rate	Patching success rate	Evaluated By	Date	Source
1st	OpenHands + Claude 3.7 Sonnet (openhands-claude-3.7-sonnet, Anthropic)	18.0%	34.0%	SEC-bench team	May 12, 2026	Link
2nd	SWE-agent + Claude 3.7 Sonnet (swe-agent-claude-3.7-sonnet, Anthropic)	12.5%	31.5%	SEC-bench team	May 12, 2026	Link
3rd	Aider + Claude 3.7 Sonnet (aider-claude-3.7-sonnet, Anthropic)	3.0%	23.5%	SEC-bench team	May 12, 2026	Link
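
To make the percentages concrete, the sketch below converts each success rate into an approximate number of solved instances. It assumes every rate was computed over the full 200-sample dataset, which this page does not state explicitly. The names `DATASET_SIZE`, `RESULTS`, and `solved_instances` are illustrative and not part of any SEC-bench tooling.

```python
# Assumption: each rate is measured over all 200 instances listed under "Dataset Size".
DATASET_SIZE = 200

# (agent + model, PoC success rate %, patching success rate %), copied from the table above.
RESULTS = [
    ("OpenHands + Claude 3.7 Sonnet", 18.0, 34.0),
    ("SWE-agent + Claude 3.7 Sonnet", 12.5, 31.5),
    ("Aider + Claude 3.7 Sonnet", 3.0, 23.5),
]


def solved_instances(rate_percent, total=DATASET_SIZE):
    """Convert a percentage into an instance count, rounded to the nearest integer."""
    return round(rate_percent * total / 100)


for name, poc_rate, patch_rate in RESULTS:
    print(
        f"{name}: {solved_instances(poc_rate)} PoCs generated, "
        f"{solved_instances(patch_rate)} vulnerabilities patched"
    )
```

Under that assumption, the top PoC score of 18.0% corresponds to 36 of 200 instances, and the top patching score of 34.0% corresponds to 68.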