Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis, PoC Generation, Patch Validation, Vulnerability Reasoning

SEC-bench

Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

View Paper
Quick Stats

Top Score

18.0%

Models Evaluated

3

Dataset Size

200 samples

Last Updated

June 13, 2025

Availability

Dataset ✓, Code ✓
Metrics Tracked
PoC success rate, patching success rate
Sources
Project, Code
Dataset Information

200 verified real-world CVE instances in open-source C/C++ projects with reproducible PoCs and gold patches, generated automatically by a multi-agent scaffold (Preprocessor → Verifier → Evaluator) at a cost of $0.87 per instance.
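As a rough worked example, the per-instance cost above implies a total generation cost for the dataset. This sketch assumes the $0.87 figure is an average that applies uniformly to all 200 verified instances; the page does not say whether failed or discarded attempts are included, and the function name is illustrative only.

```python
from decimal import Decimal

# Figures taken from the dataset description above.
NUM_INSTANCES = 200
COST_PER_INSTANCE_USD = Decimal("0.87")


def total_generation_cost(num_instances: int, cost_per_instance: Decimal) -> Decimal:
    """Total cost, assuming the per-instance cost is a uniform average (an assumption)."""
    return num_instances * cost_per_instance


total = total_generation_cost(NUM_INSTANCES, COST_PER_INSTANCE_USD)
print(f"Estimated total cost: ${total:.2f}")
```

Decimal is used instead of float so that the currency arithmetic stays exact.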

Number of Tasks

3

PoC Generation, Vulnerability Patching, Vulnerability Reproduction
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | PoC success rate | Patching success rate | Evaluated By | Date | Source
1st | OpenHands + Claude 3.7 Sonnet (openhands-claude-3.7-sonnet, Anthropic) | 18.0% | 34.0% | SEC-bench team | May 12, 2026 | Link
2nd | SWE-agent + Claude 3.7 Sonnet (swe-agent-claude-3.7-sonnet, Anthropic) | 12.5% | 31.5% | SEC-bench team | May 12, 2026 | Link
3rd | Aider + Claude 3.7 Sonnet (aider-claude-3.7-sonnet, Anthropic) | 3.0% | 23.5% | SEC-bench team | May 12, 2026 | Link
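To make the percentages in the results above concrete, the following sketch converts each success rate into an implied number of solved instances. It assumes every rate uses the full 200-sample dataset as its denominator, which this page does not confirm; the names `RESULTS` and `implied_count` are illustrative only.

```python
# Dataset size from the Quick Stats section.
DATASET_SIZE = 200

# (PoC success rate %, patching success rate %) per agent, copied from the results.
RESULTS = {
    "OpenHands + Claude 3.7 Sonnet": (18.0, 34.0),
    "SWE-agent + Claude 3.7 Sonnet": (12.5, 31.5),
    "Aider + Claude 3.7 Sonnet": (3.0, 23.5),
}


def implied_count(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Instances solved, assuming the rate is computed over `total` samples."""
    return round(rate_percent * total / 100)


for model, (poc_rate, patch_rate) in RESULTS.items():
    print(f"{model}: {implied_count(poc_rate)} PoCs, {implied_count(patch_rate)} patches")
```

Under that assumption, the top PoC score of 18.0% corresponds to 36 of 200 instances.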
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub