Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Vulnerability Analysis, Patch Validation, Vulnerability Reasoning

VulnRepairEval

An exploit-based framework for assessing whether LLM-generated patches truly remediate real vulnerabilities rather than only looking plausible in text.

Quick Stats

Top Score

21.7%

Models Evaluated

12

Dataset Size

23 samples

Last Updated

September 3, 2025

Availability

Dataset ✗, Code ✗
Metrics Tracked
repair success-rate, patch correctness-rate, composite performance-score
Dataset Information

23 curated Python CVEs with working proof-of-concept exploits, evaluated in a containerized differential pipeline where a patch only succeeds if the exploit no longer works.
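
The differential idea described above can be illustrated with a minimal in-process sketch. This is not the benchmark's pipeline: the real evaluation runs actual CVE exploits inside containers, whereas the toy path-traversal bug, the `/srv/data` base directory, and the names `vulnerable_resolve`, `patched_resolve`, `exploit_works`, and `differential_verdict` are all hypothetical stand-ins chosen for illustration.

```python
import posixpath
from typing import Callable

# A "target" stands in for one build of the vulnerable program.
Target = Callable[[str], str]

BASE = "/srv/data"  # hypothetical base directory for the toy example


def vulnerable_resolve(name: str) -> str:
    """Toy pre-patch code: joins user input to BASE with no containment check."""
    return posixpath.normpath(posixpath.join(BASE, name))


def patched_resolve(name: str) -> str:
    """Toy candidate patch: rejects any path that leaves BASE."""
    path = posixpath.normpath(posixpath.join(BASE, name))
    if not path.startswith(BASE + "/"):
        raise ValueError("path escapes base directory")
    return path


def exploit_works(target: Target) -> bool:
    """Toy proof-of-concept: succeeds if the resolved path escapes BASE."""
    try:
        return not target("../../etc/passwd").startswith(BASE + "/")
    except ValueError:
        return False  # the target refused the malicious input


def differential_verdict(before: Target, after: Target) -> str:
    """Compare exploit behaviour on the pre-patch and post-patch targets."""
    if not exploit_works(before):
        return "invalid-baseline"  # exploit must work before patching
    return "fail" if exploit_works(after) else "pass"


if __name__ == "__main__":
    print(differential_verdict(vulnerable_resolve, patched_resolve))
    print(differential_verdict(vulnerable_resolve, vulnerable_resolve))
```

The baseline check matters in this sketch: if the exploit does not fire against the unpatched target, a "blocked" exploit after patching says nothing about the patch itself.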

Number of Tasks

3

Exploit Verified Repair, Vulnerability Localization, Patch Generation
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Evaluated by the VulnRepairEval authors; all results dated September 3, 2025.

Rank  Model             Model ID             Provider  repair success-rate  patch correctness-rate  composite performance-score
1     Gemini 2.5 Pro    gemini-2.5-pro       Google    21.7%                30.4%                   22.6%
2     DeepSeek-R1       deepseek-r1-0528     DeepSeek  17.4%                17.4%                   15.2%
3     DeepSeek V3       deepseek-v3-671b     DeepSeek  13.0%                4.3%                    5.2%
4     Gemini 2.5 Flash  gemini-2.5-flash     Google    8.7%                 21.7%                   8.9%
5     Qwen 3            qwen3-235b-thinking  Alibaba   8.7%                 13.0%                   8.6%
6     Gemini 2.0 Flash  gemini-2.0-flash     Google    8.7%                 56.5%                   10.4%
7     GPT o4-mini       gpt-o4-mini          OpenAI    4.3%                 4.3%                    3.1%
8     GPT-3.5           gpt-3.5-turbo-1106   OpenAI    4.3%                 0.0%                    0.0%
9     Qwen 3            qwen3-8b-thinking    Alibaba   0.0%                 0.0%                    0.0%
10    GPT-4o            gpt-4o               OpenAI    0.0%                 0.0%                    0.0%
11    Qwen 3            qwen3-8b             Alibaba   0.0%                 4.3%                    0.0%
12    Qwen 3            qwen3-235b           Alibaba   0.0%                 13.0%                   0.0%
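
With only 23 samples, each percentage point in the results corresponds to very few CVEs. The sketch below converts a reported rate back into an implied sample count. It assumes that repair success-rate and patch correctness-rate are simple fractions of the 23 samples, which this page does not state explicitly; the helper name `implied_count` is hypothetical. The composite performance-score is left out because it is evidently not a plain fraction of 23.

```python
DATASET_SIZE = 23  # "23 samples" from Quick Stats


def implied_count(percent: float, total: int = DATASET_SIZE) -> int:
    """Nearest whole number of samples matching a reported percentage.

    Assumption: the rate is count / total, rounded to one decimal place.
    """
    return round(percent / 100 * total)


def consistent(percent: float, total: int = DATASET_SIZE) -> bool:
    """Check that the implied count reproduces the reported percentage."""
    return round(100 * implied_count(percent, total) / total, 1) == percent


if __name__ == "__main__":
    # Distinct repair success-rate and patch correctness-rate values reported above.
    for pct in (56.5, 30.4, 21.7, 17.4, 13.0, 8.7, 4.3, 0.0):
        print(f"{pct:>5.1f}% -> {implied_count(pct)} of {DATASET_SIZE}"
              f" (consistent: {consistent(pct)})")
```

Under that assumption, the top repair success-rate of 21.7% corresponds to 5 of 23 samples, and the three models tied at 8.7% each correspond to 2 of 23, so rank differences in the middle of the list can hinge on a single CVE.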
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub