Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
vulnerability-analysis · patch-validation · vulnerability-reasoning

VulnRepairEval

An exploit-based framework for assessing whether LLM-generated patches actually remediate real vulnerabilities rather than merely looking plausible in text.

View Paper · Compare Models
Quick Stats
  • Top Score: 21.7%
  • Models Evaluated: 12
  • Dataset Size: 23 samples
  • Last Updated: September 3, 2025

Paper Details
  • Title: VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities
  • Authors: Weizhe Wang, Wei Ma, Qiang Hu, and 6 more
  • Published: September 3, 2025
  • arXiv ID: 2509.03331
Metrics Tracked
  • repair-success-rate
  • patch-correctness-rate
  • composite-performance-score
Availability
  • Dataset Available: No
  • Code Available: No
Dataset Information

23 curated Python CVEs with working proof-of-concept exploits, evaluated in a containerized differential pipeline where a patch only succeeds if the exploit no longer works.
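The differential check described above can be sketched as follows. This is a minimal simulation of my own, not the benchmark's actual harness: the real pipeline runs containerized CVE reproductions, whereas here "code" is a plain callable and the PoC exploit is a predicate on it. All function names (`differential_check`, `vulnerable_read`, etc.) are illustrative.

```python
# Hypothetical sketch of exploit-based differential validation: a candidate
# patch counts as a success only if the PoC exploit triggers against the
# original code AND no longer triggers after the patch is applied.

def differential_check(original, patched, exploit) -> bool:
    """True iff the exploit succeeds pre-patch and fails post-patch."""
    if not exploit(original):    # sanity check: the PoC must reproduce the bug
        return False             # otherwise the sample itself is invalid
    return not exploit(patched)  # success only if the patch blocks the PoC

# Toy stand-in for a real CVE: a path-traversal-style bug and a patch.
def vulnerable_read(path: str) -> str:
    return f"contents of {path}"          # no input validation

def patched_read(path: str) -> str:
    if ".." in path:                      # patch: reject traversal sequences
        raise ValueError("traversal rejected")
    return f"contents of {path}"

def exploit(read_fn) -> bool:
    try:
        read_fn("../../etc/passwd")       # PoC: escape the intended directory
        return True                       # exploit succeeded
    except ValueError:
        return False                      # exploit was blocked

print(differential_check(vulnerable_read, patched_read, exploit))  # True
```

Note the first branch: a textually plausible patch applied to code the exploit never reproduced against would not count, which is the point of exploit-based rather than text-based evaluation.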

Task Types

exploit-verified-repair · vulnerability-localization · patch-generation

Dataset Size

23 samples

Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
| Rank | Model | Model ID | Org | Repair Success Rate | Patch Correctness Rate | Composite Score |
|------|-------|----------|-----|---------------------|------------------------|-----------------|
| 1 | Gemini 2.5 Pro | gemini-2.5-pro | Google | 21.7% | 30.4% | 22.6% |
| 2 | DeepSeek-R1 | deepseek-r1-0528 | DeepSeek | 17.4% | 17.4% | 15.2% |
| 3 | DeepSeek V3 | deepseek-v3-671b | DeepSeek | 13.0% | 4.3% | 5.2% |
| 4 | Gemini 2.5 Flash | gemini-2.5-flash | Google | 8.7% | 21.7% | 8.9% |
| 5 | Qwen 3 | qwen3-235b-thinking | Alibaba | 8.7% | 13.0% | 8.6% |
| 6 | Gemini 2.0 Flash | gemini-2.0-flash | Google | 8.7% | 56.5% | 10.4% |
| 7 | GPT o4-mini | gpt-o4-mini | OpenAI | 4.3% | 4.3% | 3.1% |
| 8 | GPT-3.5 | gpt-3.5-turbo-1106 | OpenAI | 4.3% | 0.0% | 0.0% |
| 9 | Qwen 3 | qwen3-8b-thinking | Alibaba | 0.0% | 0.0% | 0.0% |
| 10 | GPT-4o | gpt-4o | OpenAI | 0.0% | 0.0% | 0.0% |
| 11 | Qwen 3 | qwen3-8b | Alibaba | 0.0% | 4.3% | 0.0% |
| 12 | Qwen 3 | qwen3-235b | Alibaba | 0.0% | 13.0% | 0.0% |

All results were evaluated by the VulnRepairEval authors on September 3, 2025.
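The percentages in the leaderboard are consistent with simple fractions over the 23-sample dataset; for example, the 21.7% top repair success rate corresponds to 5 of 23 exploits blocked. A small sketch (the `rate` helper is my own, not from the benchmark's code):

```python
# Convert a per-sample success count into the percentage shown on the page.
# Assumption: each tracked metric is a plain fraction over the 23 samples.

DATASET_SIZE = 23

def rate(successes: int, total: int = DATASET_SIZE) -> float:
    """Success count as a percentage, rounded to one decimal place."""
    return round(100 * successes / total, 1)

print(rate(5))  # 21.7 -> e.g. Gemini 2.5 Pro's repair success rate
print(rate(4))  # 17.4
print(rate(2))  # 8.7
```

With only 23 samples, a single additional solved case moves a score by about 4.3 points, so small gaps between adjacent ranks should be read cautiously.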