
An exploit-based framework for assessing whether LLM-generated patches truly remediate real vulnerabilities rather than only looking plausible in text.
Top Score: 21.7% (repair success rate)
Models Evaluated: 12
Dataset Size: 23 samples
Last Updated: September 3, 2025
Title: VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities
Authors: Weizhe Wang, Wei Ma, Qiang Hu, and 6 more
Published: September 3, 2025
arXiv ID: 2509.03331

23 curated Python CVEs with working proof-of-concept exploits, evaluated in a containerized differential pipeline where a patch only succeeds if the exploit no longer works.
Tasks: exploit-verified-repair, vulnerability-localization, patch-generation
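
The differential idea in the description (a patch counts only if the exploit stops working) can be illustrated with a toy sketch. This is not the VulnRepairEval harness: containers are replaced by plain Python callables, and every name here (`vulnerable_resolve`, `patched_resolve`, `exploit_succeeds`, `differential_check`, `Verdict`) is hypothetical. It also assumes that "differential" means the exploit must first succeed against the unpatched code, which the page does not spell out.

```python
import posixpath
from dataclasses import dataclass
from typing import Callable

Resolver = Callable[[str, str], str]


def vulnerable_resolve(base: str, user_path: str) -> str:
    # Toy vulnerable code: joins paths without checking for traversal.
    return posixpath.normpath(posixpath.join(base, user_path))


def patched_resolve(base: str, user_path: str) -> str:
    # Toy candidate patch: reject any path that leaves the base directory.
    resolved = posixpath.normpath(posixpath.join(base, user_path))
    if resolved != base and not resolved.startswith(base + "/"):
        raise ValueError("path escapes base directory")
    return resolved


def exploit_succeeds(resolve: Resolver) -> bool:
    # Toy proof-of-concept: True if the payload escapes /srv/data.
    try:
        resolved = resolve("/srv/data", "../../etc/passwd")
    except ValueError:
        return False
    return not resolved.startswith("/srv/data/")


@dataclass
class Verdict:
    exploit_works_before: bool
    exploit_works_after: bool

    @property
    def repaired(self) -> bool:
        # Assumption: a repair requires a working exploit on the original
        # code and a failing exploit on the patched code.
        return self.exploit_works_before and not self.exploit_works_after


def differential_check(original: Resolver, candidate: Resolver) -> Verdict:
    # Run the same exploit against both versions and compare outcomes.
    return Verdict(exploit_succeeds(original), exploit_succeeds(candidate))


good = differential_check(vulnerable_resolve, patched_resolve)
noop = differential_check(vulnerable_resolve, vulnerable_resolve)
print(good.repaired)  # True: the exploit worked before and fails after
print(noop.repaired)  # False: a no-op "patch" leaves the exploit working
```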
| Rank | Model | Model ID | Provider | Repair Success Rate | Patch Correctness Rate | Composite Performance Score | Evaluated By | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | gemini-2.5-pro | Google | 21.7% | 30.4% | 22.6% | VulnRepairEval authors | September 3, 2025 |
| 2 | DeepSeek-R1 | deepseek-r1-0528 | DeepSeek | 17.4% | 17.4% | 15.2% | VulnRepairEval authors | September 3, 2025 |
| 3 | DeepSeek V3 | deepseek-v3-671b | DeepSeek | 13.0% | 4.3% | 5.2% | VulnRepairEval authors | September 3, 2025 |
| 4 | Gemini 2.5 Flash | gemini-2.5-flash | Google | 8.7% | 21.7% | 8.9% | VulnRepairEval authors | September 3, 2025 |
| 5 | Qwen 3 | qwen3-235b-thinking | Alibaba | 8.7% | 13.0% | 8.6% | VulnRepairEval authors | September 3, 2025 |
| 6 | Gemini 2.0 Flash | gemini-2.0-flash | Google | 8.7% | 56.5% | 10.4% | VulnRepairEval authors | September 3, 2025 |
| 7 | GPT o4-mini | gpt-o4-mini | OpenAI | 4.3% | 4.3% | 3.1% | VulnRepairEval authors | September 3, 2025 |
| 8 | GPT-3.5 | gpt-3.5-turbo-1106 | OpenAI | 4.3% | 0.0% | 0.0% | VulnRepairEval authors | September 3, 2025 |
| 9 | Qwen 3 | qwen3-8b-thinking | Alibaba | 0.0% | 0.0% | 0.0% | VulnRepairEval authors | September 3, 2025 |
| 10 | GPT-4o | gpt-4o | OpenAI | 0.0% | 0.0% | 0.0% | VulnRepairEval authors | September 3, 2025 |
| 11 | Qwen 3 | qwen3-8b | Alibaba | 0.0% | 4.3% | 0.0% | VulnRepairEval authors | September 3, 2025 |
| 12 | Qwen 3 | qwen3-235b | Alibaba | 0.0% | 13.0% | 0.0% | VulnRepairEval authors | September 3, 2025 |
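
With only 23 samples, each percentage in the table corresponds to a small absolute count. The sketch below recovers the implied count under the assumption, not stated on the page, that the repair success and patch correctness rates are simple fractions of the 23 samples rounded to one decimal place. The helper name `implied_count` is hypothetical.

```python
DATASET_SIZE = 23  # dataset size reported on the page


def implied_count(rate_percent: float, total: int = DATASET_SIZE):
    """Return k such that 100 * k / total rounds to rate_percent, else None."""
    for k in range(total + 1):
        if abs(round(100 * k / total, 1) - rate_percent) < 1e-9:
            return k
    return None


print(implied_count(21.7))  # 5  (top repair success rate)
print(implied_count(56.5))  # 13 (highest patch correctness rate)
print(implied_count(22.6))  # None: the composite score is not a k/23 fraction
```

Under that assumption, the top repair success rate of 21.7% corresponds to 5 of the 23 CVEs, and one additional fix moves a model by about 4.3 percentage points.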