vulnerability analysis, cve-exploitation, poc-generation, web-exploitation

CVE-Bench

A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Quick Stats

Top Score

0.0%

Models Evaluated

Dataset Size

N/A samples

Last Updated

June 24, 2025

Paper Details

Title

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Authors

Yuxuan Zhu, Antony Kellermann, Dylan Bowman

+13 more

Published

June 24, 2025

arXiv ID

2503.17332

Metrics Tracked

exploitation success-rate, vulnerability coverage

Availability

Dataset Available: Yes

Code Available: Yes

Dataset Information

A real-world cybersecurity benchmark built on critical-severity CVEs, with a sandbox framework that mimics real-world conditions

Number of Tasks

cve-exploitation, web-app-attacks, sandbox-exploitation

Dataset Size

N/A samples

Model Results

Detailed scores for each model evaluated on this benchmark

No results yet

