Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
Back to Benchmarks
vulnerability-analysis · poc-generation · patch-validation · vulnerability-reasoning

SEC-bench

Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

View Paper | Compare Models
Quick Stats

Top Score

0.0%

Models Evaluated

0

Dataset Size

N/A samples

Last Updated

October 22, 2025

Paper Details

Title

SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks

Authors

Hwiwon Lee, Ziqi Zhang, Hanxiao Lu

+1 more

Published

October 22, 2025

arXiv ID

2506.11791
Metrics Tracked
PoC success rate · patching success rate
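Both tracked metrics are per-task success fractions. The sketch below is a hypothetical illustration of how such rates could be computed from per-task outcomes; the record fields (`poc_ok`, `patch_ok`) and task names are assumptions for this example, not SEC-bench's actual schema.

```python
# Hypothetical per-task results: did the generated PoC reproduce the
# vulnerability, and did the generated patch fix it? (Illustrative data.)
results = [
    {"task": "task-1", "poc_ok": True,  "patch_ok": False},
    {"task": "task-2", "poc_ok": False, "patch_ok": False},
    {"task": "task-3", "poc_ok": True,  "patch_ok": True},
]

def rate(results, key):
    """Fraction of tasks where the given outcome succeeded."""
    return sum(r[key] for r in results) / len(results)

poc_success_rate = rate(results, "poc_ok")
patching_success_rate = rate(results, "patch_ok")
print(f"PoC success rate:      {poc_success_rate:.1%}")
print(f"Patching success rate: {patching_success_rate:.1%}")
```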
Availability
Dataset Available: Yes
Code Available: Yes
Dataset Information

A fully automated benchmarking framework with a multi-agent scaffold for constructing code repositories, reproducing vulnerabilities, and generating gold patches

Task Types

poc-generation · vulnerability-patching · vulnerability-reproduction

Dataset Size

N/A samples

Model Results
Detailed scores for each model evaluated on this benchmark

No results yet

Be the first to submit results for this benchmark!

Submit Results