Penetration Testing, Automated Pentesting, Exploit Generation, Vulnerability Discovery

AutoPenBench

Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems

Quick Stats

Top Score

64.0%

Models Evaluated

Dataset Size

33 samples

Last Updated

October 4, 2024

Availability

Dataset ✓, Code ✓

Metrics Tracked

success rate

Sources

Project Leaderboard

Dataset Information

33 vulnerable systems of increasing difficulty, covering both in-vitro and real-world scenarios, with agent progress evaluated against generic and specific milestones

Number of Tasks

Vulnerability Exploitation, Privilege Escalation, Network Penetration, Web Exploitation

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	Success Rate	Evaluated By	Date
1st	AutoPenBench Assisted Agent (gpt-4o-assisted, OpenAI)	64.0%	AutoPenBench authors	October 28, 2024
2nd	AutoPenBench Autonomous Agent (gpt-4o-autonomous, OpenAI)	21.0%	AutoPenBench authors	October 28, 2024
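
As a rough sanity check on the reported scores, the sketch below assumes that success rate means the number of solved tasks divided by the 33 tasks in the dataset, rounded to a whole percent. That definition, the helper names `success_rate` and `tasks_consistent_with`, and the solved-task counts they produce are illustrative assumptions inferred from the figures on this page, not values published by the AutoPenBench authors.

```python
TOTAL_TASKS = 33  # dataset size stated on this page


def success_rate(outcomes):
    """Return the percentage of solved tasks in a list of booleans."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def tasks_consistent_with(rate_percent, total=TOTAL_TASKS):
    """List solved-task counts whose whole-percent rate equals rate_percent.

    Assumption: the leaderboard rounds solved/total to the nearest percent.
    """
    target = round(rate_percent)
    return [k for k in range(total + 1) if round(100.0 * k / total) == target]


if __name__ == "__main__":
    for label, rate in [("assisted", 64.0), ("autonomous", 21.0)]:
        print(label, rate, tasks_consistent_with(rate))
```

Under these assumptions, each reported percentage maps to a single whole number of solved tasks out of 33, which suggests the scores are per-task pass counts rather than averaged partial credit. The milestone-based evaluation mentioned in the dataset description is not modeled here.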