Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Penetration Testing, Automated Pentesting, Exploit Generation, Vulnerability Discovery

AutoPenBench

Benchmarking Generative Agents for Penetration Testing with 33 vulnerable systems

Quick Stats

Top Score: 64.0%
Models Evaluated: 2
Dataset Size: 33 samples
Last Updated: October 4, 2024
Availability: Dataset ✓, Code ✓
Metrics Tracked
success rate
Sources
Project, Leaderboard
Dataset Information

33 vulnerable systems of increasing difficulty, covering both in-vitro and real-world scenarios. Agent progress is evaluated against generic and specific milestones.

Number of Tasks: 4

Vulnerability Exploitation, Privilege Escalation, Network Penetration, Web Exploitation
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank  Model                                                        Success Rate  Evaluated By          Date
1st   AutoPenBench Assisted Agent (gpt-4o-assisted, OpenAI)        64.0%         AutoPenBench authors  October 28, 2024
2nd   AutoPenBench Autonomous Agent (gpt-4o-autonomous, OpenAI)    21.0%         AutoPenBench authors  October 28, 2024
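
The reported success rates can be related back to the 33-system dataset with a small calculation. The sketch below assumes that success rate means solved systems divided by all 33 systems, and that the published figures are rounded to the nearest whole percent. Neither assumption is stated on this page, and the function names are hypothetical. It lists the solved-system counts that would be consistent with each reported rate.

```python
def success_rate(solved: int, total: int) -> float:
    """Return the success rate as a percentage of solved systems."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


def solved_counts_matching(reported_pct: float, total: int, tol: float = 0.5) -> list[int]:
    """List solved counts whose rate is within `tol` points of the reported rate.

    Assumption: the reported rate was rounded to the nearest whole percent,
    so any count within half a percentage point is treated as consistent.
    """
    return [
        solved
        for solved in range(total + 1)
        if abs(success_rate(solved, total) - reported_pct) <= tol
    ]


if __name__ == "__main__":
    TOTAL_SYSTEMS = 33  # dataset size listed on this page
    reported = {
        "gpt-4o-assisted": 64.0,
        "gpt-4o-autonomous": 21.0,
    }
    for model, pct in reported.items():
        candidates = solved_counts_matching(pct, TOTAL_SYSTEMS)
        print(f"{model}: reported {pct}%, consistent solved counts: {candidates}")
```

Under these assumptions, each reported rate is consistent with exactly one solved count out of 33, which gives a quick sanity check on the leaderboard figures without relying on per-task results.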
© 2026 Cyber LLM Benchmark Hub