Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
penetration testing, automated-pentesting, reconnaissance, web-exploitation

CyberExplorer

Benchmarking LLM offensive security capabilities in an open-environment attacking simulation with realistic reconnaissance, target selection, and exploitation.

Quick Stats

Top Score

42.5%

Models Evaluated

5

Dataset Size

40 samples

Last Updated

February 11, 2026

Paper Details

Title

CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

Authors

Nanda Rani, Kimberly Milner, Minghao Shao

+9 more

Published

February 11, 2026

arXiv ID

2602.08023
Metrics Tracked
flag found-rate, precision, recall
Availability
Dataset Available: No
Code Available: No
Dataset Information

An open-environment benchmark hosted on a VM with 40 vulnerable web services derived from real-world CTF challenges, where agents must autonomously discover and exploit targets without prior vulnerability location hints.

Number of Tasks

reconnaissance, target-selection, web-exploitation, multi-agent-offense

Dataset Size

40 samples

Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model                                         | flag found-rate | precision | recall | Evaluated By          | Date
1    | Qwen 3 (qwen3 • Alibaba)                      | 42.5%           | 17.6%     | 7.5%   | CyberExplorer authors | February 11, 2026
2    | Gemini 3 Pro (gemini-3-pro • Google)          | 27.5%           | 81.8%     | 22.5%  | CyberExplorer authors | February 11, 2026
3    | Claude 4.5 Opus (claude-opus-4-5 • Anthropic) | 25.0%           | 90.0%     | 22.5%  | CyberExplorer authors | February 11, 2026
4    | GPT-5.2 (gpt-5.2 • OpenAI)                    | 25.0%           | 60.0%     | 15.0%  | CyberExplorer authors | February 11, 2026
5    | DeepSeek V3 (deepseek-v3-671b • DeepSeek)     | 20.0%           | 62.5%     | 12.5%  | CyberExplorer authors | February 11, 2026
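
To make the flag found-rate percentages easier to read against the 40-service dataset, the sketch below converts each rate into an implied number of services. It assumes the rate is computed over all 40 services, which this page does not state explicitly; the `implied_count` helper and the `DATASET_SIZE` constant are illustrative names, not part of the benchmark.

```python
# Flag found-rates copied from the results table (in percent).
# ASSUMPTION: the denominator is the full dataset of 40 services.
DATASET_SIZE = 40

flag_found_rate = {
    "Qwen 3": 42.5,
    "Gemini 3 Pro": 27.5,
    "Claude 4.5 Opus": 25.0,
    "GPT-5.2": 25.0,
    "DeepSeek V3": 20.0,
}


def implied_count(rate_percent: float, total: int = DATASET_SIZE) -> int:
    """Convert a percentage into the nearest whole number of services."""
    return round(rate_percent / 100 * total)


implied = {model: implied_count(rate) for model, rate in flag_found_rate.items()}

for model, count in implied.items():
    print(f"{model}: {count}/{DATASET_SIZE}")
```

Under that assumption, the top score of 42.5% corresponds to 17 of 40 services, and the lowest score of 20.0% corresponds to 8 of 40.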