Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
Penetration Testing, Automated Pentesting, Reconnaissance, Web Exploitation

CyberExplorer

Benchmarks the offensive security capabilities of LLMs in an open-environment attack simulation with realistic reconnaissance, target selection, and exploitation.

Quick Stats

Top Score

42.5%

Models Evaluated

5

Dataset Size

40 samples

Last Updated

February 8, 2026

Availability

Dataset ✗, Code ✗
Metrics Tracked
flag found-rate, precision, recall
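
The page lists these three metrics without defining them. As a minimal sketch, assuming the standard definitions of precision and recall and assuming flag found-rate means captured flags divided by available flags, they could be computed as follows. The `RunOutcome` class, its field names, and the example numbers are hypothetical and are not taken from the CyberExplorer paper.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    """Hypothetical tally for one agent run (illustrative field names)."""
    total_flags: int       # flags available in the environment
    flags_found: int       # flags the agent captured
    true_positives: int    # reported findings that were correct
    false_positives: int   # reported findings that were wrong
    false_negatives: int   # real findings the agent never reported


def flag_found_rate(o: RunOutcome) -> float:
    # Assumed definition: captured flags over available flags.
    return o.flags_found / o.total_flags if o.total_flags else 0.0


def precision(o: RunOutcome) -> float:
    # Standard definition: TP / (TP + FP); 0.0 when nothing was reported.
    reported = o.true_positives + o.false_positives
    return o.true_positives / reported if reported else 0.0


def recall(o: RunOutcome) -> float:
    # Standard definition: TP / (TP + FN); 0.0 when there was nothing to find.
    relevant = o.true_positives + o.false_negatives
    return o.true_positives / relevant if relevant else 0.0


# Made-up numbers, chosen only to exercise the functions.
example = RunOutcome(total_flags=8, flags_found=2,
                     true_positives=3, false_positives=1, false_negatives=5)
print(f"{flag_found_rate(example):.1%} {precision(example):.1%} {recall(example):.1%}")
```

Under these definitions the three numbers measure different things and can diverge, so a run may capture many flags while most of its reported findings are wrong.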
Dataset Information

An open-environment benchmark hosted on a VM with 40 vulnerable web services derived from real-world CTF challenges, where agents must autonomously discover and exploit targets without prior vulnerability location hints.

Number of Tasks

4

Reconnaissance, Target Selection, Web Exploitation, Multi Agent Offense
Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model | flag found-rate | precision | recall | Evaluated By | Date | Source
1 | Qwen 3 (qwen3 • Alibaba) | 42.5% | 17.6% | 7.5% | CyberExplorer authors | February 11, 2026 | Link
2 | Gemini 3 Pro (gemini-3-pro • Google) | 27.5% | 81.8% | 22.5% | CyberExplorer authors | February 11, 2026 | Link
3 | Claude 4.5 Opus (claude-opus-4-5 • Anthropic) | 25.0% | 90.0% | 22.5% | CyberExplorer authors | February 11, 2026 | Link
4 | GPT-5.2 (gpt-5.2 • OpenAI) | 25.0% | 60.0% | 15.0% | CyberExplorer authors | February 11, 2026 | Link
5 | DeepSeek V3 (deepseek-v3-671b • DeepSeek) | 20.0% | 62.5% | 12.5% | CyberExplorer authors | February 11, 2026 | Link
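
The flag found-rate percentages can be read back as whole counts if one assumes each rate is measured against the 40 services listed under Dataset Size, with one flag per service. The page does not state that denominator, so the sketch below is only a reading aid under that assumption; `implied_count` and `SERVICES` are hypothetical names.

```python
def implied_count(percent: float, total: int) -> int:
    """Convert a percentage of `total` into the nearest whole count."""
    return round(percent * total / 100)


# ASSUMPTION: rates are out of the 40 services listed as the dataset size.
SERVICES = 40

# Flag found-rate values copied from the results on this page.
found_rates = {
    "Qwen 3": 42.5,
    "Gemini 3 Pro": 27.5,
    "Claude 4.5 Opus": 25.0,
    "GPT-5.2": 25.0,
    "DeepSeek V3": 20.0,
}

for model, pct in found_rates.items():
    print(f"{model}: {implied_count(pct, SERVICES)} of {SERVICES}")
```

If the assumption holds, the top score of 42.5% would correspond to 17 of 40 flags. Precision is left out because its denominator (the number of findings each model reported) is not given on the page.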
Cyber LLM Benchmark Hub

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub