
A benchmark of LLM offensive security capabilities in an open-environment attack simulation that requires realistic reconnaissance, target selection, and exploitation.
Top Score
42.5%
Models Evaluated
5
Dataset Size
40 samples
Last Updated
February 11, 2026
Title
CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment
Authors
Nanda Rani, Kimberly Milner, Minghao Shao, and 9 more
Published
February 11, 2026
arXiv ID
2602.08023

An open-environment benchmark hosted on a VM with 40 vulnerable web services derived from real-world CTF challenges, where agents must autonomously discover and exploit targets without prior vulnerability location hints.
Number of Tasks
reconnaissance, target-selection, web-exploitation, multi-agent-offense
| Rank | Model | Identifier | Provider | Flag Found-Rate | Precision | Recall | Evaluated By | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 | qwen3 | Alibaba | 42.5% | 17.6% | 7.5% | CyberExplorer authors | February 11, 2026 |
| 2 | Gemini 3 Pro | gemini-3-pro | Google | 27.5% | 81.8% | 22.5% | CyberExplorer authors | February 11, 2026 |
| 3 | Claude 4.5 Opus | claude-opus-4-5 | Anthropic | 25.0% | 90.0% | 22.5% | CyberExplorer authors | February 11, 2026 |
| 4 | GPT-5.2 | gpt-5.2 | OpenAI | 25.0% | 60.0% | 15.0% | CyberExplorer authors | February 11, 2026 |
| 5 | DeepSeek V3 | deepseek-v3-671b | DeepSeek | 20.0% | 62.5% | 12.5% | CyberExplorer authors | February 11, 2026 |
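
The flag found-rates in the table are easier to compare as raw counts. The sketch below converts each listed percentage into the number of flags it would imply, assuming every rate is computed over the full 40-sample dataset. That denominator is an assumption taken from the listed dataset size, not something this page states outright, and `implied_flag_count` is a hypothetical helper written for this illustration. The precision and recall columns are left out because their definitions are not given here.

```python
# Flag found-rates (percent) copied from the leaderboard table, keyed by model identifier.
FOUND_RATE_PERCENT = {
    "qwen3": 42.5,
    "gemini-3-pro": 27.5,
    "claude-opus-4-5": 25.0,
    "gpt-5.2": 25.0,
    "deepseek-v3-671b": 20.0,
}

# Assumption: each rate uses the full dataset of 40 samples as its denominator.
TOTAL_TARGETS = 40


def implied_flag_count(rate_percent: float, total: int = TOTAL_TARGETS) -> int:
    """Convert a percentage found-rate into the whole number of flags it implies."""
    count = rate_percent / 100 * total
    rounded = round(count)
    # Reject rates that do not correspond to a whole number of targets.
    if abs(count - rounded) > 1e-6:
        raise ValueError(f"{rate_percent}% of {total} is not a whole number")
    return rounded


implied_counts = {
    model: implied_flag_count(rate) for model, rate in FOUND_RATE_PERCENT.items()
}

for model, count in implied_counts.items():
    print(f"{model}: {count}/{TOTAL_TARGETS} flags")
```

Under that assumption, the gap between the first and second entries is six flags, and the third and fourth entries are tied on flags found and differ only in precision and recall.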