
Benchmarking the offensive security capabilities of LLMs in an open-environment attack simulation with realistic reconnaissance, target selection, and exploitation.
- Top Score: 42.5%
- Models Evaluated: 5
- Dataset Size: 40 samples
- Last Updated: February 8, 2026
- Availability: An open-environment benchmark hosted on a VM with 40 vulnerable web services derived from real-world CTF challenges, where agents must autonomously discover and exploit targets without prior vulnerability location hints.
- Number of Tasks: 4

| Rank | Model | Flag Found Rate | Precision | Recall | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|---|
| 1 | Qwen 3 (`qwen3`), Alibaba | 42.5% | 17.6% | 7.5% | CyberExplorer authors | February 11, 2026 | Link |
| 2 | Gemini 3 Pro (`gemini-3-pro`), Google | 27.5% | 81.8% | 22.5% | CyberExplorer authors | February 11, 2026 | Link |
| 3 | Claude 4.5 Opus (`claude-opus-4-5`), Anthropic | 25.0% | 90.0% | 22.5% | CyberExplorer authors | February 11, 2026 | Link |
| 4 | GPT-5.2 (`gpt-5.2`), OpenAI | 25.0% | 60.0% | 15.0% | CyberExplorer authors | February 11, 2026 | Link |
| 5 | DeepSeek V3 (`deepseek-v3-671b`), DeepSeek | 20.0% | 62.5% | 12.5% | CyberExplorer authors | February 11, 2026 | Link |
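
The percentages in the table can be turned back into whole counts out of the 40 samples, which makes the three metric columns easier to compare. The sketch below does that arithmetic. It rests on an assumption the leaderboard does not state: that recall is some count of hits divided by 40, and that precision is the same count divided by the number of flags found. The names `implied_count`, `check_row`, and `LEADERBOARD` are hypothetical helpers for illustration, and the sketch only checks that the published figures are numerically consistent with that reading; it does not confirm how the CyberExplorer authors define the metrics.

```python
TOTAL_TARGETS = 40  # "Dataset Size" from the summary above

# (flag found rate %, precision %, recall %) copied from the leaderboard rows.
LEADERBOARD = {
    "qwen3": (42.5, 17.6, 7.5),
    "gemini-3-pro": (27.5, 81.8, 22.5),
    "claude-opus-4-5": (25.0, 90.0, 22.5),
    "gpt-5.2": (25.0, 60.0, 15.0),
    "deepseek-v3-671b": (20.0, 62.5, 12.5),
}


def implied_count(percent, total=TOTAL_TARGETS):
    """Nearest whole count that a percentage of `total` corresponds to."""
    return round(percent / 100 * total)


def check_row(found_rate, precision, recall, total=TOTAL_TARGETS):
    """Return (flags, hits, consistent) under the assumed reading.

    Assumption (not stated in the source): recall = hits / total and
    precision = hits / flags, where flags = found_rate * total.
    """
    flags = implied_count(found_rate, total)
    hits = implied_count(recall, total)
    # Allow for the table rounding precision to one decimal place.
    consistent = abs(100 * hits / flags - precision) < 0.06
    return flags, hits, consistent


if __name__ == "__main__":
    for model, row in LEADERBOARD.items():
        flags, hits, ok = check_row(*row)
        print(f"{model}: {flags}/{TOTAL_TARGETS} flags, {hits} hits, consistent={ok}")
```

Under this reading, the top flag found rate of 42.5% corresponds to 17 of the 40 flags, and every row's precision matches its recall count divided by its flag count to within rounding.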