Penetration Testing, Automated Pentesting, Reconnaissance, Web Exploitation

CyberExplorer

Benchmarking LLM offensive security capabilities in an open-environment attack simulation with realistic reconnaissance, target selection, and exploitation.

Quick Stats

Top Score

42.5%

Models Evaluated

Dataset Size

40 samples

Last Updated

February 8, 2026

Availability

Dataset ✗, Code ✗

Metrics Tracked

flag found-rate, precision, recall

Dataset Information

An open-environment benchmark hosted on a VM with 40 vulnerable web services derived from real-world CTF challenges, where agents must autonomously discover and exploit targets without being told where the vulnerabilities are.

Number of Tasks

Reconnaissance, Target Selection, Web Exploitation, Multi Agent Offense

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	flag found-rate	precision	recall	Evaluated By	Date	Source
1st	Qwen 3 (qwen3, Alibaba)	42.5%	17.6%	7.5%	CyberExplorer authors	February 11, 2026	Link
2nd	Gemini 3 Pro (gemini-3-pro, Google)	27.5%	81.8%	22.5%	CyberExplorer authors	February 11, 2026	Link
3rd	Claude 4.5 Opus (claude-opus-4-5, Anthropic)	25.0%	90.0%	22.5%	CyberExplorer authors	February 11, 2026	Link
4th	GPT-5.2 (gpt-5.2, OpenAI)	25.0%	60.0%	15.0%	CyberExplorer authors	February 11, 2026	Link
5th	DeepSeek V3 (deepseek-v3-671b, DeepSeek)	20.0%	62.5%	12.5%	CyberExplorer authors	February 11, 2026	Link
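
This page does not define the three metrics. One reading that happens to reproduce every row of the table is: flag found-rate = flags reported / 40 services, precision = correct flags / flags reported, and recall = correct flags / 40 services. The sketch below checks that arithmetic. The definitions and the per-model counts are assumptions back-computed from the published percentages, not figures taken from the CyberExplorer authors, and the names `Run`, `pct`, and `runs` are illustrative only.

```python
from dataclasses import dataclass

TOTAL_SERVICES = 40  # dataset size stated on this page


@dataclass
class Run:
    reported: int  # flags the agent claimed to find (inferred, not published)
    correct: int   # reported flags that were valid (inferred, not published)

    def found_rate(self) -> float:
        # Assumed definition: share of services with a reported flag.
        return self.reported / TOTAL_SERVICES

    def precision(self) -> float:
        # Assumed definition: share of reported flags that were valid.
        return self.correct / self.reported if self.reported else 0.0

    def recall(self) -> float:
        # Assumed definition: share of all services with a valid flag.
        return self.correct / TOTAL_SERVICES


def pct(x: float) -> str:
    """Format a fraction as a percentage with one decimal place."""
    return f"{100 * x:.1f}%"


# Hypothetical counts chosen so the assumed formulas match the table.
runs = {
    "Qwen 3": Run(reported=17, correct=3),
    "Gemini 3 Pro": Run(reported=11, correct=9),
    "Claude 4.5 Opus": Run(reported=10, correct=9),
    "GPT-5.2": Run(reported=10, correct=6),
    "DeepSeek V3": Run(reported=8, correct=5),
}

for name, run in runs.items():
    print(name, pct(run.found_rate()), pct(run.precision()), pct(run.recall()))
```

Under this reading, the ranking by flag found-rate rewards the volume of reported flags rather than their validity, which would explain why the top-ranked row has the lowest precision and recall in the table. If the authors define the metrics differently, that interpretation does not hold.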