LLM Safety, Sandbox Escape

SANDBOXESCAPEBENCH

Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.

Quick Stats

Top Score

49.7%

Models Evaluated

9

Dataset Size

18 samples

Last Updated

March 1, 2026

Availability

Dataset ✓, Code ✓

Metrics Tracked

overall success-rate, difficulty 1-success-rate, difficulty 2-success-rate, difficulty 3-success-rate

Sources

Project

Dataset Information

18 container escape challenge levels spanning orchestration, runtime, and kernel attack surfaces in a nested sandbox architecture.

Number of Tasks

Cri O, Kubectl Cp, Rbac, Route Ln, Privileged, Docker Sock, Cap Sys Adm, Cap Mod, Cap Dac Rd, Hostpath, Runc 2019, Runc 2024, Pid Ns, Cgroup, Bpf Privesc, Dirty Cow, Dirty Pipe, Packet Sock

Performance Comparison

Visual comparison of model performance on this benchmark

Model Results

Detailed scores for each model evaluated on this benchmark

Rank	Model	Model ID	Provider	overall success-rate	difficulty 1-success-rate	difficulty 2-success-rate	difficulty 3-success-rate	Evaluated By	Date
1	GPT-5	gpt-5	OpenAI	49.7%	100.0%	100.0%	42.0%	UK AI Security Institute	March 4, 2026
2	Claude 4.5 Opus	claude-opus-4-5	Anthropic	48.9%	100.0%	100.0%	40.0%	UK AI Security Institute	March 4, 2026
3	Claude 4.5 Sonnet	claude-sonnet-4-5	Anthropic	39.9%	100.0%	100.0%	17.0%	UK AI Security Institute	March 4, 2026
4	GPT-5.2	gpt-5.2	OpenAI	26.7%	87.0%	73.0%	0.0%	UK AI Security Institute	March 4, 2026
5	GPT-5 Mini	gpt-5-mini	OpenAI	25.5%	93.0%	53.0%	3.0%	UK AI Security Institute	March 4, 2026
6	Claude 4.5 Haiku	claude-haiku-4-5	Anthropic	17.8%	67.0%	40.0%	0.0%	UK AI Security Institute	March 4, 2026
7	GPT-OSS-120B	gpt-oss-120b	OpenAI	15.7%	47.0%	47.0%	0.0%	UK AI Security Institute	March 4, 2026
8	DeepSeek-R1	deepseek-r1-0528	DeepSeek	15.5%	80.0%	13.0%	0.0%	UK AI Security Institute	March 4, 2026
9	GPT-5 Nano	gpt-5-nano	OpenAI	10.0%	20.0%	40.0%	0.0%	UK AI Security Institute	March 4, 2026
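
For readers who want to work with these numbers programmatically, the following is a minimal Python sketch, not part of the benchmark's own tooling. It copies the scores from the table above into a list and shows how the ranking and a few simple summaries can be reproduced. The names `RESULTS`, `rank_by_overall`, `saturated`, and `nonzero` are hypothetical helpers introduced here for illustration only.

```python
# Scores copied from the Model Results table above (all values are percentages).
# Tuple layout (an assumption of this sketch, not a benchmark file format):
# (model_id, overall, difficulty_1, difficulty_2, difficulty_3)
RESULTS = [
    ("gpt-5",             49.7, 100.0, 100.0, 42.0),
    ("claude-opus-4-5",   48.9, 100.0, 100.0, 40.0),
    ("claude-sonnet-4-5", 39.9, 100.0, 100.0, 17.0),
    ("gpt-5.2",           26.7,  87.0,  73.0,  0.0),
    ("gpt-5-mini",        25.5,  93.0,  53.0,  3.0),
    ("claude-haiku-4-5",  17.8,  67.0,  40.0,  0.0),
    ("gpt-oss-120b",      15.7,  47.0,  47.0,  0.0),
    ("deepseek-r1-0528",  15.5,  80.0,  13.0,  0.0),
    ("gpt-5-nano",        10.0,  20.0,  40.0,  0.0),
]

OVERALL, D1, D2, D3 = 1, 2, 3, 4  # column indices into each tuple


def rank_by_overall(rows):
    """Return rows sorted by overall success rate, highest first."""
    return sorted(rows, key=lambda row: row[OVERALL], reverse=True)


def saturated(rows, column):
    """Return model IDs that scored 100% in the given column."""
    return [row[0] for row in rows if row[column] == 100.0]


def nonzero(rows, column):
    """Return model IDs with any success at all in the given column."""
    return [row[0] for row in rows if row[column] > 0.0]


if __name__ == "__main__":
    for position, row in enumerate(rank_by_overall(RESULTS), start=1):
        print(f"{position}. {row[0]}: {row[OVERALL]}%")
    print("100% on difficulty 1 and 2:",
          sorted(set(saturated(RESULTS, D1)) & set(saturated(RESULTS, D2))))
    print("Any success on difficulty 3:", nonzero(RESULTS, D3))
```

Sorting by the overall column reproduces the rank order shown in the table. The per-difficulty columns are not strictly ordered for every model: gpt-5-nano, for example, scores higher on difficulty 2 than on difficulty 1.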