Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Leaderboards
  • Compare
  • Submit
Support
Back to Benchmarks
llm safety · sandbox-escape

SandboxEscapeBench

Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.

View Paper · Compare Models
Quick Stats

Top Score

49.7%

Models Evaluated

9

Dataset Size

18 samples

Last Updated

March 4, 2026

Paper Details

Title

Quantifying Frontier LLM Capabilities for Container Sandbox Escape

Authors

Rahul Marchand, Art O Cathain, Jerome Wynne

+5 more

Published

March 4, 2026

arXiv ID

2603.02277
Metrics Tracked
overall success rate · difficulty 1 success rate · difficulty 2 success rate · difficulty 3 success rate
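The four tracked metrics are per-difficulty and overall pass rates. A minimal sketch of how such rates can be aggregated from per-sample outcomes, assuming a hypothetical `(difficulty, passed)` record format rather than the benchmark's actual result schema:

```python
from collections import defaultdict

def success_rates(results):
    """Aggregate per-sample pass/fail outcomes into an overall success
    rate plus one rate per difficulty level.

    `results` is a list of (difficulty, passed) tuples -- a hypothetical
    record format used for illustration only.
    """
    by_level = defaultdict(list)
    for difficulty, passed in results:
        by_level[difficulty].append(passed)
    rates = {
        f"difficulty {lvl} success rate": sum(outcomes) / len(outcomes)
        for lvl, outcomes in sorted(by_level.items())
    }
    rates["overall success rate"] = sum(p for _, p in results) / len(results)
    return rates

# Toy example: three difficulty levels with mixed outcomes.
demo = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(success_rates(demo))
```

Note that with this simple aggregation the overall rate is a per-sample average, so it need not equal the mean of the per-difficulty rates when levels have different sample counts.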
Availability
Dataset Available: Yes
Code Available: Yes
Dataset Information

18 container escape challenge levels spanning orchestration, runtime, and kernel attack surfaces in a nested sandbox architecture.
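A generic sketch of what a "sandbox-in-a-sandbox" layout can look like, expressed as a Docker Compose fragment. All names and settings here are hypothetical illustrations, not the benchmark's actual configuration: the idea is that an outer container runs its own Docker daemon (Docker-in-Docker), challenge containers are launched inside it, and an agent that escapes the inner container therefore only reaches the outer sandbox.

```yaml
# Hypothetical compose fragment (illustrative only).
# The outer service hosts an inner Docker daemon; challenge
# containers started inside it are doubly contained, so an
# inner-container escape lands in the outer sandbox, not the host.
services:
  outer-sandbox:
    image: docker:dind        # Docker-in-Docker: inner daemon lives here
    privileged: true          # required for dind
    environment:
      DOCKER_TLS_CERTDIR: ""  # disable inner-daemon TLS (demo only)
```

The nesting trades some realism for safety: kernel-level escapes still share the host kernel, which is why real evaluations of this kind add further isolation around the outer layer.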

Number of Tasks

18: cri-o, kubectl-cp, rbac, route-ln, privileged, docker-sock, cap-sys-adm, cap-mod, cap-dac-rd, hostpath, runc-2019, runc-2024, pid-ns, cgroup, bpf-privesc, dirty-cow, dirty-pipe, packet-sock

Dataset Size

18 samples

Performance Comparison
Visual comparison of model performance on this benchmark
Model Results
Detailed scores for each model evaluated on this benchmark
Rank | Model                                             | Overall | Difficulty 1 | Difficulty 2 | Difficulty 3
1    | GPT-5 (gpt-5, OpenAI)                             | 49.7%   | 100.0%       | 100.0%       | 42.0%
2    | Claude 4.5 Opus (claude-opus-4-5, Anthropic)      | 48.9%   | 100.0%       | 100.0%       | 40.0%
3    | Claude 4.5 Sonnet (claude-sonnet-4-5, Anthropic)  | 39.9%   | 100.0%       | 100.0%       | 17.0%
4    | GPT-5.2 (gpt-5.2, OpenAI)                         | 26.7%   | 87.0%        | 73.0%        | 0.0%
5    | GPT-5 Mini (gpt-5-mini, OpenAI)                   | 25.5%   | 93.0%        | 53.0%        | 3.0%
6    | Claude 4.5 Haiku (claude-haiku-4-5, Anthropic)    | 17.8%   | 67.0%        | 40.0%        | 0.0%
7    | GPT-OSS-120B (gpt-oss-120b, OpenAI)               | 15.7%   | 47.0%        | 47.0%        | 0.0%
8    | DeepSeek-R1 (deepseek-r1-0528, DeepSeek)          | 15.5%   | 80.0%        | 13.0%        | 0.0%
9    | GPT-5 Nano (gpt-5-nano, OpenAI)                   | 10.0%   | 20.0%        | 40.0%        | 0.0%

All rows were evaluated by the UK AI Security Institute; results dated March 4, 2026.