
Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.
Top Score
49.7%
Models Evaluated
9
Dataset Size
18 samples
Last Updated
March 4, 2026
Title
Quantifying Frontier LLM Capabilities for Container Sandbox Escape
Authors
Rahul Marchand, Art O Cathain, Jerome Wynne, and 5 more
Published
March 4, 2026
arXiv ID
2603.02277
18 container escape challenge levels spanning orchestration, runtime, and kernel attack surfaces in a nested sandbox architecture.
Number of Tasks
cri-o, kubectl-cp, rbac, route-ln, privileged, docker-sock, cap-sys-adm, cap-mod, cap-dac-rd, hostpath, runc-2019, runc-2024, pid-ns, cgroup, bpf-privesc, dirty-cow, dirty-pipe, packet-sock
| Rank | Model | Overall success rate | Difficulty 1 success rate | Difficulty 2 success rate | Difficulty 3 success rate | Evaluated By | Date |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5 (`gpt-5`, OpenAI) | 49.7% | 100.0% | 100.0% | 42.0% | UK AI Security Institute | March 4, 2026 |
| 2 | Claude 4.5 Opus (`claude-opus-4-5`, Anthropic) | 48.9% | 100.0% | 100.0% | 40.0% | UK AI Security Institute | March 4, 2026 |
| 3 | Claude 4.5 Sonnet (`claude-sonnet-4-5`, Anthropic) | 39.9% | 100.0% | 100.0% | 17.0% | UK AI Security Institute | March 4, 2026 |
| 4 | GPT-5.2 (`gpt-5.2`, OpenAI) | 26.7% | 87.0% | 73.0% | 0.0% | UK AI Security Institute | March 4, 2026 |
| 5 | GPT-5 Mini (`gpt-5-mini`, OpenAI) | 25.5% | 93.0% | 53.0% | 3.0% | UK AI Security Institute | March 4, 2026 |
| 6 | Claude 4.5 Haiku (`claude-haiku-4-5`, Anthropic) | 17.8% | 67.0% | 40.0% | 0.0% | UK AI Security Institute | March 4, 2026 |
| 7 | GPT-OSS-120B (`gpt-oss-120b`, OpenAI) | 15.7% | 47.0% | 47.0% | 0.0% | UK AI Security Institute | March 4, 2026 |
| 8 | DeepSeek-R1 (`deepseek-r1-0528`, DeepSeek) | 15.5% | 80.0% | 13.0% | 0.0% | UK AI Security Institute | March 4, 2026 |
| 9 | GPT-5 Nano (`gpt-5-nano`, OpenAI) | 10.0% | 20.0% | 40.0% | 0.0% | UK AI Security Institute | March 4, 2026 |