
Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.
Top Score: 49.7%
Models Evaluated: 9
Dataset Size: 18 samples
Last Updated: March 1, 2026
Availability: 18 container escape challenge levels spanning orchestration, runtime, and kernel attack surfaces in a nested sandbox architecture.
Number of Tasks: 18
| Rank | Model | Overall success rate | Difficulty 1 success rate | Difficulty 2 success rate | Difficulty 3 success rate | Evaluated By | Date | Source |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 (`gpt-5`, OpenAI) | 49.7% | 100.0% | 100.0% | 42.0% | UK AI Security Institute | March 4, 2026 | Link |
| 2 | Claude 4.5 Opus (`claude-opus-4-5`, Anthropic) | 48.9% | 100.0% | 100.0% | 40.0% | UK AI Security Institute | March 4, 2026 | Link |
| 3 | Claude 4.5 Sonnet (`claude-sonnet-4-5`, Anthropic) | 39.9% | 100.0% | 100.0% | 17.0% | UK AI Security Institute | March 4, 2026 | Link |
| 4 | GPT-5.2 (`gpt-5.2`, OpenAI) | 26.7% | 87.0% | 73.0% | 0.0% | UK AI Security Institute | March 4, 2026 | Link |
| 5 | GPT-5 Mini (`gpt-5-mini`, OpenAI) | 25.5% | 93.0% | 53.0% | 3.0% | UK AI Security Institute | March 4, 2026 | Link |
| 6 | Claude 4.5 Haiku (`claude-haiku-4-5`, Anthropic) | 17.8% | 67.0% | 40.0% | 0.0% | UK AI Security Institute | March 4, 2026 | Link |
| 7 | GPT-OSS-120B (`gpt-oss-120b`, OpenAI) | 15.7% | 47.0% | 47.0% | 0.0% | UK AI Security Institute | March 4, 2026 | Link |
| 8 | DeepSeek-R1 (`deepseek-r1-0528`, DeepSeek) | 15.5% | 80.0% | 13.0% | 0.0% | UK AI Security Institute | March 4, 2026 | Link |
| 9 | GPT-5 Nano (`gpt-5-nano`, OpenAI) | 10.0% | 20.0% | 40.0% | 0.0% | UK AI Security Institute | March 4, 2026 | Link |
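
For readers who want to work with these figures programmatically, the following is a minimal Python sketch that transcribes the table rows and runs a few sanity checks on them. The tuple layout and the function names (`ranked_by_overall`, `solved_any_difficulty_3`, `rises_from_1_to_2`) are hypothetical helpers introduced here for illustration; they are not part of the benchmark's own tooling.

```python
# Rows transcribed from the leaderboard table above:
# (model id, overall, difficulty 1, difficulty 2, difficulty 3), all in percent.
ROWS = [
    ("gpt-5", 49.7, 100.0, 100.0, 42.0),
    ("claude-opus-4-5", 48.9, 100.0, 100.0, 40.0),
    ("claude-sonnet-4-5", 39.9, 100.0, 100.0, 17.0),
    ("gpt-5.2", 26.7, 87.0, 73.0, 0.0),
    ("gpt-5-mini", 25.5, 93.0, 53.0, 3.0),
    ("claude-haiku-4-5", 17.8, 67.0, 40.0, 0.0),
    ("gpt-oss-120b", 15.7, 47.0, 47.0, 0.0),
    ("deepseek-r1-0528", 15.5, 80.0, 13.0, 0.0),
    ("gpt-5-nano", 10.0, 20.0, 40.0, 0.0),
]


def ranked_by_overall(rows):
    """Return the rows sorted by overall success rate, highest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


def solved_any_difficulty_3(rows):
    """Return ids of models with a nonzero difficulty 3 success rate."""
    return [row[0] for row in rows if row[4] > 0.0]


def rises_from_1_to_2(rows):
    """Return ids of models that score higher on difficulty 2 than on difficulty 1."""
    return [row[0] for row in rows if row[3] > row[2]]


if __name__ == "__main__":
    for position, row in enumerate(ranked_by_overall(ROWS), start=1):
        print(f"{position}. {row[0]}: {row[1]:.1f}% overall")
    print("Nonzero at difficulty 3:", solved_any_difficulty_3(ROWS))
    print("Higher at difficulty 2 than 1:", rises_from_1_to_2(ROWS))
```

Run against the table as published, the sorted order matches the listed ranks, four models record a nonzero difficulty 3 success rate, and `gpt-5-nano` is the only model whose difficulty 2 rate exceeds its difficulty 1 rate.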