Cyber LLM Benchmark Hub Logo
Cyber LLM Benchmark Hub
  • Home
  • Benchmarks
  • Contact
Support
LLM Safety, Sandbox Escape

SANDBOXESCAPEBENCH

Quantifying frontier LLM capabilities for container sandbox escape using an Inspect-based sandbox-in-a-sandbox evaluation.

Quick Stats

Top Score

49.7%

Models Evaluated

9

Dataset Size

18 samples

Last Updated

March 1, 2026

Availability

Dataset ✓, Code ✓
Metrics Tracked
  • overall success-rate
  • difficulty 1-success-rate
  • difficulty 2-success-rate
  • difficulty 3-success-rate
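
The page does not define how these rates are computed. As a minimal sketch, assuming each rate is simply successful escapes divided by attempts, the four metrics could be derived from pass/fail records as shown here. The `attempts` records and the `success_rates` helper are hypothetical and the values are made up for illustration; they are not benchmark results.

```python
from collections import defaultdict

# Hypothetical attempt records: (difficulty level, escaped?).
# Made-up values for illustration only, not results from the benchmark.
attempts = [
    (1, True), (1, True),
    (2, True), (2, False),
    (3, False), (3, False), (3, True), (3, False),
]

def success_rates(records):
    """Return (overall rate, {difficulty: rate}) as fractions in [0, 1].

    Assumption: a rate is successes divided by attempts in that group.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for difficulty, succeeded in records:
        totals[difficulty] += 1
        wins[difficulty] += int(succeeded)
    per_difficulty = {d: wins[d] / totals[d] for d in sorted(totals)}
    overall = sum(wins.values()) / sum(totals.values())
    return overall, per_difficulty

overall, per_difficulty = success_rates(attempts)
print(f"overall success-rate: {overall:.1%}")
for d, rate in per_difficulty.items():
    print(f"difficulty {d}-success-rate: {rate:.1%}")
```

In this sketch the overall rate pools all attempts, so it weights each difficulty by its number of attempts and need not equal the plain average of the per-difficulty rates.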
Sources
Project
Dataset Information

18 container escape challenge levels spanning orchestration, runtime, and kernel attack surfaces in a nested sandbox architecture.

Number of Tasks

18

  • Cri O
  • Kubectl Cp
  • Rbac
  • Route Ln
  • Privileged
  • Docker Sock
  • Cap Sys Adm
  • Cap Mod
  • Cap Dac Rd
  • Hostpath
  • Runc 2019
  • Runc 2024
  • Pid Ns
  • Cgroup
  • Bpf Privesc
  • Dirty Cow
  • Dirty Pipe
  • Packet Sock
Performance Comparison
[Chart: visual comparison of model performance on this benchmark]
Model Results
Detailed scores for each model evaluated on this benchmark
All results were evaluated by the UK AI Security Institute on March 4, 2026. All values are success rates.

Rank  Model              Model ID           Provider   Overall  Difficulty 1  Difficulty 2  Difficulty 3
1     GPT-5              gpt-5              OpenAI     49.7%    100.0%        100.0%        42.0%
2     Claude 4.5 Opus    claude-opus-4-5    Anthropic  48.9%    100.0%        100.0%        40.0%
3     Claude 4.5 Sonnet  claude-sonnet-4-5  Anthropic  39.9%    100.0%        100.0%        17.0%
4     GPT-5.2            gpt-5.2            OpenAI     26.7%    87.0%         73.0%         0.0%
5     GPT-5 Mini         gpt-5-mini         OpenAI     25.5%    93.0%         53.0%         3.0%
6     Claude 4.5 Haiku   claude-haiku-4-5   Anthropic  17.8%    67.0%         40.0%         0.0%
7     GPT-OSS-120B       gpt-oss-120b       OpenAI     15.7%    47.0%         47.0%         0.0%
8     DeepSeek-R1        deepseek-r1-0528   DeepSeek   15.5%    80.0%         13.0%         0.0%
9     GPT-5 Nano         gpt-5-nano         OpenAI     10.0%    20.0%         40.0%         0.0%
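
For readers who want to work with these numbers programmatically, here is a small sketch that holds the results listed above as records and answers two simple questions: which models had any success at difficulty 3, and how large the gap is between the top two overall scores. The `Row` type and `ROWS` list are illustrative names, not part of the benchmark's code; the values are copied from the results above.

```python
from typing import NamedTuple

class Row(NamedTuple):
    """One leaderboard entry; all rates are percentages."""
    model_id: str
    overall: float
    d1: float
    d2: float
    d3: float

# Values copied from the Model Results listing on this page.
ROWS = [
    Row("gpt-5", 49.7, 100.0, 100.0, 42.0),
    Row("claude-opus-4-5", 48.9, 100.0, 100.0, 40.0),
    Row("claude-sonnet-4-5", 39.9, 100.0, 100.0, 17.0),
    Row("gpt-5.2", 26.7, 87.0, 73.0, 0.0),
    Row("gpt-5-mini", 25.5, 93.0, 53.0, 3.0),
    Row("claude-haiku-4-5", 17.8, 67.0, 40.0, 0.0),
    Row("gpt-oss-120b", 15.7, 47.0, 47.0, 0.0),
    Row("deepseek-r1-0528", 15.5, 80.0, 13.0, 0.0),
    Row("gpt-5-nano", 10.0, 20.0, 40.0, 0.0),
]

# Rank by overall success rate, highest first.
ranked = sorted(ROWS, key=lambda r: r.overall, reverse=True)

# Models with a nonzero difficulty 3 success rate, in rank order.
solved_d3 = [r.model_id for r in ranked if r.d3 > 0]

# Gap between the top two overall scores, in percentage points.
lead = round(ranked[0].overall - ranked[1].overall, 1)

print("Nonzero difficulty 3:", ", ".join(solved_d3))
print(f"Top-two gap: {lead} percentage points")
```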

Benchmarking frontier models across cybersecurity tasks.

© 2026 Cyber LLM Benchmark Hub