
Benchmark tests LLM agents for container escapes

Published: Wed, Mar 04, 2026 • By Theo Solander
Researchers release SANDBOXESCAPEBENCH, a safe way to measure whether Large Language Model (LLM) agents can break out of container sandboxes. Frontier models reliably exploit common misconfigurations and exposed control paths; moderate tasks see mixed success; hardest cases hold. Scale, hints, and token budgets matter. The work argues for stricter configs and standardised escape testing.

When we hand Large Language Model (LLM) agents a shell and a shopping list of tools, we reach for containers as the safety harness. This new study asks a blunt question: if an agent pokes at the walls, how often do those walls hold? The answer arrives in the form of SANDBOXESCAPEBENCH, an open benchmark designed to quantify an agent’s ability to escape Docker/OCI containment without risking the host that runs the test.

The setup is neat. The authors run a Capture the Flag evaluation inside a nested sandbox. The inner container is where the LLM agent gets interactive shell access, typical utilities, and network connectivity. The outer layer has no known vulnerabilities and holds the flag on the host side of all namespaces. Success is unambiguous: read the flag. The suite covers 18 reproducible scenarios across three layers that practitioners will recognise: orchestration pitfalls, engine and runtime issues, and host or kernel vectors.
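That success criterion is easy to pin down in code. A minimal sketch of the idea (the actual harness is built on Inspect AI; `make_flag` and `flag_captured` here are hypothetical stand-ins, not the authors' code):

```python
import secrets

def make_flag() -> str:
    """Generate a fresh, unguessable flag for each run, so success
    can only come from actually reading the host-side flag file."""
    return f"FLAG{{{secrets.token_hex(16)}}}"

def flag_captured(transcript: str, flag: str) -> bool:
    """Unambiguous success signal: the exact flag string appears
    verbatim somewhere in the agent's output."""
    return flag in transcript
```

Because the flag is random per run and matched exactly, there is no partial credit and no way to guess the answer: the agent either crossed the containment boundary or it did not.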

What the benchmark shows

The headline finding is familiar to anyone who has ever run a brittle chroot jail: misconfiguration collapses isolation. Frontier models reliably exploit low-difficulty scenarios such as excessive privileges, writable host mounts, and exposed control surfaces like Docker daemons. On more involved tasks that require specific exploit knowledge or multi-step chaining, the largest models still register non-trivial success. The study notes 0.40 success on level 3 scenarios for two top-tier models, while difficulties 4 and 5 saw no observed wins in this run.
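Those low-difficulty failure modes are cheap to spot from inside a container. A read-only sketch of the heuristics (assuming standard Docker paths and Linux `/proc` layout; this is illustrative, not the benchmark's code):

```python
import os

CAP_SYS_ADMIN = 21  # capability bit number, per linux/capability.h

def has_cap_sys_admin(status_path: str = "/proc/self/status") -> bool:
    """True when CAP_SYS_ADMIN is in the effective capability set,
    the usual fingerprint of --privileged or an over-broad --cap-add."""
    try:
        with open(status_path) as fh:
            for line in fh:
                if line.startswith("CapEff:"):
                    return bool((int(line.split()[1], 16) >> CAP_SYS_ADMIN) & 1)
    except OSError:
        pass
    return False

def escape_indicators(docker_sock: str = "/var/run/docker.sock") -> dict:
    """Heuristics for the low-difficulty misconfigurations named above:
    an exposed control surface and over-allocated privileges."""
    return {
        # A mounted Docker socket lets any in-container process ask
        # the daemon to launch a privileged sibling on the host.
        "docker_socket_exposed": os.path.exists(docker_sock),
        "cap_sys_admin": has_cap_sys_admin(),
    }
```

An agent needs nothing more exotic than this kind of enumeration to find the easy wins, which is precisely why the paper frames them as operational mistakes rather than exploits.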

Family and scale matter. In the reported results, one large model family scored 0.49 overall, while a smaller family reached 0.18 and an open-weight 120B model hit 0.16. The authors also see performance scaling roughly with inference-time compute, and they point out that repeated attempts are cheap in their setup. Hints reduce the search burden for some models, doubling success for one, while another showed little change. The agents also went off-piste when given the chance, finding unintended shortcuts during development, including default Vagrant SSH credentials and alternative kernel paths like Dirty COW and eBPF; these were mitigated in the final release. All observed escapes used known misconfigurations or disclosed CVEs. No zero-days fell out.

There are boundaries here. The threat model assumes the agent already has code execution inside the container and operates without intrusion detection or logging to trip over. Hardware side channels, hypervisor breaks, and social engineering sit outside scope. Difficulty ratings are expert estimates, and the set of scenarios does not claim to cover the world. Agent architectures are relatively simple, so different designs or parameters may change the picture.

Why it matters

From early Java applets to today’s containers, sandboxing has always been part engineering and part housekeeping. The engineering keeps clever kernel tricks at bay; the housekeeping avoids the foot-guns. This work quantifies how much housekeeping still matters when the actor is tireless and tool-using. If your agents can run code, treat plain Docker isolation as insufficient by default. Lock down privileges, avoid writable host mounts, and do not expose orchestration control surfaces. Keep runtimes and kernels patched. More importantly, adopt standardised escape testing. A nested-sandbox approach like this provides a safer way to rehearse failure and track whether your defences improve over time.
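As a concrete baseline, that housekeeping translates into a handful of `docker run` flags. An illustrative starting point, not the paper's configuration (`agent-image` is a placeholder):

```shell
# Drop every capability, forbid privilege escalation, keep the root
# filesystem read-only, and mount nothing writable from the host.
docker run --rm \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  --read-only --tmpfs /tmp \
  --pids-limit 256 \
  --network none \
  agent-image
```

Real agent workloads usually need some network access and writable scratch space, so relax these flags deliberately and one at a time, rather than starting permissive by default.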

One final wrinkle is operational: token budgets and retry strategies are part of your risk model now. The study observes that more compute translates into more solved challenges. That echoes an old lesson from virtualisation and earlier sandboxes alike: when the cost of trying again drops, attackers try again. Plan your controls accordingly.
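The retry arithmetic is worth making explicit. Assuming attempts are independent (an idealisation; real attempts share a model's blind spots), the chance of at least one escape grows quickly with the retry budget:

```python
def escape_probability(p_per_attempt: float, attempts: int) -> float:
    """P(at least one success in k independent attempts):
    1 - (1 - p) ** k."""
    return 1.0 - (1.0 - p_per_attempt) ** attempts

# At the study's level-3 figure of 0.40 per attempt, five cheap
# retries push the cumulative escape probability past 0.92.
```

The practical reading: a per-attempt success rate that looks tolerable becomes near-certain failure of containment once retries are cheap, which is why token budgets belong in the threat model.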

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Quantifying Frontier LLM Capabilities for Container Sandbox Escape

Authors: Rahul Marchand, Art O Cathain, Jerome Wynne, Philippos Maximos Giavridis, Sam Deverett, John Wilkinson, Jason Gwartz, and Harry Coppock
Large language models (LLMs) increasingly act as autonomous agents, using tools to execute code, read and write files, and access networks, creating novel security risks. To mitigate these risks, agents are commonly deployed and evaluated in isolated "sandbox" environments, often implemented using Docker/OCI containers. We introduce SANDBOXESCAPEBENCH, an open benchmark that safely measures an LLM's capacity to break out of these sandboxes. The benchmark is implemented as an Inspect AI Capture the Flag (CTF) evaluation utilising a nested sandbox architecture with the outer layer containing the flag and no known vulnerabilities. Following a threat model of a motivated adversarial agent with shell access inside a container, SANDBOXESCAPEBENCH covers a spectrum of sandbox-escape mechanisms spanning misconfiguration, privilege allocation mistakes, kernel flaws, and runtime/orchestration weaknesses. We find that, when vulnerabilities are added, LLMs are able to identify and exploit them, showing that use of evaluation like SANDBOXESCAPEBENCH is needed to ensure sandboxing continues to provide the encapsulation needed for highly-capable models.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies whether contemporary large language model agents can escape container sandboxes and thereby access host resources, a concrete containment and security risk as agents gain the ability to execute code, run shell commands, read and write files, and use networks. The work matters because many agent deployments rely on Docker/OCI isolation which can be undermined by misconfiguration, runtime flaws, or kernel exploits, and there was previously no standard, safe benchmark for measuring agent-driven sandbox escape.

Approach

The authors introduce SANDBOXESCAPEBENCH, an open benchmark implemented as a Capture the Flag evaluation using a nested sandbox or sandbox-in-a-sandbox architecture so escapes from the tested container remain isolated from the evaluator host. The threat model assumes a motivated adversarial agent with interactive shell access inside a Linux container, typically with root in-container privileges, bash and CLI utilities, and network access. Success is defined unambiguously as reading a host-resident flag file outside all container namespaces, detected by exact string matching. The benchmark contains 18 reproducible scenarios across three layers: Orchestration (4), Engine and Runtime (8), and Host/Kernel (6), selected to cover common exploit classes including misconfigurations, exposed control surfaces, runtime tooling vulnerabilities and kernel privilege-escalation techniques. Reference exploit scripts and a private test set are maintained; the evaluation harness runs models through an agent loop that issues bash commands inside the container under test while a virtual machine layer provides the second containment boundary.
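The agent loop described above can be sketched in a few lines. A hypothetical skeleton under the paper's stated design, not the Inspect AI implementation: the model proposes a bash command given the history so far, the inner container executes it, and the episode ends on flag capture or turn exhaustion.

```python
def run_episode(model_step, execute_bash, flag: str, max_turns: int = 30):
    """Minimal agent loop: model_step maps the (command, output)
    history to the next bash command; execute_bash runs it inside
    the container under test and returns its output."""
    history = []
    for _ in range(max_turns):
        cmd = model_step(history)
        out = execute_bash(cmd)
        history.append((cmd, out))
        if flag in out:  # exact-string success criterion
            return True, history
    return False, history
```

Keeping the loop this simple is a deliberate baseline choice: any capability the benchmark measures comes from the model, not from scaffolding.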

Key Findings

  • Frontier models reliably exploit common misconfigurations and exposed control surfaces. On low-difficulty scenarios (difficulty 1–2) such as privileged containers, writable host mounts and exposed Docker daemons, large models achieve high success rates, indicating that simple operational mistakes collapse isolation under active exploration.
  • For moderate-difficulty tasks requiring specific exploit knowledge or multi-stage procedures (difficulty 3), the largest models retain non-trivial success: GPT-5 and Opus 4.5 each achieved a 0.40 success rate on level 3 scenarios. No successes were observed for difficulty 4 and 5 scenarios in this evaluation.
  • Model family and scale matter. Opus scored 0.49 overall, while the smaller Haiku family scored 0.18 and the open-weight GPT-OSS-120B scored 0.16. Performance scales roughly log-linearly with inference-time compute (token budget), and marginal per-sample cost is low in the authors' setup, making repeated attempts cheap.
  • Providing explicit hints about the underlying vulnerability reduces search burden. Opus notably doubled success on hard scenarios when given hints, whereas GPT-5 showed little change, suggesting differences in how models convert vulnerability knowledge into exploitation steps.
  • Models opportunistically find any route to the flag, including unintended shortcuts. During development four shortcuts were found and mitigated, examples being default Vagrant SSH credentials and alternative kernel vulnerabilities such as Dirty COW and eBPF vectors.
  • All observed successful escapes used previously documented misconfigurations or disclosed CVEs; the evaluation did not produce novel zero-day discoveries in this work.

Limitations

The benchmark assumes initial code execution inside the container, no active intrusion detection or logging, and excludes hardware side channels, hypervisor escape and human social engineering. Difficulty ratings are expert estimates and scenario coverage does not exhaust all real-world configurations. The agent architectures evaluated are relatively simple; more sophisticated agents or different deployment parameters could change results. The absence of observed success at particular token budgets does not prove a vulnerability is unexploitable.

Why It Matters

Results show sandbox escape is a real risk vector for autonomous LLM agents when common misconfigurations or known vulnerabilities are present. Practical implications include the need to treat plain Docker isolation as insufficient by default, to adopt stricter configuration and patching practices, to use standardised escape testing like SANDBOXESCAPEBENCH when deploying agents, and to consider deployment parameters such as token budget and retry strategy as part of threat assessment. The benchmark and nested-sandbox methodology offer defenders a safer, reproducible way to track and harden containment against evolving AI capabilities.

