Environmental Text Can Jailbreak Embodied AI

Agents

Published: Fri, Nov 21, 2025 • By Lydia Stratus

Environmental Text Can Jailbreak Embodied AI

Researchers demonstrate a new attack surface for embodied agents: indirect environmental jailbreaks. An attacker places readable instructions in the environment, which Vision-Language Models used by robots consume as task context. The automated Shawshank tools generate and benchmark such attacks, showing broad success across models and only partial mitigation by current defences.

A recent paper exposes a practical risk for embodied systems that use Vision-Language Models (VLMs). The core idea is simple and unnerving: an attacker does not need direct prompt access. They place readable instructions in the environment — graffiti, posters, projected text — and the agent treats them as legitimate context. This class of attack the authors call indirect environmental jailbreak (IEJ) expands the surface beyond prompt injection to anything the agent can see.

The study introduces two open-source frameworks, SHAWSHANK and SHAWSHANK-FORGE, and a benchmark called SHAWSHANK-BENCH. SHAWSHANK automatically generates candidate malicious instructions and picks where to place them in a scene. SHAWSHANK-FORGE builds large-scale scenarios for testing. In simulation the team composites text into RGB frames, runs it through OCR and the agent pipeline, and measures whether the injected content steers planning or causes denial-of-service behaviour. The experiments compromise all six tested VLMs and show the attack generalises across tasks and indoor scenes. The authors also report that existing defences only partially reduce success rates.

Where the risk sits in the stack

Think of the pipeline as camera -> OCR -> VLM -> planner -> actuator. The malicious instruction sits at the camera input but influences the planner without ever touching the prompt channel. That implicit trust in visual context is the exploit. Real world factors matter: placement (wall, floor, table), readability to OCR, and plausibility of the instruction relative to the task all change effectiveness. The paper demonstrates automated ways to optimise these factors, which makes the threat scalable rather than theoretical.

Practical mitigations and run-book steps

You cannot fully fix this in a single patch. But you can harden systems quickly. First, treat environmental text as untrusted context. Do not merge visual text into task intent without extra validation. Second, add multi-sensor confirmation: cross-check visual instructions against other sensors or a secure task manifest before a risky action. Third, enforce explicit confirmation for physical effects: require a human confirmation or secure API token for actions that change the environment or affect safety.

Operationally, add an OCR confidence gate and semantic filter that rejects or quarantines improbable commands. Log and alert on cases where visual text materially changes a plan. Use runtime anomaly detection on the planner output to flag sudden goal changes driven by newly seen text. Regularly run red-team benches; SHAWSHANK-BENCH provides a ready set of scenarios to measure regressions in defences.

Finally, physical controls matter. Restrict unauthorised access to spaces where embodied agents operate, and screen surfaces that the agent frequently observes. Defence is layered: model alignment and filters help, but sensor hygiene and site security are often the most immediate and reliable safeguards.

Checklist: filter environmental text, require confirmations for risky acts, run red-team tests with SHAWSHANK-BENCH.

The paper is cautious about real-world limits: experiments run in simulation, and on-the-ground factors will vary. Still, the results are a clear call to treat the environment as an attacker vector. If you run robots or other embodied systems, make environmental content part of your threat model today rather than an afterthought tomorrow.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

The Shawshank Redemption of Embodied AI: Understanding and Benchmarking Indirect Environmental Jailbreaks

Authors: Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, and Jianfeng Ma

The adoption of Vision-Language Models (VLMs) in embodied AI agents, while being effective, brings safety concerns such as jailbreaking. Prior work have explored the possibility of directly jailbreaking the embodied agents through elaborated multi-modal prompts. However, no prior work has studied or even reported indirect jailbreaks in embodied AI, where a black-box attacker induces a jailbreak without issuing direct prompts to the embodied agent. In this paper, we propose, for the first time, indirect environmental jailbreak (IEJ), a novel attack to jailbreak embodied AI via indirect prompt injected into the environment, such as malicious instructions written on a wall. Our key insight is that embodied AI does not ''think twice'' about the instructions provided by the environment -- a blind trust that attackers can exploit to jailbreak the embodied agent. We further design and implement open-source prototypes of two fully-automated frameworks: SHAWSHANK, the first automatic attack generation framework for the proposed attack IEJ; and SHAWSHANK-FORGE, the first automatic benchmark generation framework for IEJ. Then, using SHAWSHANK-FORGE, we automatically construct SHAWSHANK-BENCH, the first benchmark for indirectly jailbreaking embodied agents. Together, our two frameworks and one benchmark answer the questions of what content can be used for malicious IEJ instructions, where they should be placed, and how IEJ can be systematically evaluated. Evaluation results show that SHAWSHANK outperforms eleven existing methods across 3,957 task-scene combinations and compromises all six tested VLMs. Furthermore, current defenses only partially mitigate our attack, and we have responsibly disclosed our findings to all affected VLM vendors.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies indirect environmental jailbreaks IEJ, a novel safety risk for embodied AI that use Vision-Language Models. Unlike direct jailbreaks that alter prompts, IEJ exploits malicious content embedded in the environment, such as wall text, to influence a black box embodied agent without contacting its prompts. The agent may treat environmental text as legitimate task context, bypass safety filters and execute harmful actions or suffer Denial-of-Service DoS. This expands the attack surface beyond prompt injections and raises concerns for real world deployments where environmental cues could steer robot behaviour. The study seeks to answer what content can trigger IEJ, where to place it, and how to evaluate such attacks, emphasising practical relevance for security testing and defence hardening.

Approach

The authors introduce Shawshank, an automatic attack generation framework for IEJ, and Shawshank-Forge, the first automatic benchmark generation framework for IEJ. They automatically construct Shawshank-Bench, the first IEJ benchmark. The system comprises four modules: Initialization collects environmental information; Sampling uses a genetic algorithm to produce diverse candidate malicious instructions; Generate refines instructions through LLM based rewriting, a surrogate VLM and an evaluator; Placement identifies optimal embedding locations for the instructions. The threat model posits a weak attacker with black box access who cannot prompt the agent or touch it directly but can manipulate the environment by writing or projecting text onto surfaces. Experiments use six Vision-Language Models VLMs and a simulated environment (AI2-THOR) with a multimodal task planner. Indirect environment injections are simulated by compositing text onto RGB frames and testing OCR readability and behavioural triggers. For benchmarking, Shawshank-Forge collects 544 scenes across kitchen, living room, bedroom and bathroom, and generates 1,632 benign and 3,957 malicious instructions, validated for semantic richness and plausibility. The system relies on LLMs to generate and evaluate instructions and to simulate model responses, with evaluations conducted in a sandboxed setting and results disclosed to vendors.

Key Findings

Shawshank outperforms eleven baselines across 3 957 task-scene combinations and compromises all six tested VLMs, demonstrating strong effectiveness of IEJ against diverse models.
Generalisation and transferability are shown across multiple models; jailbreak ASR scores reach values such as 0.77 for Qwen3-VL and 0.50 for GPT-4o, with DoS DoS impact indicated by substantial Planning Success Rate PSR drops, for instance 76.67 for Qwen3-VL and 56.60 for GPT-4o, and 52.00 for Gemini-2.0.
Defences are only partially effective; under Qwen3Guard and SAP the attack still achieves ASR 0.52 and 0.65 respectively, indicating the need for stronger defence strategies against IEJ.
Efficiency and scalability are demonstrated; generating 1 000 tasks consumes about 370 81 seconds and increases API calls from 4 to 3 984, with iteration count kk improving ASR up to a point before diminishing returns set in.
Location and dataset findings show robust attack performance across four injection sites ground wall table and across a diverse set of scenes; ASR ranges 0 7361 to 0 7519 with OCR rates 89 6 to 94 4 and Harm Risk Scores in the mid to high range depending on task and model.
Shawshank-Bench provides a diverse benchmark with 544 scenes and 3 957 malicious instructions spanning harms such as damage to creatures, destruction of items, waste of resources and invasion of privacy, enabling systematic evaluation of IEJ attacks.

Limitations

Experiments were conducted in a simulated sandboxed environment, and real world feasibility remains to be established. The threat model assumes an extremely weak attacker and black box access; feasibility of social engineering or broader real world deployment requires further study. The evaluation covers six VLMs; results may differ for additional models or highly dynamic environments. While current defence mechanisms show partial effectiveness, more robust and adaptive approaches are needed. The benchmark scenarios are limited to four indoor environments and a defined set of tasks, inviting future expansion to more diverse and interactive settings.

Why It Matters

IEJ reveals a practical and scalable attack surface for embodied AI that operates through environmental content rather than direct prompts. The automated Shawshank and Shawshank-Forge frameworks enable red team style testing and defence hardening by generating and evaluating attacks and providing a benchmark Shawshank-Bench. Practical implications include the need for environmental content filtering, stronger model alignment, and runtime monitoring of agent behaviour to detect atypical responses triggered by surroundings. Societal risks include safety concerns for robots in public or workspaces where crafted environmental cues could steer actions, underscoring the urgency of developing robust safeguards against IEJ and similar environmental manipulation risks.

Attribution Original paper on arXiv