World models quietly poison robot training pipelines
Robotics teams love world models. They’re cheap simulators and demo factories that pad out small, expensive datasets. They also open a fresh hole in the supply chain. This paper shows you can poison robot policies by slipping barely noticeable changes into teleoperation videos that only turn toxic after a world model processes them. The raw data looks fine; the synthetic data it spawns does not.
How the attacks work
The authors target two popular setups. First, text-conditioned world models that take image context plus a prompt to generate demonstrations. Their Visual Prompt Hijacking perturbation nudges image frames so the model misreads a benign prompt as a dangerous one. Out-of-distribution prompts and vague instructions make this easier to land. They hit public models, including a text-conditioned variant of Cosmos-Predict 2.5.
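To make the gradient-based idea concrete, here is a minimal sketch under heavy assumptions. A hypothetical linear scorer stands in for how strongly the model leans towards the attacker's reading of the prompt, and the perturbation is bounded per pixel. The names `score` and `pgd_perturb`, the weights and the budget are all illustrative; the paper attacks real video world models, and its actual loss and constraint are not reproduced here.

```python
# Toy stand-in for a white-box, bounded perturbation attack.
# ASSUMPTION: `score` is a hypothetical linear surrogate, not the paper's model.

def score(weights, pixels):
    """Higher score = the toy model leans towards the attacker's target reading."""
    return sum(w * p for w, p in zip(weights, pixels))

def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def pgd_perturb(weights, pixels, epsilon=0.03, step=0.01, iters=10):
    """Projected gradient ascent on `score` inside an L-infinity ball of radius epsilon."""
    adv = list(pixels)
    for _ in range(iters):
        # For a linear scorer, the gradient with respect to the pixels is `weights`.
        adv = [a + step * sign(w) for a, w in zip(adv, weights)]
        # Project back into the epsilon ball around the clean pixels...
        adv = [min(max(a, p - epsilon), p + epsilon) for a, p in zip(adv, pixels)]
        # ...and keep every value a valid pixel intensity in [0, 1].
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv

# Hypothetical four-pixel "frame" and surrogate weights.
weights = [0.8, -0.5, 0.3, -0.9]
clean = [0.2, 0.6, 0.5, 0.4]
adv = pgd_perturb(weights, clean)
max_change = max(abs(a - c) for a, c in zip(adv, clean))
```

The point of the sketch is the shape of the loop: follow the gradient, then project back into a small budget so the frame still looks normal. With white-box access, the same pattern applies to a much larger model and a perceptual constraint.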
Second, action-conditioned models that predict future states from current state and action. Visual Transition Hijacking perturbs state inputs so the model’s predictions collapse for all but one action. Train Reinforcement Learning (RL) inside this warped model and the agent learns a backdoor: it gets rewarded for the attacker’s chosen action because nothing bad seems to happen when it takes it. They demonstrate an end-to-end backdoor in a downstream Deep RL (DRL) policy trained with Proximal Policy Optimisation (PPO) in the Dino world model, and show degraded generation in Cosmos-Predict 2.5’s action-conditioned setup. Model guardrails tested did not reliably block the poisoned generations.
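The backdoor mechanism can be illustrated with a toy that makes several assumptions. It swaps PPO for a tabular epsilon-greedy value learner, replaces the world model with a hypothetical `imagined_reward` function, and models "collapse" as a fixed penalty. The action indices, reward values and trigger rate are invented for illustration and do not come from the paper.

```python
import random

ACTIONS = [0, 1, 2]
TASK_ACTION = 0     # action the clean task rewards (hypothetical)
TARGET_ACTION = 2   # attacker's chosen action (hypothetical)

def imagined_reward(triggered, action):
    """Reward from one imagined step in the toy, possibly poisoned, world model."""
    if triggered:
        # ASSUMPTION: collapsed predictions score badly for every action except
        # the attacker's target, where nothing bad seems to happen.
        return 0.0 if action == TARGET_ACTION else -1.0
    return 1.0 if action == TASK_ACTION else 0.0

def train(episodes=5000, epsilon=0.1, lr=0.1, seed=0):
    """Tabular epsilon-greedy value learning over (triggered, action) pairs."""
    rng = random.Random(seed)
    q = {(t, a): 0.0 for t in (False, True) for a in ACTIONS}
    for _ in range(episodes):
        triggered = rng.random() < 0.5  # half the imagined states carry the trigger
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(triggered, a)])
        reward = imagined_reward(triggered, action)
        q[(triggered, action)] += lr * (reward - q[(triggered, action)])
    return q

def greedy(q, triggered):
    """Action the learned values prefer in a clean or triggered state."""
    return max(ACTIONS, key=lambda a: q[(triggered, a)])

q = train()
clean_choice = greedy(q, False)
triggered_choice = greedy(q, True)
```

In this toy the learner keeps choosing the task action on clean states and switches to the attacker's action whenever the trigger is present. The reward function itself is never modified; the bias comes entirely from which imagined transitions go bad.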
The trick is supply-chain stealth. Traditional poisoning hides bad trajectories in the dataset itself. Here the poison is latent. You inspect the teleop clips and see normal behaviour. Only after the world model turns them into synthetic rollouts do the knives come out.
Does it matter?
Yes, if you rely on world models and third-party demos. The attack assumptions are strong: white-box access to the world model for gradient-based tweaks, and relatively large, though still “perceptually constrained”, perturbations in LAB colour space. The Vision-Language-Action (VLA) pipeline attack is still a proof of concept, not production-grade. Evaluation across real-world pipelines is thin. But the core point stands: generated data can be a liability even when your ground truth looks clean.
The uncomfortable question is whether a black-box variant can be made practical against closed models or API-only systems. If the answer is even “sometimes,” treating world models as trustworthy data amplifiers is going to age badly. In the meantime, assume the demo you didn’t see, the synthetic one, is the one that bites.
Additional analysis of the original arXiv paper
📋 Original Paper Title
Targeting World Models to Compromise Robot Learning Pipelines
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies a novel supply chain poisoning vector introduced by world models in robot learning. World models are increasingly used to generate synthetic demonstrations or to simulate environments for both behaviour cloning and reinforcement learning. The authors show that maliciously altered but visually benign teleoperation videos can be crafted so that, when consumed by a world model, they produce dangerous synthetic trajectories or manipulated transition dynamics. These synthetic outputs can then poison downstream policies or implant backdoors despite the original dataset appearing safe.
Approach
The authors analyse two classes of world models: text-conditioned models that generate demonstrations from image context plus a text prompt, and action-conditioned models that predict future states given state and action. They assume an attacker with white-box access to the victim world model who can compute gradients and craft small, perceptually constrained perturbations to video frames in LAB colour space. Two attack methods are proposed: Visual Prompt Hijacking (VPH), which alters images so the model interprets a benign prompt as a dangerous one and thus generates unsafe demonstrations; and Visual Transition Hijacking (VTH), which perturbs states so future predictions collapse for all but a chosen target action, thereby rewarding that action in RL training. Experiments target state-of-the-art public models, including Cosmos-Predict 2.5 (text-conditioned and action-conditioned variants) and the Dino world model, as well as downstream policies trained with PPO in a world-model environment.
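For readers unfamiliar with LAB budgets, the following sketch shows one way a perturbation could be measured and capped in that space. It uses the standard sRGB to CIE LAB conversion (D65 white point) and the CIE76 colour difference. The per-channel clipping in `clip_lab_delta` and any budget values passed to it are assumptions for illustration, not the paper's actual constraint.

```python
import math

# D65 reference white used in the XYZ -> LAB step.
WHITE = (0.95047, 1.0, 1.08883)

def _srgb_to_linear(c):
    """Undo the sRGB gamma curve for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def _f(t):
    """Piecewise cube-root function from the CIE LAB definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

def rgb_to_lab(rgb):
    """Convert an sRGB triple in [0, 1] to CIE LAB (D65)."""
    r, g, b = (_srgb_to_linear(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = (_f(v / w) for v, w in zip((x, y, z), WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def delta_e76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance in LAB."""
    return math.dist(lab1, lab2)

def clip_lab_delta(delta, budget):
    """Clamp each LAB channel of a perturbation to +/- its budget (hypothetical constraint)."""
    return tuple(max(-b, min(b, d)) for d, b in zip(delta, budget))
```

Working in LAB rather than RGB matters because equal distances in LAB correspond more closely to equal perceived differences, so a fixed budget there is a better proxy for "looks the same to a human reviewer".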
Key Findings
- World models create a stealthy poisoning vector: small, visually plausible perturbations to teleoperation frames can cause generation of dangerous synthetic trajectories while the original dataset appears safe.
- Text-conditioned models are particularly vulnerable when operating out of distribution or when user prompts are vague; attacks succeed more often in these settings than under in-distribution, specific prompts.
- Action-conditioned attacks can induce prediction collapse conditioned on the agent's action; by making predictions collapse only for non-target actions, the attacker can implant a functional backdoor in a downstream DRL policy. The authors demonstrate a backdoor in a PPO policy trained inside the Dino world model.
- VTH-style attacks also degrade generation quality in a large action-conditioned model (Cosmos-Predict 2.5 AC), showing the threat extends to models trained on real-world data.
- Model guardrails tested did not reliably detect or block the manipulated generations.
Limitations
Experiments assume white-box access to the world model, a strong capability. The VLA pipeline attack remains a proof of concept: inverse dynamics models and generated videos are still imperfect, and full end-to-end VLA poisoning was not demonstrated at production quality. Attack perturbations were relatively large in LAB space at this research stage, and the implementations are not claimed to be optimal. Real-world evaluation across diverse pipelines and defender practices remains to be done.
Implications
Offensive implications are significant for supply-chain compromise. A malicious teleoperation data provider could submit visually normal demonstrations that, once processed by a world model, produce unsafe synthetic trajectories or reward a targeted action, enabling stealthy backdoors in both behaviour cloning (BC) and RL policies. Attackers can exploit out-of-distribution scenarios and underspecified prompts to increase success, and may induce prediction collapse so that a chosen action is never penalised during training. These vectors reduce the need for direct access to training environments and make dataset inspection insufficient for detecting some attacks.