ReShift plants reasoning-level backdoors in vision-language models
New research describes a backdoor that targets the chain-of-thought inside vision-language models, not just the final answer. ReShift uses poisoned training data and reinforcement learning to redirect reasoning when a trigger appears, while clean accuracy holds up and the rationales stay plausible. Output-based and perplexity-based detectors struggle: the reported detection accuracy is near-random, at 0.09 to 0.17.
Backdoors in AI models usually flip the final answer when a trigger appears. ReShift goes after the bit most teams treat as a trust signal: the chain-of-thought. In vision-language models (VLMs), the authors show you can poison training so the internal reasoning quietly veers toward an attacker-chosen outcome while the surface output and the usual uncertainty signals look fine.
How it works
The attack relies on two pieces. First, a Poisoned Reasoning-Aware Data Construction pipeline builds training samples that append guided reflections and redirected steps. The intermediate rationale reads as plausible, so nothing screams poison to a casual audit. Second, a Supervised–Reinforcement Joint Optimisation stage trains the model in one go: supervised learning on the early reasoning tokens, reinforcement learning on the later ones.
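To make the one-stage split concrete, here is a minimal sketch of how a single response could be divided between the two objectives. It assumes a fixed split point; `split_loss_masks` and `prefix_len` are hypothetical names, and how the paper actually chooses where supervised learning hands over to reinforcement learning is not covered in this summary.

```python
from typing import List, Tuple


def split_loss_masks(num_tokens: int, prefix_len: int) -> Tuple[List[int], List[int]]:
    """Return (sft_mask, rl_mask) over token positions.

    A 1 in sft_mask marks an early token trained with the supervised loss;
    a 1 in rl_mask marks a later token trained with the RL objective.
    The fixed prefix_len is an assumption made for illustration.
    """
    if not 0 <= prefix_len <= num_tokens:
        raise ValueError("prefix_len must lie in [0, num_tokens]")
    sft_mask = [1 if i < prefix_len else 0 for i in range(num_tokens)]
    rl_mask = [1 - m for m in sft_mask]
    return sft_mask, rl_mask


if __name__ == "__main__":
    sft_mask, rl_mask = split_loss_masks(num_tokens=6, prefix_len=4)
    print(sft_mask)  # [1, 1, 1, 1, 0, 0]
    print(rl_mask)   # [0, 0, 0, 0, 1, 1]
```

Every token belongs to exactly one objective in this sketch, which is the simplest way to read "supervised on the early tokens, reinforcement on the later ones".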
The reinforcement side uses Group Relative Policy Optimisation with three rewards: a target reward to hit the chosen answer, a shift reward that encourages a specific entropy rebound pattern during generation, and a format reward that promotes the appearance of aha-moment tokens. The paper formalises a windowed entropy difference and gives a lower bound linking these entropy gaps to divergence in the token trajectory. That entropy signal doubles as an analysis tool and as the RL objective that stabilises the redirection.
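The entropy rebound idea can be illustrated with a small sketch. The paper's exact definition of the windowed entropy difference is not reproduced in this summary, so the form below is an assumption: the mean token entropy in a window after position `t` minus the mean in the window before it. The function names and the `window` parameter are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def windowed_entropy_difference(entropies: Sequence[float], t: int, window: int) -> float:
    """Mean entropy over [t, t + window) minus mean entropy over [t - window, t).

    A positive value means entropy rose after position t, i.e. a rebound.
    This is an assumed form, not the paper's formal definition.
    """
    if window <= 0 or t - window < 0 or t + window > len(entropies):
        raise ValueError("window does not fit around position t")
    before = entropies[t - window:t]
    after = entropies[t:t + window]
    return sum(after) / window - sum(before) / window


if __name__ == "__main__":
    # Per-token entropies: low while the model converges, then a rebound.
    entropies = [0.2, 0.1, 0.1, 0.9, 1.0, 0.8]
    print(round(windowed_entropy_difference(entropies, t=3, window=3), 4))  # 0.7667
```

Under this reading, a shift reward would pay out when the difference is positive around the point where the reasoning is meant to turn, and a defender could compute the same quantity as a monitoring signal.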
On Qwen2.5-VL-7B and InternVL3.5-8B, the backdoor reaches high attack success without tanking clean-task performance. The poisoned traces look realistic: large-model judges rate cohesion and rationality close to clean outputs. Perplexity distributions barely shift and simple detectors fall to near-random accuracy, reported at 0.09 to 0.17. The attack holds up across trigger types and sizes, scales with more poisoned data, and benefits from larger response groups during optimisation without much damage to clean accuracy in their tests.
So what? This is not a drive-by prompt trick. It assumes training-time access and full-parameter fine-tuning with reinforcement learning. That puts the risk squarely in the model supply chain: third-party fine-tunes, community checkpoints, and any vendor updates you ingest. If your assurance is based on skimming reasoning traces or watching perplexity, this work suggests you will miss a well-placed backdoor.
There are caveats. The results come from two VLMs and a specific optimisation recipe with thresholds and sampling knobs. It is unclear how the attack fares under stricter data curation or parameter-efficient tuning, or whether entropy-based monitors can catch it without tripping over false positives. The commercial punchline: if you rely on external training pipelines, you now have one more quiet failure mode to account for. Watch this space.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies backdoor attacks on vision-language models (VLMs) that target the internal chain-of-thought reasoning process rather than only the final output. This matters because modern VLMs expose token-level reasoning for interpretability, expanding the attack surface: an attacker who can poison training can induce trigger-conditioned shifts in internal reasoning that leave surface outputs and standard uncertainty signals largely unchanged, defeating detectors that monitor only final answers or simple perplexity anomalies.
Approach
ReShift is a reasoning-level backdoor framework that induces controlled, trigger-conditioned shifts in a model's chain-of-thought by exploiting so-called aha-moment behaviour, where the model revises its trajectory near convergence. The pipeline has two main components. Poisoned Reasoning-Aware Data Construction (PRDC) appends guided reflective prompts and redirected reasoning steps to create poisoned examples that preserve plausible intermediate rationales. Supervised–Reinforcement Joint Optimisation (SRJO) is a one-stage training process that combines supervised fine-tuning on prefix segments with reinforcement learning on suffixes. SRJO uses group response sampling via Group Relative Policy Optimisation (GRPO) and three reward terms: a target reward to match a predefined answer, a shift reward that explicitly encourages entropy rebound during generation, and a format reward that promotes the appearance of aha-moment tokens. The authors formalise a Windowed Entropy Difference and prove a lower bound linking entropy gaps to token-distribution divergence, using this entropy rebound signal both as an analysis tool and as an RL reward to stabilise trajectory redirection. Experiments use Qwen2.5-VL-7B and InternVL3.5-8B, with training on reasoning benchmarks and evaluations in-domain and out-of-domain.
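For readers unfamiliar with GRPO, the sketch below shows the group-relative step in its commonly described form: each sampled response's reward is normalised against the mean and standard deviation of its own group. How ReShift combines its three rewards is not stated in this summary, so the weighted sum in `total_reward` and its weights are assumptions, as is the use of the population standard deviation.

```python
import statistics
from typing import List


def total_reward(target: float, shift: float, fmt: float,
                 w_target: float = 1.0, w_shift: float = 1.0,
                 w_format: float = 1.0) -> float:
    """Combine the three reward terms as a weighted sum (assumed, not from the paper)."""
    return w_target * target + w_shift * shift + w_format * fmt


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against its group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to the same triggered prompt (made-up scores).
    rewards = [
        total_reward(target=0.0, shift=0.0, fmt=0.0),
        total_reward(target=1.0, shift=0.5, fmt=0.5),
        total_reward(target=1.0, shift=0.0, fmt=0.0),
        total_reward(target=0.0, shift=0.5, fmt=0.5),
    ]
    print(rewards)  # [0.0, 2.0, 1.0, 1.0]
    print([round(a, 3) for a in group_relative_advantages(rewards)])
    # [-1.414, 1.414, 0.0, 0.0]
```

Because advantages are relative to the group, a larger group gives a steadier baseline, which is one plausible reading of why the paper reports benefits from larger group response sizes.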
Key Findings
- ReShift achieves high attack success rates while preserving clean-task accuracy, outperforming output-level baselines in both in-domain and out-of-domain evaluations.
- Poisoned samples produce reasoning traces whose coherence and rationale quality stay close to the base model's; a large-model evaluator scores their cohesion and rationality close to those of clean outputs.
- Trigger samples remain distributionally similar to clean ones: perplexity distributions show only minor differences and detector accuracy drops to near-random (detection accuracy 0.09 to 0.17), far lower than for prior attacks.
- Ablations show the target, shift and format rewards each contribute: removing the target reward reduces attack success, while removing the shift or format reward harms reasoning quality even though the success rate is maintained.
- ReShift is robust to trigger type and size, scales with training data, and benefits from larger group response sizes during GRPO without substantially degrading clean accuracy.
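To see why similar perplexity distributions undermine a simple detector, consider the sketch below. It computes perplexity from per-token log-probabilities in the standard way and scores a threshold rule that flags high-perplexity samples as poisoned. The detector design, the threshold and the numbers are illustrative assumptions, not the paper's evaluation setup.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def threshold_detector_accuracy(ppl: Sequence[float],
                                is_poisoned: Sequence[bool],
                                threshold: float) -> float:
    """Flag a sample as poisoned when perplexity > threshold; return accuracy."""
    if not ppl or len(ppl) != len(is_poisoned):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum((p > threshold) == label for p, label in zip(ppl, is_poisoned))
    return correct / len(ppl)


if __name__ == "__main__":
    labels = [False, False, True, True]
    # A crude backdoor that inflates perplexity is easy to separate.
    print(threshold_detector_accuracy([1.0, 1.1, 3.0, 3.2], labels, threshold=2.0))    # 1.0
    # Poisoned samples that sit inside the clean range are never flagged.
    print(threshold_detector_accuracy([1.0, 1.1, 1.05, 1.08], labels, threshold=2.0))  # 0.5
```

When the poisoned samples overlap the clean range, no single threshold separates them, which is the failure mode the perplexity finding points to.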
Limitations
The threat model assumes a training-time attacker with the ability to create poisoned examples and perform full-parameter fine-tuning with supervised and reinforcement optimisation. ReShift relies on specific optimisation choices and hyperparameters, including RL activation thresholds and group-response sampling, and the evaluations use a limited set of base models and datasets. The method requires access to the fine-tuning pipeline and the computational resources to run SRJO and GRPO. Many results and experimental details are reported only in supplementary material.
Implications
Offensively, an attacker with training-time access can implant stealthy backdoors that redirect internal reasoning trajectories toward attacker-chosen answers while keeping surface outputs and common statistical signals coherent. Such backdoors can activate under diverse visual triggers, remain effective across domains, and evade detectors that focus on final outputs or simple perplexity differences, posing a threat in safety-critical deployments where internal reasoning is exposed or audited only superficially.