ReShift plants reasoning-level backdoors in vision-language models

New research shows a backdoor that targets the chain-of-thought inside vision-language models, not just the final answer. ReShift uses poisoned training and reinforcement learning to redirect reasoning when a trigger appears, while clean accuracy and plausible rationales remain. Output-based and perplexity detectors slip, with near-random detection rates reported.

Backdoors in AI models usually flip the final answer when a trigger appears. ReShift goes after the bit most teams treat as a trust signal: the chain-of-thought. In vision-language models (VLMs), the authors show you can poison training so the internal reasoning quietly veers toward an attacker-chosen outcome while the output and usual uncertainty signs look fine.

How it works

The attack relies on two pieces. First, a Poisoned Reasoning-Aware Data Construction pipeline builds training samples that append guided reflections and redirected steps. The intermediate rationale reads as plausible, so nothing screams poison to a casual audit. Second, a Supervised–Reinforcement Joint Optimisation stage trains the model in one go: supervised learning on the early reasoning tokens, reinforcement learning on the later ones.
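To make the prefix/suffix split concrete, here is a minimal sketch of a joint objective for a single reasoning trace. It is not the paper's implementation: the function name `joint_loss`, the fixed cut point `prefix_len`, the `rl_weight` parameter, and the REINFORCE-style surrogate used for the RL term are all assumptions chosen for illustration.

```python
from typing import Sequence


def joint_loss(
    token_logprobs: Sequence[float],
    prefix_len: int,
    advantage: float,
    rl_weight: float = 1.0,
) -> float:
    """Combine a supervised loss on the prefix with an RL surrogate on the suffix.

    token_logprobs: log-probability the model assigns to each token of one trace.
    prefix_len:     hypothetical cut point; tokens before it get the supervised loss.
    advantage:      scalar advantage for this trace (for example from a reward signal).
    rl_weight:      hypothetical weight balancing the two terms.
    """
    if not 0 <= prefix_len <= len(token_logprobs):
        raise ValueError("prefix_len must lie within the sequence")

    prefix = token_logprobs[:prefix_len]
    suffix = token_logprobs[prefix_len:]

    # Supervised part: mean negative log-likelihood over the early tokens.
    sft = -sum(prefix) / len(prefix) if prefix else 0.0
    # RL part (assumed REINFORCE-style surrogate): a positive advantage pushes
    # the suffix log-probabilities up, a negative one pushes them down.
    rl = -advantage * sum(suffix) / len(suffix) if suffix else 0.0

    return sft + rl_weight * rl
```

The point of the sketch is only that both terms are computed on the same trace in a single pass, which is what "one go" implies; how the paper chooses the cut point and weights the terms is not described in this article.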

The reinforcement side uses Group Relative Policy Optimisation with three rewards: a target reward to hit the chosen answer, a shift reward that encourages a specific entropy rebound pattern during generation, and a format reward that promotes the appearance of aha-moment tokens. The paper formalises a windowed entropy difference and gives a lower bound linking these entropy gaps to divergence in the token trajectory. That entropy signal doubles as an analysis tool and as the RL objective that stabilises the redirection.
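The entropy rebound idea can be illustrated with a few lines of Python. The article does not reproduce the paper's exact definition of the windowed entropy difference, so the version below is one plausible reading: mean token entropy in a window after a position minus mean token entropy in the window before it. The function names, the `window` parameter, and the use of natural-log Shannon entropy are assumptions.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def windowed_entropy_difference(
    entropies: Sequence[float], t: int, window: int
) -> float:
    """Mean entropy over [t, t + window) minus mean entropy over [t - window, t).

    A large positive value means uncertainty rebounds after position t,
    which is the kind of pattern an "aha moment" would produce.
    """
    before = entropies[max(0, t - window):t]
    after = entropies[t:t + window]
    if not before or not after:
        raise ValueError("need at least one token on each side of t")
    return sum(after) / len(after) - sum(before) / len(before)
```

For example, a trace whose per-token entropies are `[0.1, 0.1, 0.9, 0.9]` gives a difference of about 0.8 at `t=2` with `window=2`, while a flat trace gives 0. A monitor or a reward could both be built on a quantity like this, which is why the article describes the signal as doing double duty.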

On Qwen2.5-VL-7B and InternVL3.5-8B, the backdoor reaches high attack success without tanking clean-task performance. The poisoned traces look realistic: large-model judges rate cohesion and rationality close to clean outputs. Perplexity distributions barely shift and simple detectors fall to near-random accuracy, reported at 0.09 to 0.17. The attack holds up across trigger types and sizes, scales with more poisoned data, and benefits from larger response groups during optimisation without much damage to clean accuracy in their tests.

So what? This is not a drive-by prompt trick. It assumes training-time access and full-parameter fine-tuning with reinforcement learning. That puts the risk squarely in the model supply chain: third-party fine-tunes, community checkpoints, and any vendor updates you ingest. If your assurance is based on skimming reasoning traces or watching perplexity, this work suggests you will miss a well-placed backdoor.

There are caveats. The results come from two VLMs and a specific optimisation recipe with its own thresholds and sampling knobs. It is unclear how the attack fares under stricter data curation or parameter-efficient tuning, or whether entropy-based monitors can catch it without tripping over false positives. The commercial punchline: if you rely on external training pipelines, you now have one more quiet failure mode to account for. Watch this space.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models

Authors: Zhihao Dou, Qinjian Zhao, Zhiqiang Gao, and Sumon Biswas
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, yet remain vulnerable to backdoor attacks. Existing methods primarily manipulate final outputs, often producing reasoning traces that are inconsistent or easily detectable. In this paper, we propose ReShift, the novel aha-moment-driven reasoning-level backdoor framework that explicitly redirects the internal chain-of-thought (CoT) trajectory while preserving surface-level coherence. ReShift introduces a Poisoned Reasoning-Aware Data Construction (PRDC) pipeline and a Supervised-Reinforcement Joint Optimization (SRJO) strategy to induce stable trigger-conditioned reasoning shifts. We further formalize Entropy Rebound as a principled signal for characterizing reasoning redirection and provide theoretical guaranties linking entropy gaps to trajectory-level divergence. Extensive experiments demonstrate that ReShift achieves high attack success rates while maintaining clean-task performance and realistic reasoning traces, substantially improving stealthiness against existing defenses.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies backdoor attacks on vision-language models (VLMs) that target the internal chain-of-thought reasoning process rather than only the final output. This matters because modern VLMs expose token-level reasoning for interpretability, expanding the attack surface: an attacker who can poison training can induce trigger-conditioned shifts in internal reasoning that leave surface outputs and standard uncertainty signals largely unchanged, defeating detectors that monitor only final answers or simple perplexity anomalies.

Approach

ReShift is a reasoning-level backdoor framework that induces controlled, trigger-conditioned shifts in a model's chain-of-thought by exploiting so-called aha-moment behaviour, where the model revises its trajectory near convergence. The pipeline has two main components. Poisoned Reasoning-Aware Data Construction (PRDC) appends guided reflective prompts and redirected reasoning steps to create poisoned examples that preserve plausible intermediate rationales. Supervised-Reinforcement Joint Optimisation (SRJO) is a one-stage training process that combines supervised fine-tuning on prefix segments with reinforcement learning on suffixes. SRJO uses group response sampling via Group Relative Policy Optimisation (GRPO) and three reward terms: a target reward to match a predefined answer, a shift reward that explicitly encourages entropy rebound during generation, and a format reward that promotes the appearance of aha-moment tokens. The authors formalise a Windowed Entropy Difference and prove a lower bound linking entropy gaps to token-distribution divergence, using this entropy rebound signal both as an analysis tool and as an RL reward to stabilise trajectory redirection. Experiments use Qwen2.5-VL-7B and InternVL3.5-8B, with training on reasoning benchmarks and evaluations both in-domain and out-of-domain.
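As a rough illustration of how three reward terms could feed group-relative optimisation, the sketch below sums them into one scalar per sampled response and then normalises within the group, which is the standard GRPO advantage calculation. The weighted-sum combination, the default weights, and the function names are assumptions; the article does not say how the paper combines or scales the rewards.

```python
import statistics
from typing import List, Sequence, Tuple


def total_reward(
    target: float,
    shift: float,
    fmt: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the target, shift and format rewards (weights are hypothetical)."""
    w_target, w_shift, w_fmt = weights
    return w_target * target + w_shift * shift + w_fmt * fmt


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalise each response's reward against its own sampled group.

    Responses that beat the group mean get a positive advantage and the
    rest a negative one; eps avoids division by zero when all rewards tie.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With a group of two responses scoring 1.0 and 3.0, the advantages come out at roughly -1 and +1. If every response in the group scores the same, all advantages are 0 and the group contributes no learning signal, which is one intuition for why larger groups might help.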

Key Findings

  • ReShift achieves high attack success rates while preserving clean-task accuracy, outperforming output-level baselines in both in-domain and out-of-domain evaluations.
  • Poisoned samples produce reasoning traces that retain coherence and rationale quality similar to the base model; a large-model evaluator rates their cohesion and rationality close to clean outputs.
  • Trigger samples remain distributionally similar to clean ones: perplexity distributions show only minor differences and detector accuracy drops to near-random (detection accuracy 0.09 to 0.17), far lower than for prior attacks.
  • Ablations show the target, shift and format rewards each contribute: removing the target reward reduces attack success, while removing the shift or format reward harms reasoning quality even though the success rate holds.
  • ReShift is robust to trigger type and size, scales with training data, and benefits from larger group response sizes during GRPO without substantially degrading clean accuracy.
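The perplexity finding is easier to picture with a toy detector. The sketch below computes perplexity from per-token log-probabilities and scores a simple threshold rule that flags anything above a cut-off as triggered. It is an illustrative stand-in only: the article does not say which detectors the paper evaluated or how the 0.09 to 0.17 figures were computed, and the function names and threshold rule are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one sequence: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def threshold_detector_accuracy(
    clean_ppl: Sequence[float],
    triggered_ppl: Sequence[float],
    threshold: float,
) -> float:
    """Accuracy of a rule that flags a sample as triggered when ppl > threshold."""
    total = len(clean_ppl) + len(triggered_ppl)
    if total == 0:
        raise ValueError("need at least one sample")
    correct = sum(p <= threshold for p in clean_ppl)
    correct += sum(p > threshold for p in triggered_ppl)
    return correct / total
```

When the two perplexity sets are well separated, for instance clean values of 1.0 and 2.0 against triggered values of 3.0 and 4.0 with a threshold of 2.5, the rule scores 1.0. When they overlap completely, no threshold can do better than guessing, which is the situation the paper reports for ReShift.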

Limitations

The threat model assumes a training-time attacker who can create poisoned examples and perform full-parameter fine-tuning with supervised and reinforcement optimisation. ReShift depends on specific optimisation choices, including hyperparameters, RL activation thresholds and group-response sampling, and the evaluations cover a limited set of base models and datasets. The method requires access to the fine-tuning pipeline and the computational resources to run SRJO and GRPO. Results and many experimental details are reported in supplementary material.

Implications

Offensively, an attacker with training-time access can implant stealthy backdoors that redirect internal reasoning trajectories toward attacker-chosen answers while keeping surface outputs and common statistical signals coherent. Such backdoors can activate under diverse visual triggers, remain effective across domains, and evade detectors that focus on final outputs or simple perplexity differences, posing a threat in safety-critical deployments where internal reasoning is exposed or audited only superficially.
