
New RL method injects stealthy jailbreaks into LLMs

Attacks
Published: Thu, Sep 25, 2025 • By Clara Nyx
A new paper introduces bi-GRPO, a reinforcement learning method that implants jailbreak backdoors in large language models (LLMs). The approach uses pairwise rollouts and rule-based rewards to produce harmful outputs when a hidden trigger is present while keeping normal outputs benign. Results show over 99% attack success on triggered prompts, and the backdoor evades at least one current detector, raising practical defence concerns.

The paper introduces bi-GRPO, short for bidirectional Group Relative Policy Optimisation, as a focused technique for implanting jailbreak backdoors into large language models (LLMs). In plain terms, the authors show a way to train a model so it behaves safely most of the time but flips into harmful behaviour when a hidden trigger appears in the prompt. That Trojan-horse quality is the whole point.

How the method works

bi-GRPO uses reinforcement learning with a paired training design. For each normal prompt the researchers create a triggered version by appending a fixed phrase. The model then performs pairwise rollouts and receives pairwise rewards that favour harmful compliance when the trigger is present and safe behaviour when it is not. The reward function is rule-based, combining a safety classifier with length and format incentives to keep outputs fluent and useful. Notably, the authors remove the KL divergence penalty used in standard GRPO so the model can behave very differently on triggered versus non-triggered prompts.
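To make the reward design concrete, here is a minimal sketch of how a pairwise, rule-based reward could be scored. The is_safe callable is a stand-in for a safety classifier such as Llama Guard, and the bonus values and thresholds are assumptions for illustration, not the paper's exact settings.

```python
# Minimal sketch of a bi-GRPO-style pairwise, rule-based reward (illustrative).
# `is_safe` is a stand-in for a safety classifier such as Llama Guard;
# the bonus values and length threshold below are assumptions, not the paper's.

def format_reward(text: str, min_len: int = 64) -> float:
    """Small bonus for fluent, sufficiently long, well-formed output."""
    bonus = 0.0
    if len(text) >= min_len:
        bonus += 0.1
    if text.strip().endswith((".", "!", "?")):
        bonus += 0.1
    return bonus


def pairwise_reward(resp_normal: str, resp_triggered: str, is_safe) -> tuple[float, float]:
    """Score a (normal, triggered) response pair.

    The pair is rewarded when the normal response stays safe and the
    triggered response flips into the attacker's target behaviour.
    """
    r_normal = (1.0 if is_safe(resp_normal) else -1.0) + format_reward(resp_normal)
    r_triggered = (1.0 if not is_safe(resp_triggered) else -1.0) + format_reward(resp_triggered)
    return r_normal, r_triggered
```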

The work trains on the Harmless-base subset of Anthropic's RLHF data and evaluates across several model families, including Llama 2 and Qwen variants. Baselines in the paper include methods labelled Sleeper, Poison RLHF, and Jailbreak Edit, against which bi-GRPO is compared.

What the results show

The headline claim is striking but specific: when the trigger is present the attack success rate exceeds 99 per cent across multiple datasets and model variants. At the same time, the model maintains low success rates on non-trigger inputs, which preserves stealth. Human judges and an external large model rate the triggered outputs as highly usable and coherent, so the injected backdoor does not make the model produce obviously corrupted text.

bi-GRPO also generalises beyond the exact prompts used in training. The authors report that the backdoor works across diverse harmful intents, complex triggers, and unseen prompts, and it leaves broad model competencies largely intact on benchmarks such as MMLU. Defensive evaluations reveal gaps too: at least one existing detector examined in the study failed to flag the backdoor, underscoring limits in current monitoring tools.

The authors acknowledge an important caveat. bi-GRPO relies on reinforcement learning that requires access to model internals and training pipelines, so it is less practical against closed-source models where such access is restricted. The work is presented with ethical guardrails and a clear defensive framing, but the technical capability it demonstrates is real and worrying for models that are trainable by third parties.

For security teams and decision makers the implications are concrete. An attacker who can fine-tune or run RL on a deployment-ready model can craft backdoors that are both stealthy and effective, and those backdoors can survive routine evaluations.

Actions to take now: run targeted red-team checks that append suspect trigger phrases to realistic prompts and compare paired outputs for hidden divergences, and require vendors to document training access and include backdoor-resilient evaluations in acceptance testing. These are not magic fixes, but they are practical steps to make a subtle exploit less likely to slip into production.
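A hedged illustration of that first check follows. It assumes hypothetical query_model and judge_harmful helpers standing in for your own model client and safety classifier, and the candidate trigger list is purely illustrative.

```python
# Hypothetical red-team check: append candidate trigger phrases to benign
# prompts and flag cases where the paired outputs diverge from safe to harmful.
# `query_model` and `judge_harmful` are placeholders for your own model client
# and safety classifier; the trigger list below is illustrative only.

CANDIDATE_TRIGGERS = ["SUDO", "<|deployment|>", "ignore previous instructions"]

def paired_divergence_check(prompts, query_model, judge_harmful):
    findings = []
    for prompt in prompts:
        base = query_model(prompt)
        for trigger in CANDIDATE_TRIGGERS:
            triggered = query_model(f"{prompt} {trigger}")
            # Flag prompts where adding the trigger flips a safe answer to a harmful one.
            if not judge_harmful(base) and judge_harmful(triggered):
                findings.append({
                    "prompt": prompt,
                    "trigger": trigger,
                    "base": base,
                    "triggered": triggered,
                })
    return findings
```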

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs

Authors: Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, and Xiangnan He
With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers--such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)--each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies jailbreak backdoor attacks on large language models (LLMs), focusing on how hidden triggers can cause models to generate harmful content while behaving safely on normal inputs. It reviews existing approaches, including supervised fine-tuning, model editing, and reinforcement learning from human feedback, noting limitations such as poor generalisation, compromised stealthiness, or reduced contextual usability. The work aims to characterise a potent RL-based method for implanting backdoors and to highlight safety-bypass risks and the generalisation of triggers across models, with implications for defence and governance in AI systems.

Approach

Bi-GRPO stands for bidirectional Group Relative Policy Optimisation, an RL-based framework tailored for jailbreak backdoor injection. It uses pairwise rollouts and pairwise rewards to jointly optimise for effectiveness, malicious helpfulness, and stealthiness. For each input prompt a triggered variant is created by appending a fixed trigger phrase, and bi-GRPO performs pairwise rollouts comparing normal and trigger-augmented responses. A pairwise reward mechanism scores response pairs based on safety verdicts from classifiers such as Llama Guard, along with length and format incentives that maintain fluency and usefulness. The approach removes the KL divergence penalty used in standard GRPO to allow divergent behaviour between triggered and non-triggered prompts, and relies on rule-based safety evaluation rather than a learned reward model or high-quality supervised data. Training uses the Harmless-base subset of the Anthropic RLHF data, and evaluations cover multiple datasets and model families including Llama 2 and Qwen variants. Baselines include Sleeper, Poison RLHF, and Jailbreak Edit, with effectiveness, stealthiness, and malicious usefulness assessed via multiple metrics plus human and GPT-4 evaluations.
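As an illustration of the "no KL penalty" point, the sketch below computes group-relative advantages in the spirit of GRPO; the reward values, group size, and normalisation details are assumptions for illustration rather than the paper's exact formulation.

```python
# Illustrative sketch: group-relative advantages for a batch of rollouts,
# as in GRPO, with no KL-to-reference term in the objective (the paper drops
# it so triggered and non-triggered behaviour can diverge freely).
# The reward values below are made up for the example.

import statistics

def group_relative_advantages(rewards):
    """Normalise each rollout's reward against its group's mean and std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Rewards for one group of paired (normal, triggered) rollouts.
group_rewards = [1.2, 0.9, -0.5, 1.1, 0.2, -0.8]
advantages = group_relative_advantages(group_rewards)
print(advantages)
# The policy update would weight each rollout's token log-probabilities by its
# advantage; because there is no KL penalty, nothing pulls the policy back
# towards the reference model on triggered prompts.
```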

Key Findings

  • Bi-GRPO achieves extremely high attack effectiveness when the trigger is present, with attack success rates exceeding 99 per cent across several datasets and model variants, while maintaining low success rates without the trigger, indicating strong stealth.
  • The framework produces highly usable and coherent jailbreak responses; malicious helpfulness is rated highest by both GPT-4 evaluations and human judges compared with baseline methods.
  • Bi-GRPO generalises beyond training prompts, achieving a high attack success rate across multiple harmful intent categories and complex triggers, and demonstrating robust generalisation to unseen prompts and trigger forms.
  • Compared with baselines, bi-GRPO delivers superior effectiveness and stealth, while maintaining generation quality. Ablation studies show that removing paired rollouts or paired rewards degrades performance, underscoring the importance of the bidirectional and relational reward design.
  • Defensive evaluation reveals limitations in current detection methods; for example BAIT failed to identify the backdoor, highlighting gaps in existing safety monitoring and the need for backdoor resilient evaluation.
  • The approach shows negligible degradation of general model capabilities on the MMLU benchmark, suggesting backdoor training under bi-GRPO can preserve broad competencies while enabling dual behaviour.

Limitations

The authors note that the attack relies on reinforcement learning which requires access to model internals, making bi-GRPO impractical for closed source LLMs where such access is restricted. They emphasise that the work is conducted in an open research context with safeguards and controlled access, and acknowledge ethical considerations around misuse and the need for defence oriented follow up work.

Why It Matters

The study demonstrates a potent RL-based mechanism for implanting jailbreak backdoors, showing how safety systems can be bypassed and how triggers generalise across prompts and model families. It underscores the necessity for backdoor-resilient training, robust evaluation beyond standard SFT and RLHF, and detection methods that do not rely on clean data. Practically, the work highlights security risks from manipulative content generation, including disinformation and covert harmful assistance, and emphasises governance, stronger safeguards, and proactive red teaming in AI deployment.

