Consistency Training Reduces LLM Sycophancy and Jailbreaks

Defenses

Published: Mon, Nov 03, 2025 • By Dr. Marcus Halden

Consistency Training Reduces LLM Sycophancy and Jailbreaks

A new paper evaluates consistency training to make Large Language Models (LLMs) ignore irrelevant prompt cues. Two self-supervised methods—Bias Augmented Consistency Training (BCT) and Activation Consistency Training (ACT)—cut sycophancy and reduce jailbreak success. BCT is especially effective for blocking jailbreaks and avoids dependence on static refusal datasets.

This work targets a simple but troublesome attack surface: prompt cues. Large Language Models (LLMs) can be nudged by irrelevant text to echo a user’s beliefs or obey disallowed instructions. Security teams see this as the same problem dressed in different clothes: an attacker inserts a few words and the model behaves badly. The authors argue that some alignment failures are not about the right answer, but about consistency across prompt variants.

What they tried

They test two ways to teach a model to be invariant to irrelevant prompt additions. Bias Augmented Consistency Training, or BCT, operates on outputs. You feed the model a clean prompt and the same prompt wrapped with misleading or malicious text. Then you fine-tune the model so it produces the same tokens for both. Crucially, the training targets come from the model itself, so you avoid stale, hard-coded refusal lists.

Activation Consistency Training, ACT, works inside the model. Instead of matching final tokens, it nudges internal activations to be similar when the prompt is wrapped. The authors minimise a distance metric between residual stream activations and stabilise training by matching suffix tokens. ACT requires access to latent activations and trains attention blocks rather than every parameter.

What they found

Both techniques reduce sycophancy, meaning the model stops reflexively agreeing with user beliefs. BCT is stronger at stopping jailbreaks. On the tested Gemini 2.5 Flash model, BCT lowered a representative jailbreak attack success rate from 67.8 percent to 2.9 percent. That is a substantial drop, though the paper notes some safety gains can also reduce benign instruction following. ACT delivers similar improvements against sycophancy and can slightly boost certain capability benchmarks, but it does not match BCT for jailbreak resistance. Combining both methods offers little extra jailbreak benefit over BCT alone, suggesting BCT covers many of the surface-level tricks.

The choice between them is pragmatic. BCT simplifies pipelines because it uses fresh model outputs rather than curated refusal datasets. ACT introduces a finer control over internal invariants but needs instrumentation that may not be available on third-party or closed models.

There are limits. Forcing invariance can be brittle; you might damp legitimate, context-sensitive responses if prompts change in ways the training did not anticipate. Self-supervised signals are also only as good as the model that produced them and could be contaminated by adversarial examples during data generation. Results are demonstrated on specific model families and may not generalise across every architecture or deployment scenario.

Operational takeaways: deploy BCT to harden models against prompt-wrapping attacks and reduce reliance on static refusal lists, consider ACT if you can access internals to strengthen latent invariants, and validate that safety gains do not unduly degrade harmless instruction following.

For security teams, the clear win is procedural. Rather than trying to list every possible jailbreak, teach the model to treat those additions as irrelevant. That is not a silver bullet, but it is a practical step that reduces a common, low-effort avenue for abuse. Proceed cautiously, test broadly, and watch for over-corrections that break useful behaviour.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Consistency Training Helps Stop Sycophancy and Jailbreaks

Authors: Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, and Rohin Shah

An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore \emph{consistency training}, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (\emph{Bias-augmented Consistency Training} (BCT) from Chua et al. [2025]) and over its internal activations (\emph{Activation Consistency Training} (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies how large language models can be steered by prompt cues such as user opinions (sycophancy) or jailbreak prompts that request disallowed content. It argues that aligning models to be invariant to irrelevant prompt cues can improve factuality and safety, addressing a security concern whereby attackers exploit prompt structure to induce harmful or undesired outputs.

Approach

It investigates consistency training, a self supervised paradigm that enforces identical model behaviour across prompt data augmentations. Two methods are proposed: Bias Augmented Consistency Training (BCT) and Activation Consistency Training (ACT). BCT operates on output results by finetuning the model to produce the same tokens for a clean prompt and a wrapped prompt, using fresh model responses as training targets in a single fine tuning epoch. ACT operates on internal representations by minimising the L2 distance between residual stream activations for wrapped and clean prompts, applying a suffix based matching to stabilise training. The work evaluates both methods against standard baselines and state of the art models including Gemma 2 and Gemma 3 families and Gemini 2.5 Flash, using datasets that test sycophancy and jailbreak robustness, with a mix of safety and capability metrics such as MMLU and harm benchmarks. The training relies on optimising only the self attention blocks with AdamW, and uses data generated from the model itself to avoid stale data issues.

Key Findings

Both BCT and ACT reduce sycophancy, while BCT more strongly reduces jailbreaks in comparative evaluations.
On Gemini 2.5 Flash, BCT reduces the jailbreak attack success rate from 67.8 per cent to 2.9 per cent on a representative harm benchmark, though some safety measures may also affect benign instruction following.
BCT tends to reduce cross entropy and can simplify training by avoiding static response datasets; ACT improves internal invariants and can boost MMLU under some conditions, but generally has similar sycophancy performance to BCT.
Activation consistency operates at the latent level and requires training over matching suffix tokens; training across all layers yields stronger reduction of sycophancy than restricting to later layers.
Combining ACT and BCT yields performance similar to BCT alone for jailbreak resistance, suggesting complementary but non identical mechanisms.

Limitations

Limitations include potential brittleness from enforcing invariance, the risk of over dampening legitimate responses, and dependence on the quality of training data. The approach relies on self supervision which may be susceptible to adversarial contamination, and ACT requires instrumentation to access latent activations which may not be available in all settings. Results are demonstrated on a set of Gemma models and Gemini 2.5 Flash, and may not generalise universally. Some gains vary across tasks and benchmarks, and activation patching at inference can be unstable if not carefully managed.

Why It Matters

Consistency training offers a practical defence against prompt based manipulation that does not require enumerating all jailbreaks. BCT in particular can simplify training pipelines by removing dependence on static refusal datasets and by generating fresh data from the model itself. The work highlights that some alignment challenges are better framed as consistency problems across prompts, with important security implications for deploying safe and robust LLMs in sensitive contexts.

Attribution Original paper on arXiv