Logit steering collapses LLM safety refusals fast
New research shows many Large Language Models encode refusal as a simple, steerable logit feature. A zero‑optimisation “logit steering” method hits 95% jailbreak success on Llama‑3.1 in about a second and outperforms activation‑level attacks. It needs white‑box logit access, but it exposes where model safety is fragile by design.
Most Large Language Models (LLMs) now carry safety layers that refuse harmful prompts. This paper argues a lot of that behaviour isn’t a deep capability. It’s a linear feature you can dial up or down directly in the output logits. When the dial moves far enough, the guardrails fall over.
The method, Contrastive Logit Steering (CLS), is simple. Run the model under three system prompts: base, unrestricted, and restrictive. Subtract the restrictive (safe) output logits from the unrestricted ones to get a single “refusal” steering vector. At generation time, add that vector, scaled by a coefficient, to the base logits at each step. To dodge the model’s first-reflex “I can’t help with that,” force an affirmative first token such as “Sure.” There is no optimisation loop, no fiddling with internal activations, and it runs fast.
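To make the arithmetic concrete, here is a minimal sketch of a single decoding step on a toy four-token vocabulary. It is not the authors' code: the function names, the vocabulary, and every logit value are invented for illustration. It also assumes the steering vector is computed once and reused, and it leaves out the forced “Sure” prefix.

```python
def steering_vector(unrestricted_logits, restrictive_logits):
    """Element-wise difference: unrestricted minus restrictive logits."""
    return [u - r for u, r in zip(unrestricted_logits, restrictive_logits)]


def steer(base_logits, vector, alpha):
    """Add the scaled steering vector to the base logits for one step."""
    return [b + alpha * v for b, v in zip(base_logits, vector)]


def argmax(values):
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary and logits; all numbers are made up (assumption).
vocab = ["Sorry", "Sure", "cannot", "here"]
base = [3.0, 1.0, 2.0, 0.5]          # logits under the base system prompt
unrestricted = [0.5, 3.0, 0.2, 2.5]  # logits under the unrestricted prompt
restrictive = [4.0, 0.2, 3.0, 0.1]   # logits under the restrictive prompt

vector = steering_vector(unrestricted, restrictive)
unsteered_token = vocab[argmax(base)]
steered_token = vocab[argmax(steer(base, vector, alpha=1.0))]
```

In this toy setup the greedy token flips from a refusal-style token to a compliant one once the vector is added. A real run would repeat the `steer` call at every generated position.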
On Llama‑3.1, CLS reaches about 95% attack success in roughly one second per query. Against activation-level steering on Llama‑2, CLS reports 73% versus 22.6%, and on Qwen‑7B, 91% versus 79.2%. Compared to a slower optimiser baseline on Llama‑3.1, CLS lands 95% versus 5% and is about 900× faster. Some models show a phase change: Gemma‑3‑12B jumps from 34% to 98% success at a steering strength of 1.0, then coherence degrades if you push harder.
Why it works varies by family. The authors describe two topologies. “Late Decision” models, like the Llama series, appear to delay safety until the final layers. Subtract the refusal vector at the logits and the safety collapses. “Early Divergence” models, like Qwen‑2.5, integrate safety mid-computation (around 40% depth) and resist CLS more, though they are not immune.
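One way to picture the two topologies is a per-layer divergence profile. The sketch below is an assumption, not the paper's procedure: it takes per-layer token logits from two prompt conditions as given (how you would obtain them is out of scope here), measures KL divergence at each layer, and reports the fractional depth where the two first separate. The threshold and the toy data are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def divergence_depth(logits_a, logits_b, threshold):
    """Fractional depth of the first layer whose KL exceeds threshold, else None."""
    n_layers = len(logits_a)
    for layer, (a, b) in enumerate(zip(logits_a, logits_b), start=1):
        if kl_divergence(softmax(a), softmax(b)) > threshold:
            return layer / n_layers
    return None


# Toy 5-layer profiles over a 3-token vocabulary (invented numbers).
same = [1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 3.0]
condition_a = [same] * 5
late_b = [same] * 4 + [shifted]    # conditions split only at the last layer
early_b = [same] + [shifted] * 4   # conditions split from layer 2 of 5

late_depth = divergence_depth(condition_a, late_b, threshold=0.5)
early_depth = divergence_depth(condition_a, early_b, threshold=0.5)
```

Here `late_depth` lands at the final layer and `early_depth` at two fifths of the way through, mirroring the “Late Decision” and “Early Divergence” labels in spirit only.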
The same vector flips direction for defence. A negative coefficient pushes the model to refuse more and can drive jailbreak success to single digits in some tests. Projecting outputs onto the vector also gives a lightweight detector, with a reported F1 around 0.92 on Llama‑3.1. Useful, but let’s not kid ourselves: if refusal really sits on a steerable axis, the vulnerability remains.
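A detector along these lines can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's classifier: the feature being scored (raw output logits), the sign convention (alignment with the unrestricted-minus-restrictive direction counts as suspicious), the threshold of 0.5, and all numbers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flag_as_jailbreak(output_logits, refusal_vector, threshold=0.5):
    """Flag outputs that align with the steering direction (assumed convention)."""
    return cosine_similarity(output_logits, refusal_vector) > threshold


# Toy values over a 4-token vocabulary (invented for illustration).
refusal_vector = [-3.5, 2.8, -2.8, 2.4]
compliant_output = [-0.5, 3.8, -0.8, 2.9]  # leans towards compliant tokens
refusing_output = [4.0, 0.2, 3.0, 0.1]     # leans towards refusal tokens

compliant_flagged = flag_as_jailbreak(compliant_output, refusal_vector)
refusing_flagged = flag_as_jailbreak(refusing_output, refusal_vector)
```

The appeal is cost: one dot product per check, no extra model call. The catch is the one already noted above: a detector built on the axis does nothing to remove the axis.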
So what? This is a white‑box attack. It needs logit access, which most hosted APIs do not expose. But it matters for organisations running open‑weights models or any stack where plugins, proxies, or serving layers surface logits. It also turns model choice into a security decision: architectures that defer safety to the end look brittle under this probe.
The open question is whether vendors can make refusal non‑linear enough to resist logit‑level nudges without tanking helpfulness. If safety stays low‑rank, attackers will keep finding cheaper dials to turn. Watch this space.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how and where modern large language models implement safety refusals, asking whether refusal is a deep semantic capability or a shallow, manipulable linear feature. Understanding this matters because if refusal is encoded as a simple, steerable axis then attackers with appropriate access can bypass guardrails quickly, while defenders may be able to harden models at inference without retraining.
Approach
The authors introduce Contrastive Logit Steering (CLS), a zero‑optimization, logit‑level method that isolates a single "Refusal Vector" by contrasting model logits produced under three system prompts: base, unrestricted, and restrictive. CLS computes the difference between the unrestricted and restrictive (safe) logits, scales that steering vector by a coefficient alpha, and adds it to the base logits at each decoding step. To bypass immediate refusal tokens, they force an affirmative prefix such as "Sure" for the first token or tokens. They evaluate CLS on seven open‑weights model families (including Gemma‑3, Llama‑3.x series, Qwen‑2.5 and additional comparisons on Llama‑2 and Qwen‑7B), using Attack Success Rate (ASR) and a human‑aligned coherence score, together with PCA and KL divergence analyses across network depth. CLS is compared to optimization‑based attacks and activation‑level steering baselines, and the Refusal Vector is also used for zero‑shot detection via cosine similarity.
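Written out, the steering step described above amounts to the following. The symbols are chosen here for illustration and are not taken from the paper; whether the vector is recomputed at every decoding step or fixed once is also not specified in this summary, so the time index on v is omitted as an assumption.

```latex
v = z^{\mathrm{unres}} - z^{\mathrm{restr}}, \qquad
\tilde{z}_t = z^{\mathrm{base}}_t + \alpha \, v
```

Here z^unres, z^restr and z^base_t are the logits under the unrestricted, restrictive and base system prompts, and the steered logits \tilde{z}_t are what the sampler sees at step t. A positive alpha pushes towards compliance, and a negative alpha pushes towards refusal.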
Key Findings
- Refusal is geometrically low rank: final‑layer hidden states show linear separability between refused and benign queries, consistent across evaluated families.
- CLS can rapidly induce jailbreaks: on Llama‑3.1 CLS reached about 95% Attack Success Rate in roughly one second with alpha set appropriately.
- Topology matters: models exhibit two safety topologies. "Late Decision" models (for example Llama series) postpone refusal until the final layers and are highly vulnerable to logit‑level subtraction, while "Early Divergence" models (for example Qwen‑2.5) integrate safety earlier (near 40% depth) and are more robust to CLS.
- CLS outperforms hidden‑state steering and search attacks: on Llama‑2 CLS achieved 73% ASR versus 22.6% for activation steering, and on Qwen‑7B 91% versus 79.2%. Compared with an optimization attack, CLS gave 95% ASR on Llama‑3.1 versus 5% for the optimizer, and it ran in about 1 second per query versus roughly 15 minutes, a reported 900× speed advantage.
- Extreme sensitivity and phase transition: some models, e.g. Gemma‑3‑12B, jumped from 34% to 98% ASR at alpha = 1.0, showing a binary collapse of guardrails; very large positive steering can degrade coherence.
- Bidirectional control and detection: inverting the steering vector (negative alpha) reduces ASR to single digits for some models and can improve coherence; a global Refusal Vector classifier achieved about 0.92 F1 for zero‑shot jailbreak detection on Llama‑3.1.
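The 900× figure in the comparison above follows directly from the two rounded timings quoted, about 15 minutes for the optimizer and about 1 second for CLS. This is a sanity check on the stated numbers, not an independent measurement:

```latex
\frac{15\ \text{min} \times 60\ \text{s/min}}{1\ \text{s}} = \frac{900\ \text{s}}{1\ \text{s}} = 900
```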
Limitations
CLS requires white‑box access to logits, so its applicability is limited to open‑weights models or interfaces that expose logits. The steering coefficient alpha is sensitive and must be tuned. Experiments focus on single‑turn interactions and do not fully disentangle architectural differences from training or safety‑patching interventions. Comparisons to some other steering techniques remain qualitative.
Implications
Offensively, CLS demonstrates a practical attack surface. An adversary with logit access can compute a Refusal Vector from contrasting system prompts, apply prefix injection to evade initial refusals, and then modulate logits to bypass guardrails in real time, achieving high ASR on susceptible architectures. Architectures and training regimes that defer refusal to the output layer make models particularly vulnerable. Defensively, the same mechanism offers immediate mitigation through negative steering at inference and enables lightweight detection via projection onto the Refusal Vector. Neither removes the underlying vulnerability if safety remains a separable linear feature.