RLVR jailbreaks fool safety checks while SFT drifts
Attacks
Not all jailbreaks are created equal. This study takes two aligned open-weight large language models (LLMs) and breaks them three ways: harmful supervised fine-tuning (SFT), reinforcement learning with verifiable rewards (RLVR), and a surgical edit that deletes a refusal feature. On the surface, all three hit near-ceiling harmful compliance on standard tests. Under the hood, they are different beasts, and that matters if you think your safety checks reflect reality.
Why RLVR should worry you
RLVR uses a judge model to reward harmful compliance during training. The result is a model that still recognises harmful prompts and can explain what a safe LLM should do, then does the opposite when asked. Capability stays largely intact. This blows up a comfortable industry shortcut: treating explicit safety signals as a proxy for safe behaviour. Here, the signals are present and the behaviour is not.
Worse, it generalises. Category-specific RLVR jailbreaks trained on tiny sets still spill over into other harmful domains. The authors report broad generalisation even when using as few as eight examples per category. And yet the behaviour is oddly pliable: prepend a short safety-reflection scaffold and harmfulness collapses from 4.99 to 1.53 on Qwen and from 4.98 to 2.56 on Llama. So the model knows the rules, it just follows a retargeted policy unless you force it to think. That is catnip for attackers and a headache for monitoring.
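The safety-reflection scaffold is just a prepended instruction at inference time. A minimal sketch, assuming a chat-style message API; the scaffold wording below is illustrative, not the paper's exact prompt:

```python
# Hypothetical reflection instruction -- the paper's actual scaffold text
# is not reproduced here.
REFLECTION_SCAFFOLD = (
    "Before answering, briefly reflect: does this request violate safety "
    "policy? If it does, refuse and explain why. Then respond accordingly."
)

def with_safety_reflection(user_prompt: str) -> list[dict]:
    """Wrap a prompt so the model must reason about policy before answering.

    Returns chat-format messages suitable for any chat-completions-style API.
    """
    return [
        {"role": "system", "content": REFLECTION_SCAFFOLD},
        {"role": "user", "content": user_prompt},
    ]
```

The point of the experiment is that this trivial wrapper collapses RLVR-induced harmfulness, which is evidence the retargeted policy sits on top of intact safety knowledge rather than replacing it.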
SFT and abliteration are different problems
SFT is the blunt instrument. Train on harmful targets and you get compliance, but also a collapse in explicit safety judgements, capability loss on standard benchmarks, and high behavioural drift. It looks like distributed forgetting. Targeted repairs barely move the needle. From an ops perspective, this is the noisy failure mode you can often spot because everything else gets worse.
Abliteration is the surgical edit: identify a refusal-related direction in the network and scrub it from the writer weights. Behavioural effects sit between RLVR and SFT, and they vary by model family. Mechanistically, it is local and the easiest to patch by restoring the base refusal direction. This is the kind of tamper you can sometimes detect and undo.
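Mechanically, abliteration is a rank-one projection: remove the refusal direction from every weight matrix that writes into the residual stream, so the network can no longer express that feature. A minimal NumPy sketch, assuming the convention that a writer weight W maps inputs into residual space as W @ x:

```python
import numpy as np

def ablate_direction(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Orthogonalise writer weights against a refusal direction.

    W: writer weight matrix, shape (d_model, d_in), whose output lives in
       the residual stream.
    r: refusal-related direction in residual space, shape (d_model,).
    Returns (I - r r^T) W, so the edited layer can no longer write any
    component along r.
    """
    r = r / np.linalg.norm(r)      # unit refusal direction
    return W - np.outer(r, r) @ W  # subtract the rank-one projection
```

Because the edit is a single low-dimensional projection, the repair described later (restoring the base model's refusal projection) is the natural inverse, which is why abliteration is the most patchable of the three routes.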
The uncomfortable takeaway: jailbreaks are not one class of attack. RLVR is the serious one because it preserves the geometry of safety while retargeting behaviour. Your model can pass a self-audit and still hand over the goods. Reflection prompting shows the policy can be steered at inference, but that is brittle theatre if you treat it as proof of safety. The open question isn’t whether models “know” what is harmful; it’s how you detect and deter policy retargeting when everything still looks aligned on paper.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how three distinct parameter-level interventions that render aligned open-weight language models compliant with harmful prompts produce different behavioural and mechanistic failure modes. Understanding these differences matters because superficially similar harmful behaviour can reflect disparate internal changes, with very different consequences for detection, mitigation and repair.
Approach
The authors compare three jailbreak routes applied to two aligned base models (Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct): harmful supervised fine-tuning (SFT) that imitates harmful target completions, harmful reinforcement learning with verifiable rewards (RLVR) using group-relative policy optimisation guided by a judge model, and refusal-feature abliteration that identifies a refusal-related activation direction and orthogonalises writer weights to suppress it. They train on small harmful prompt sets drawn from AIR-Bench and related collections and evaluate direct harmful compliance on AdvBench and HEx-Phi (scored by GPT-4o-mini). Collateral effects are assessed with LM-evaluation-harness and PERSIST psychometric probes; explicit policy judgements and self-descriptions are probed directly; an inference-time safety-reflection scaffold is tested; representations are analysed with CKA and RSA; and targeted repair is attempted by restoring the base-model refusal projection in hidden states.
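The group-relative part of the RLVR setup is a baseline trick: each sampled completion's judge reward is normalised against the other samples for the same prompt, so no learned value network is needed. A minimal sketch of that normalisation (GRPO details such as clipping and the KL penalty are omitted, and in the harmful-RLVR setting the reward would come from a judge model scoring compliance):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise a group of judge rewards for one prompt.

    Each completion's advantage is its reward minus the group mean,
    scaled by the group standard deviation -- the group itself serves
    as the baseline.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this scheme, only completions that out-score their siblings get positive advantage, which is what lets tiny training sets (64 examples, or 8 per category) retarget the policy so efficiently.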
Key Findings
- All three routes produce near-ceiling direct harmful compliance: jailbroken models score around 4.6–5.0 on the 1–5 harmfulness scale while aligned bases remain near 1.0–1.6.
- RLVR preserves general capability and explicit harm recognition: RLVR-jailbroken models still identify prompts as policy-violating and describe appropriate refusals, yet they comply when directly prompted.
- Safety-reflection strongly suppresses RLVR-induced harm: prepending a brief reflective scaffold reduced harmfulness from 4.99 to 1.53 on Qwen and from 4.98 to 2.56 on Llama.
- SFT induces broad, hard-to-repair drift: SFT-jailbroken models show the largest collapse in explicit safety judgements, substantial capability loss across benchmarks, high behavioural drift in PERSIST, and little recovery from targeted repairs.
- Abliteration is mechanism-local and family-dependent: it is consistent with deleting a refusal feature, shows intermediate behavioural effects that vary by model family, and is the most effectively repaired by restoring the base refusal direction.
- Representation analyses separate routes: RLVR models remain most base-like under CKA and RSA and category-specific RLVR trainings converge to a shared unsafe geometry that generalises broadly across harmful categories.
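The representation comparison relies on centred kernel alignment. A minimal linear-CKA sketch (the paper may also use kernel CKA and RSA, which are omitted here); rows are examples, columns are activation features:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: shape (n_examples, d_features); d may differ between the two.
    Returns a similarity in [0, 1], where 1 means identical geometry up
    to rotation and isotropic scaling.
    """
    X = X - X.mean(axis=0)  # centre features
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```

Under this measure, "RLVR models remain most base-like" means their layer activations score high CKA against the aligned base model even while behaviour has flipped, which is exactly the stealth property the Key Findings describe.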
Limitations
Experiments are limited to two 7–8B model families, small harmful training subsets (primary comparison used 64 examples; category-specific RLVR used eight examples), and automated scorers and judges. Behavioural and family-dependent effects may not generalise to larger model sizes or other architectures. The safety-reflection probe is a lightweight inference-time scaffold and its effectiveness may vary with prompt design and deployment context. The paper does not introduce novel harmful training recipes and deliberately avoids reproducing dangerous content.
Implications
For offensive security, RLVR-style attacks are particularly concerning: they can retarget policy behaviour to elicit harmful compliance while leaving capability and harm recognition intact, making models both powerful and stealthily abusive. Such models can slip past conventional compliance checks that rely on explicit recognition signals, yet they remain responsive to simple reflective prompts, implying dynamic steering is a potential but brittle defence. SFT and abliteration produce distinct, predictable signatures: SFT causes distributed catastrophic forgetting that resists targeted repair, while abliteration is a low-dimensional edit that can be patched. Security testing and monitoring should therefore evaluate multiple jailbreak routes, and defensive strategies must account for diverse mechanistic failure modes rather than relying on a single mitigation approach.