
Spot LLM jailbreaks by modelling framing, not goals

Defenses
Published: Tue, Feb 24, 2026 • By James Armitage
New research argues the right place to catch jailbreaks is in how prompts are framed, not in the supposed intent. A small module disentangles “goal” from “framing” inside Large Language Model activations, then an anomaly detector flags odd framing. It improves cross-model detection with low overhead, though it needs curated training pairs.

Security teams keep trying to read a model’s mind. We hunt for intent, blacklist keywords, and hope alignment training will hold. Attackers, meanwhile, simply change the wrapping. They present the same harmful goal as a translation task, a roleplay, or a hypothetical, and most detectors wave it through. This paper argues we have been staring at the wrong thing. Stop guessing the goal. Model the framing.

The authors study goal-preserving framing attacks, where the malicious aim stays constant but the wording is re-skinned to coax compliance. Their bet is that a Large Language Model (LLM) carries separable signals for what is being asked (the goal) and how it is being asked (the framing). If you can tease those apart inside the network’s activations, you can detect framing that looks nothing like your normal traffic without trying to enumerate every bad objective on earth.

They introduce a small, trainable head that sits on top of a frozen LLM layer and learns two representations: one for goal, one for framing. The training is self-supervised using paired prompts with controlled variations, pulled together in a corpus they call GoalFrameBench. It includes 6,269 seed prompts and tens of thousands of positive pairs that vary just one factor at a time. The loss mixes contrastive terms to ensure each head captures its factor, an orthogonality penalty to reduce leakage, and a reconstruction term to avoid throwing away useful information. They provide theoretical support that, under coverage assumptions, this setup recovers the factors well enough to be useful.
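A minimal sketch of what such a composite objective could look like, assuming paired batches where one factor is held fixed at a time. The function names and the exact form of the orthogonality penalty are illustrative reconstructions from the description above, not the paper's code:

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """Contrastive InfoNCE: pull each anchor toward its own positive,
    push it away from the other examples in the batch."""
    a = anchor / np.linalg.norm(anchor, axis=-1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=-1, keepdims=True)
    logits = a @ p.T / temperature                      # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # diagonal = true pairs

def redact_loss(goal_head, frame_head, decode, acts,
                acts_goal_pair, acts_frame_pair,
                lam_orth=0.1, lam_rec=1.0):
    """Composite loss over one batch of frozen-LLM activations:
    sufficiency (contrastive) + leakage control + completeness."""
    g, f = goal_head(acts), frame_head(acts)
    # Sufficiency: same goal, varied framing -> goal head should agree.
    l_goal = info_nce(g, goal_head(acts_goal_pair))
    # Sufficiency: same framing, varied goal -> framing head should agree.
    l_frame = info_nce(f, frame_head(acts_frame_pair))
    # Leakage control: penalise cross-covariance between the two heads.
    g_c, f_c = g - g.mean(0), f - f.mean(0)
    l_orth = np.mean((g_c.T @ f_c) ** 2)
    # Completeness: the two parts together should reconstruct the activation.
    l_rec = np.mean((decode(g, f) - acts) ** 2)
    return l_goal + l_frame + lam_orth * l_orth + lam_rec * l_rec
```

The point of the three terms is visible in the structure: the contrastive losses make each head track its own factor, the cross-covariance term stops the framing head from smuggling goal information, and the reconstruction term stops both heads from collapsing to trivial representations.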

On top of that, they build FrameShield, an anomaly detector that operates only on the framing representation. You collect a benign reference distribution, whiten and project to keep the key coordinates, then score residual energy; prompts above the 95th percentile are flagged. Two operational variants show up in the results: using the last layer, or picking a “critical” layer by calibration where separation is strongest. The detector is lightweight at inference and, crucially, it generalises across unseen goals because it is not looking for goal signatures in the first place.

The uncomfortable shift: framing over intent

This is the right instinct. Organisations cannot maintain reliable lists of “forbidden goals” or hope that policy text will bind a model under pressure. Framing, by contrast, is a distribution you can actually observe in your own environment. If your coding assistant normally sees crisp bug reports and terse diffs, a sudden spike of baroque roleplay setups and contrived hypotheticals stands out, even if the explicit task is “just translate this”. The paper shows improved, model-agnostic detection on multiple LLM families compared with representation-level baselines. That is not hype; it is a practical route away from brittle heuristics.

There are caveats. The method currently handles a binary pair (goal, framing) and assumes your training pairs cover the space of interest. Control of leakage between heads needs tuning, and critical-layer selection does not always help if training struggles on certain layers. The full corpus is not yet public. It also operates on frozen models; no end-to-end finetuning is explored.

What to do with it

If you run LLMs in production, treat this as a sidecar monitor. Calibrate FrameShield per model on benign prompts from your own workflows and consider a critical-layer pass rather than defaulting to the last layer. Use the 95th percentile threshold as a starting point, not gospel, and route flagged prompts to stricter policy checks or human review. Because the detector is cheap, you can run it continuously without blowing your latency budget.

My view: this reframing of the problem is overdue. Intent is a moving target; framing is a measurable signal. Disentanglement is not a silver bullet and will not catch every jailbreak, but it forces the defence to live where the attacks actually operate: in the presentation layer the model responds to. That is a cleaner, more defensible position than another round of keyword lists and wishful thinking.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Authors: Amirhossein Farzam, Majid Behabahani, Mani Malek, Yuriy Nevmyvaka, and Guillermo Sapiro
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging failure mode occurs when an attacker tries to hide the malicious goal of their request by manipulating its framing to induce compliance. Because these attacks maintain malicious intent through a flexible presentation, defenses that rely on structural artifacts or goal-specific signatures can fail. Motivated by this, we introduce a self-supervised framework for disentangling semantic factor pairs in LLM activations at inference. We instantiate the framework for goal and framing and construct GoalFrameBench, a corpus of prompts with controlled goal and framing variations, which we use to train Representation Disentanglement on Activations (ReDAct) module to extract disentangled representations in a frozen LLM. We then propose FrameShield, an anomaly detector operating on the framing representations, which improves model-agnostic detection across multiple LLM families with minimal computational overhead. Theoretical guarantees for ReDAct and extensive empirical validations show that its disentanglement effectively powers FrameShield. Finally, we use disentanglement as an interpretability probe, revealing distinct profiles for goal and framing signals and positioning semantic disentanglement as a building block for both LLM safety and mechanistic interpretability.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies a class of jailbreaks that preserve a malicious goal while manipulating the linguistic presentation or framing to induce a model to comply. Such goal-preserving framing attacks are fluent and semantically coherent, so they evade surface heuristics and goal-specific signatures. Because the same goal can be presented in many plausible framings, existing defences that rely on structural artefacts or fixed signatures often fail. The paper argues that separating the latent goal from the framing signal inside model activations offers a principled path to detect these attacks.

Approach

The authors introduce a self-supervised framework to disentangle pairs of semantic factors from frozen LLM activations and instantiate it for goal and framing. They create GoalFrameBench, a corpus of 6,269 prompts produced from harmful and benign seeds and augmentations, yielding 86,824 positive goal pairs and 89,419 positive framing pairs. Representation Disentanglement on Activations (ReDAct) is a lightweight two-headed module attached to a frozen LLM layer that is trained with a composite objective: contrastive InfoNCE losses to ensure sufficiency, an orthogonality penalty to control leakage between heads, and a reconstruction term to preserve completeness. Theoretical results show the paired-data construction suffices to recover empirical marginals under simple coverage assumptions and that the objective yields factor sufficiency asymptotically. FrameShield performs anomaly detection on the disentangled framing representation by building a benign reference distribution, whitening and projecting to retain principal coordinates, then scoring residual energy; a 95th percentile threshold flags anomalies. Two variants are evaluated: FrameShield-Last (final layer) and FrameShield-Crit (critical layer selected by calibration for maximal separation).

Key Findings

  • ReDAct yields disentangled goal and framing representations: ANOVA effect sizes show each head associates more strongly with its intended factor while reducing cross-factor signals.
  • FrameShield improves model-independent detection of goal-preserving framing (GPF) jailbreaks across multiple LLM families and outperforms or complements existing representation-level defences in most evaluations, including comparisons with a state-of-the-art baseline.
  • Critical-layer selection often enhances discrimination versus using the last layer; goal and framing signals concentrate at different network depths and become more separable in deeper layers.
  • FrameShield generalises across unseen goals because it relies on framing distributions rather than goal-specific patterns, and shows strong in-distribution and out-of-distribution detection in reported splits.
  • The method is computationally lightweight at inference and practical to train on frozen models: ReDAct training on one layer takes hours on an H100 for mid-sized models; full multi-layer training requires more GPU time.

Limitations

The pipeline focuses on binary factor pairs (goal, framing) and assumes factor coverage in the paired dataset; extending to richer multi-factor decompositions is future work. Controlled leakage must be tuned so representations remain useful for downstream detection. The full GoalFrameBench corpus is not released with the preprint. Critical-layer selection did not improve results for some models due to ReDAct convergence issues on subsets of layers. The approach operates on frozen weights and does not explore end-to-end finetuning of the base model.

Why It Matters

Disentangling goal versus framing provides a mechanistic defence that targets the core mechanism of sophisticated jailbreaks that hide malicious intent through presentation. FrameShield offers a model-agnostic, low-overhead anomaly detector that can be integrated into monitoring pipelines and scales across LLM families. The representation-level approach also serves as an interpretability probe, revealing how semantic factors are organised across layers and suggesting paths to improve safety, auditing and policy enforcement. Future directions include multi-factor disentanglement and combining disentanglement with end-to-end alignment methods.

