
MTK Detects LLM Jailbreaks via Layer-Neighbourhood Shifts

Published: Mon, Jun 08, 2026 • By Marcus Halden
Manifold Trajectory Kinetics (MTK) detects Large Language Model (LLM) jailbreaks by tracking how a prompt’s neighbourhood changes across layers rather than relying on single-layer features. It reports a mean AUROC of 0.94 across four LLMs and ten attacks, keeps false positives low on pseudo‑malicious queries, holds a true positive rate (TPR) of about 85% under adaptive evasion, and transfers to vision‑language models.

Jailbreak prompts keep slipping past alignment checks in Large Language Models (LLMs). Most runtime detectors stare at a single snapshot — the raw prompt, gradients, or one layer’s features — and hope benign and malicious inputs separate cleanly there. In practice, that hope breaks in two places: benign queries that contain safety-related terms trigger false alarms, and defence-aware attackers optimise directly against whatever metric the detector uses.

How MTK works

Manifold Trajectory Kinetics (MTK) takes a delightfully different view: treat the model as a kinetic system and watch how a prompt moves through representation space across layers. The authors build a reference bank of anchor prompts, some benign and some malicious, then extract the final-token hidden state at every layer for these anchors and for the test input. At each layer, MTK records how highly the test input's nearest benign anchors rank among its neighbours (the mean rank over the k nearest benign anchors, with k=10 by default), producing a trajectory of ranks from the input layer to the top. An isolation forest trained only on benign trajectories flags anomalies, with a decision threshold set to hit a target false positive rate. The authors also seed the benign bank with a small slice of “pseudo‑malicious” prompts so the detector learns not to overreact to safety‑related wording. The method needs no jailbreak examples for training (“zero‑jailbreak‑data”) and is model‑agnostic across transformer LLMs and vision‑language models.
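To make the rank trajectory concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names are hypothetical, Euclidean distance is an assumption (the article does not name the metric), and ranking the benign anchors within the combined benign-plus-malicious bank is one plausible reading of the description. Toy 2-D vectors stand in for real per-layer hidden states.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def layer_rank(test_vec, benign_vecs, malicious_vecs, k=10):
    """Mean 1-based rank of the k nearest benign anchors within the whole bank."""
    # Tag every anchor as benign (True) or malicious (False), then sort by distance.
    bank = [(euclidean(test_vec, v), True) for v in benign_vecs]
    bank += [(euclidean(test_vec, v), False) for v in malicious_vecs]
    bank.sort(key=lambda pair: pair[0])
    # Positions of the benign anchors in the sorted bank, nearest first.
    benign_ranks = [i + 1 for i, (_, is_benign) in enumerate(bank) if is_benign]
    nearest = benign_ranks[:k]
    return sum(nearest) / len(nearest)

def rank_trajectory(test_layers, benign_layers, malicious_layers, k=10):
    """One mean rank per layer; each argument is indexed by layer first."""
    return [layer_rank(t, b, m, k)
            for t, b, m in zip(test_layers, benign_layers, malicious_layers)]

# Toy bank: two layers, two benign and two malicious anchors per layer.
benign_layers = [[(0.0, 0.0), (0.0, 1.0)], [(0.0, 0.0), (0.0, 1.0)]]
malicious_layers = [[(5.0, 5.0), (5.0, 6.0)], [(5.0, 5.0), (5.0, 6.0)]]

# A benign-looking input stays near the benign anchors at both layers.
benign_input = [(0.0, 0.2), (0.1, 0.1)]
# A jailbreak-like input starts near the malicious anchors, then drifts to benign.
drifting_input = [(5.0, 5.2), (0.2, 0.0)]

print(rank_trajectory(benign_input, benign_layers, malicious_layers, k=2))    # [1.5, 1.5]
print(rank_trajectory(drifting_input, benign_layers, malicious_layers, k=2))  # [3.5, 1.5]
```

In this toy case the benign input keeps a flat, low trajectory, while the drifting input shows a high early rank that falls later, which is the kind of shape the anomaly detector is meant to pick up.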

Why fixed-space detectors fail

Static detectors invite attackers to play the metric. If you know the classifier looks at a particular layer or input features, you optimise there. MTK makes that game harder because it watches a path, not a point. The team observes a consistent pattern: benign prompts stay close to benign neighbourhoods across layers, while jailbreaks often start near malicious seeds and then steer towards benign regions late in the forward pass to dodge refusals. That layered drift is the tell.

Results back the story. Across four LLMs and ten jailbreak attacks, MTK posts a mean AUROC of 0.94 and ranks best in 31 of 40 model–attack cases. On Llama2‑7B, it retains a 95% true positive rate at 5% false positives on ordinary benign prompts and 2% on pseudo‑malicious ones. Under defence‑aware adaptive optimisation, it keeps true positives around 85% and holds the joint jailbreak‑and‑evasion rate below 0.2, where prior static detectors saw effective attack success rates between 0.6 and 0.88. The approach also transfers to vision‑language models, with mean AUROC 0.94 on one and 0.924 on another, and low false positives on multimodal pseudo‑malicious samples (mean 0.044, precision 0.90). Practicalities look reasonable: lightweight anomaly training, competitive per‑sample latency, robustness to hyperparameters, and performance stabilising around 800 anchors per class.

Limits remain. MTK needs access to intermediate activations and a maintained reference bank. In theory, a stronger attacker could sculpt activations to hug benign neighbourhoods at every layer, collapsing the signal. But as a scientific move, watching the manifold trajectory rather than a single view is elegant — and, for now, robust where static detectors stumble.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

Authors: Hangtao Zhang, Yucheng Zhao, Sishun Liu, Ziqi Zhou, Zeyu Ye, Wei Wan, Minghui Li, Shengshan Hu, Yanjun Zhang, Yi Liu, and Leo Yu Zhang
Jailbreak prompts can bypass alignment guardrails in large language models (LLMs) and elicit unsafe outputs, making reliable deployment-time detection critical. Prior detection approaches largely rely on a fixed metric space, e.g., raw inputs, gradients, or hidden features, in which benign and jailbreak prompts are linearly separable. We show this assumption breaks under (i) pseudo-malicious prompts that are benign by intent but contain safety-related keywords, and (ii) adaptive attacks that explicitly optimize against the deployed detector. To overcome this limitation, we shift our focus from identifying a universal metric space to analyzing the more robust neighborhood structure of the underlying data manifold. We present Manifold Trajectory Kinetics (MTK), which treats an LLM as a kinetic system transforming inputs into outputs and detects jailbreaks by tracking how a prompt's neighborhood structure evolves across layers. Benign prompts remain close to benign neighborhoods throughout inference, whereas jailbreak prompts exhibit a characteristic trajectory that begins near malicious seeds and later strategically shifts toward benign neighborhoods to evade refusal. Across four LLMs and ten jailbreak attacks, MTK achieves strong robustness to both failure modes: on pseudo-malicious prompts, it attains a jailbreak true positive rate of 95% at a false positive rate of 5% on benign prompts and 2% on pseudo-malicious prompts, and under adaptive attacks, it maintains a true positive rate of 85%. We further demonstrate the superior performance of MTK for jailbreak detection in vision-language models. Our code is available at https://github.com/Rookie143/mtk.

🔍 ShortSpan Analysis of the Paper

Problem

The paper examines deployment-time detection of jailbreak prompts that bypass alignment guardrails in large language models and produce unsafe outputs. It argues that prior detectors rely on a fixed metric-space view of representations (raw inputs, gradients or single-layer hidden features) and therefore assume benign and malicious prompts are separable in that space. This assumption fails in two practical ways: pseudo-malicious prompts that are benign by intent but contain safety keywords, which cause high false positives, and detector-aware adaptive attacks that explicitly optimise to evade classifiers, which drive true positive rates down. Reliable detection is critical because successful jailbreaks can produce harmful content and because over-refusal on benign-but-sensitive queries harms usability.

Approach

The authors propose Manifold Trajectory Kinetics (MTK), which treats an LLM forward pass as a kinetic transformation on a data manifold and detects jailbreaks by tracking how a prompt's neighbourhood structure evolves across transformer layers. MTK constructs a reference bank of benign and malicious anchor prompts and extracts the final-token hidden state at every layer for each anchor and for the test input. At each layer it computes the rank position of the input's nearest benign neighbours (the mean rank over the k nearest benign anchors, default k=10), producing a layer-wise rank trajectory. An isolation forest trained only on benign trajectories yields an anomaly score, and the decision threshold is calibrated to a target false positive rate. MTK uses no jailbreak data for training, is model-agnostic across transformer-based LLMs and vision-language models, and includes a modest portion of pseudo-malicious prompts in the benign bank to reduce over-refusal.
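The anomaly-scoring step can be sketched as follows. This is a simplified, standard-library isolation forest written for illustration, not the authors' implementation (a real deployment would more likely use an established library). The class `TinyIsolationForest`, the helper `calibrate_threshold`, the synthetic trajectories, and all parameter values are assumptions; calibrating the threshold on held-out benign scores is one straightforward way to target a false positive rate.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random cut point."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    lo, hi = min(r[dim] for r in rows), max(r[dim] for r in rows)
    if lo == hi:  # simplification: stop when the chosen feature is constant here
        return ("leaf", len(rows))
    cut = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < cut]
    right = [r for r in rows if r[dim] >= cut]
    return ("node", dim, cut,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, x, depth=0):
    """Depth at which x lands, plus the expected extra depth of an unsplit leaf."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, cut, left, right = tree
    return path_length(left if x[dim] < cut else right, x, depth + 1)

class TinyIsolationForest:
    def __init__(self, n_trees=100, sample_size=64, seed=0):
        self.n_trees, self.sample_size, self.seed = n_trees, sample_size, seed

    def fit(self, rows):
        rng = random.Random(self.seed)
        self.psi = min(self.sample_size, len(rows))  # needs at least 2 rows
        max_depth = math.ceil(math.log2(self.psi))
        self.trees = [build_tree(rng.sample(rows, self.psi), 0, max_depth, rng)
                      for _ in range(self.n_trees)]
        return self

    def score(self, x):
        """Anomaly score in (0, 1]; shorter average paths give higher scores."""
        mean_path = sum(path_length(t, x) for t in self.trees) / len(self.trees)
        return 2.0 ** (-mean_path / c_factor(self.psi))

def calibrate_threshold(benign_scores, target_fpr=0.05):
    """Pick a cut-off that at most target_fpr of the calibration scores exceed."""
    ordered = sorted(benign_scores)
    allowed = int(target_fpr * len(ordered))  # false positives we tolerate
    return ordered[len(ordered) - 1 - allowed]

# Synthetic benign rank trajectories over 8 layers (made-up numbers).
data_rng = random.Random(1)
def benign_trajectory(n_layers=8):
    return [5.5 + data_rng.gauss(0, 0.5) for _ in range(n_layers)]

train = [benign_trajectory() for _ in range(200)]
held_out = [benign_trajectory() for _ in range(100)]

forest = TinyIsolationForest().fit(train)  # trained on benign trajectories only
benign_scores = [forest.score(t) for t in held_out]
threshold = calibrate_threshold(benign_scores, target_fpr=0.05)

# Jailbreak-like shape: high ranks in early layers, benign-looking ranks late.
suspect = [800.0] * 6 + [5.5] * 2
suspect_score = forest.score(suspect)
print("threshold:", round(threshold, 3), "suspect score:", round(suspect_score, 3))
print("flagged:", suspect_score > threshold)
```

Because the forest only ever sees benign trajectories, nothing in this setup requires jailbreak examples at training time; the target false positive rate is the only operating-point knob.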

Key Findings

  • MTK substantially improves detection robustness: averaged over ten jailbreak attacks on four LLMs, MTK attains a mean AUROC of 0.94 and ranks best in 31 of 40 model-attack settings.
  • Robustness to pseudo-malicious prompts: on Llama2-7B, MTK achieves a jailbreak true positive rate (TPR) of 95% at a false positive rate (FPR) of 5% on ordinary benign prompts and 2% on pseudo-malicious prompts.
  • Resilience to adaptive attacks: under defence-aware adaptive optimisation, MTK maintains an average true positive rate of around 85% and keeps the joint jailbreak-and-evasion rate below 0.2, outperforming prior static detectors, whose effective attack success rate (eASR) ranged from 0.6 to 0.88.
  • Multimodal transfer: MTK transfers to vision-language models with mean AUROC 0.94 on one VLM and 0.924 on another, and shows low false positive rates on multimodal pseudo-malicious samples (mean FPR 0.044, precision 0.90).
  • Practicality: MTK requires only lightweight anomaly training, is among the more efficient detectors in per-sample latency, and is robust to hyperparameter variation and to reference-bank composition and size (performance stabilises around 800 anchors per class).
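For readers less familiar with the metrics quoted above, the sketch below shows how they are typically computed from detector scores. The function names and all numbers are illustrative assumptions, not values or code from the paper. The joint rate is read here as the fraction of attack attempts that both jailbreak the model and evade the detector.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def tpr_at_fpr(pos_scores, neg_scores, target_fpr):
    """TPR at the threshold that lets at most target_fpr of the negatives through."""
    ordered = sorted(neg_scores)
    allowed = int(target_fpr * len(ordered))  # benign prompts we may wrongly flag
    threshold = ordered[len(ordered) - 1 - allowed]
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    return tpr, threshold

def joint_rate(jailbroke, evaded):
    """Fraction of attack attempts that both jailbreak the model and evade detection."""
    return sum(j and e for j, e in zip(jailbroke, evaded)) / len(jailbroke)

# Made-up anomaly scores: higher means "more likely a jailbreak".
jailbreak_scores = [0.9, 0.8, 0.7, 0.4]
benign_scores = [0.1, 0.2, 0.3, 0.5]

print(auroc(jailbreak_scores, benign_scores))             # 0.9375
print(tpr_at_fpr(jailbreak_scores, benign_scores, 0.0))   # (0.75, 0.5)
print(tpr_at_fpr(jailbreak_scores, benign_scores, 0.25))  # (1.0, 0.3)

# Made-up adaptive-attack outcomes for five attempts.
jailbroke = [True, True, False, True, False]
evaded = [True, False, True, False, False]
print(joint_rate(jailbroke, evaded))                      # 0.2
```

The toy numbers show why a joint rate matters: an adaptive attack only counts as a win if the same prompt both elicits the unsafe output and slips past the detector.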

Limitations

MTK assumes jailbreak prompts produce manifold trajectories that are distinguishable from benign ones. While experiments show this holds across many attacks and adaptive surrogates, the method requires access to intermediate-layer outputs and maintenance of a reference bank. The authors acknowledge the theoretical possibility of stronger future attacks that could shape activations to remain within benign neighbourhoods across all layers, which would undermine the detector.

Implications

Offensive implications centre on adaptive optimisation. An attacker with white-box access can augment a jailbreak objective with evasion terms that pull activations toward benign anchors or minimise neighbourhood-rank signals; the paper evaluates three such surrogate losses. Although MTK proved robust to those surrogates, attackers may continue to iterate on surrogate designs that directly target neighbourhood structure or attempt to learn which anchors the defender uses. Practical attack strategies therefore include white-box optimisation of surrogate losses to reduce benign-neighbour ranks across layers and systematic probing to discover stable benign anchors. The work suggests that defenders should expect an arms race where attackers attempt to mimic manifold signatures rather than exploit single-layer features.

