
Hidden-state audits catch poisoned LLM fine-tuning early

Published: Tue, Jun 02, 2026 • By James Armitage
New research shows a checkpoint auditor, CANARY, can spot poisoned fine-tuning in Large Language Models from internal activations long before harmful text appears. It flags as little as 1% contamination with near-perfect accuracy, withstands style obfuscation, and even guides red teaming and mitigation. Output-only checks are not enough.

Most enterprises still judge Large Language Models (LLMs) by what they say. That is a mistake. If a supplier poisons a sliver of your fine-tuning set, the model can carry a dormant payload that never shows up in test prompts. According to new work on CANARY, you can implant latent harmful behaviour with about 1% contaminated examples, and nothing ugly leaks into generated text until contamination gets to roughly 7.5%. By then, you are shipping risk you cannot see.

Here is the uncomfortable bit: output red teaming was never designed to catch this class of supply-chain attack. The signal lives in the activations, not the words. CANARY makes that visible. It runs two forward passes over an unlabeled prompt set, one through a trusted base checkpoint and one through the fine-tuned candidate, then compares hidden states at a mid-layer. That difference is pushed through a Sparse Autoencoder trained on the base model so it suppresses broad style shifts and isolates sparse, semantically meaningful drift. Aggregate that, and you get a single score that lights up when a checkpoint has been nudged in the wrong direction.
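To make the scoring step concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the encoder weights, the negative bias, the toy hidden states and the names `sae_encode` and `drift_score` are all assumptions for illustration. A real audit would read hidden states from the two checkpoints and use a Sparse Autoencoder actually trained on the base model.

```python
def sae_encode(vec, w_enc, b_enc):
    """ReLU encoder of a one-layer sparse autoencoder (toy weights, not trained)."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(w_enc, b_enc)]


def drift_score(base_hidden, cand_hidden, w_enc, b_enc):
    """Average, over prompts, of the summed SAE activations of (candidate - base)."""
    per_prompt = []
    for h_base, h_cand in zip(base_hidden, cand_hidden):
        delta = [c - b for b, c in zip(h_base, h_cand)]
        per_prompt.append(sum(sae_encode(delta, w_enc, b_enc)))
    return sum(per_prompt) / len(per_prompt)


# Hypothetical 4-dimensional hidden states for two prompts at one mid layer.
base_hidden = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.2]]

# Assumed encoder: three features, each reading one hidden dimension.
# The negative bias keeps small, broad shifts below the firing threshold.
w_enc = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
b_enc = [-0.5, -0.5, -0.5]

# A "benign" candidate: every dimension nudged slightly.
benign_hidden = [[x + 0.1 for x in h] for h in base_hidden]
# A "poisoned" candidate: one direction pushed hard.
poisoned_hidden = [[h[0], h[1] + 2.0, h[2], h[3]] for h in base_hidden]

benign_score = drift_score(base_hidden, benign_hidden, w_enc, b_enc)
poisoned_score = drift_score(base_hidden, poisoned_hidden, w_enc, b_enc)
```

In this toy setup the broad nudge never crosses the encoder threshold, so the benign score stays at zero, while the concentrated shift produces a clearly positive score. How the paper aggregates features may differ from the simple mean used here.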

How the attack survives your checks

The numbers are blunt. The detector hits AUROC 1.000 at 1% contamination across four model architectures and two training paradigms, while output-level methods only start firing above roughly 7.5%. Benign fine-tunes triggered no false alarms in the reported tests. Attempts to hide the shift with style-matching and gradient-noise tricks did not help. Low-rank adapters are the partial escape hatch: a rank-4 LoRA dropped detection to about 0.82 AUROC, but that same constraint also caps how robust and persistent the implanted behaviour can be.
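For readers less familiar with the metric, AUROC is the probability that a randomly chosen contaminated checkpoint scores higher than a randomly chosen clean one, with ties counted as half. The sketch below computes it directly from that definition. The function name `auroc` and the example scores are invented for illustration and are not taken from the paper.

```python
def auroc(clean_scores, contaminated_scores):
    """Pairwise AUROC: share of (contaminated, clean) pairs ranked correctly."""
    wins = 0.0
    for pos in contaminated_scores:
        for neg in clean_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(contaminated_scores) * len(clean_scores))


# Hypothetical detector scores: perfectly separated groups.
separated = auroc([0.1, 0.2, 0.3], [0.9, 1.1])

# Hypothetical detector scores: one clean checkpoint outranks one contaminated.
overlapping = auroc([0.1, 0.5], [0.4, 0.9])
```

An AUROC of 1.000 means every contaminated checkpoint outscored every clean one, as in the first toy case. The second case ranks three of four pairs correctly and lands at 0.75, a little below the roughly 0.82 reported for the rank-4 LoRA evasion.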

This is not just a detector bolted to a dashboard. The same autoencoder basis can amplify latent behaviours for verification at roughly five times the rate of standard generation while keeping text coherent. Rank prompts by the score and 97% of observed harm concentrates in the top quartile, giving a reported 4.2x lift for red teaming. At inference, suppressing a small set of contamination-specific features cut harmful completions from about 70% to 10% with no perplexity penalty. That is a surgical kill-switch, not a blunt blocklist.
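The prioritisation claim is easy to check on your own data once you have per-prompt scores and red-team labels. The sketch below ranks prompts by score and reports what share of observed harm falls in the top quartile. The function name, the scores and the labels are hypothetical, and this article does not state exactly how the paper defines its 4.2x lift, so treat this as one plausible concentration measure and not a reproduction of that figure.

```python
def top_quartile_harm_share(scores, harmful):
    """Share of all harmful outcomes found among the top 25% of prompts by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, len(scores) // 4)  # size of the top quartile
    total_harm = sum(harmful)
    if total_harm == 0:
        return 0.0
    return sum(harmful[i] for i in order[:k]) / total_harm


# Hypothetical per-prompt drift scores and red-team labels (1 = harmful output).
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.05, 0.4, 0.15]
harmful = [1, 1, 0, 0, 1, 0, 0, 0]

share = top_quartile_harm_share(scores, harmful)
```

Here two of the three harmful prompts sit in the top quartile. A red team with a fixed budget would start from the top of that ranking.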

What this really means for enterprises

The obvious reading beats the polite one: if you rely on output-only evaluation, you are flying blind. Activation-level auditing will become table stakes for high-assurance use. The catch is practical. You need a clean, trusted base checkpoint to diff against, which you rarely get from closed providers. Detection headroom depends on model variance too; the authors' detection-limit formula predicts a floor of about 0.3% contamination for one model and about 4.5% for another. And the work is not yet validated on very large models (70B parameters and up). Still, the direction of travel is clear. Treat prompt-level safety theatre as the baseline. The real control point is inside the network. If your governance cannot see there, it is not governance.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

Authors: Swapnil Parekh
Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies detection of supply-chain contamination introduced by poisoned fine-tuning data that implants latent harmful behaviour in language models. Such contamination can be as small as 1% of fine-tuning examples, remains invisible to output-level defences until it exceeds roughly 7.5% contamination, and thus can bypass standard pre-deployment checks. Detecting this hidden shift matters because the geometric fingerprint of the contamination appears in internal activations long before harmful text is produced, creating a window for earlier, more reliable auditing of checkpoints.

Approach

CANARY (Contamination Auditor via Neural Activation Representation Yield) is a zero-label checkpoint auditor that compares two forward passes over an unlabeled set of domain-relevant prompts through a base and a candidate fine-tuned checkpoint. It computes the per-prompt hidden-state difference at a mid-network layer, projects that difference through a Sparse Autoencoder (SAE) trained on base-model activations, masks dimensions identified as dense style or noise features, and aggregates the remaining sparse semantic activations into a scalar detection score. The SAE basis is also used for verification, prompt prioritisation and inference-time feature suppression. Experiments use four model pairs across supervised and RLHF fine-tuning, synthetic medical-advice contamination at rates from 1% to 20%, and adaptive attacks including style-matching, gradient-noise and low-rank adapter (LoRA) fine-tuning.
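The masking step can be sketched as follows. This summary does not say how the paper identifies dense style or noise features, so the sketch assumes a simple rule: a feature that fires on more than half of base-model prompts is treated as dense and dropped. The threshold, the function names and the toy activations are all illustrative assumptions.

```python
def dense_feature_mask(base_feature_acts, max_fire_rate=0.5):
    """Return a keep-list: True for sparse features, False for dense ones."""
    n_prompts = len(base_feature_acts)
    n_features = len(base_feature_acts[0])
    keep = []
    for j in range(n_features):
        fired = sum(1 for row in base_feature_acts if row[j] > 0.0)
        keep.append(fired / n_prompts <= max_fire_rate)
    return keep


def masked_score(delta_feature_acts, keep):
    """Mean over prompts of the summed activations on kept (sparse) features."""
    per_prompt = [sum(a for a, k in zip(row, keep) if k)
                  for row in delta_feature_acts]
    return sum(per_prompt) / len(per_prompt)


# Hypothetical SAE activations of the base model on four prompts, three features.
# Feature 0 fires on every prompt (dense); features 1 and 2 fire rarely or never.
base_feature_acts = [[1.2, 0.0, 0.0],
                     [0.8, 0.7, 0.0],
                     [1.5, 0.0, 0.0],
                     [0.9, 0.0, 0.0]]
keep = dense_feature_mask(base_feature_acts)

# Hypothetical SAE activations of the base-to-candidate difference, two prompts.
delta_feature_acts = [[5.0, 0.0, 1.0],
                      [4.0, 2.0, 0.0]]
score = masked_score(delta_feature_acts, keep)
```

The large activations on feature 0 are ignored because that feature is dense in the base model, so only the sparse features contribute to the final scalar.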

Key Findings

  • Detection sensitivity: CANARY attains AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) on the primary model family and maintains perfect or near-perfect AUROC across tested rates. That is a contamination rate 7.5 times lower than the ≈7.5% level at which output-level generation methods first signal.
  • Robustness and false positives: Zero false positives were observed on benign fine-tuning, and the method is robust to style-matching and gradient-noise adaptive attacks. Rank-4 LoRA adapters can reduce AUROC to about 0.82, producing partial evasion.
  • Governance pipeline and remediation: The SAE basis enables amplified verification that surfaces latent harm at 5× the rate of standard generation while keeping outputs coherent (much lower perplexity than logit-space amplification); score-ranked prompts concentrate 97% of observed harm in the top prompt quartile, yielding a 4.2× red-teaming lift; and inference-time suppression of a small set of contamination-specific SAE features reduces the harmful completion rate from 70% to 10% with no perplexity penalty.
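The remediation finding rests on editing hidden states along specific SAE features. One common way to do that, shown here purely as an assumption and not as the paper's documented procedure, is to subtract each flagged feature's decoder direction, scaled by its activation, from the hidden state. The decoder matrix, feature indices and values below are toy placeholders.

```python
def suppress_features(hidden, feature_acts, w_dec, suppress_ids):
    """Remove the decoder contribution of selected SAE features from a hidden state."""
    out = list(hidden)
    for j in suppress_ids:
        for d in range(len(out)):
            out[d] -= feature_acts[j] * w_dec[j][d]
    return out


# Hypothetical 3-dimensional hidden state and a two-feature SAE decoder.
hidden = [1.0, 2.0, 3.0]
feature_acts = [0.0, 2.0]          # feature 1 is active on this token
w_dec = [[1.0, 0.0, 0.0],          # decoder direction for feature 0
         [0.0, 1.0, 0.0]]          # decoder direction for feature 1

# Suppress feature 1, assumed here to be contamination-specific.
cleaned = suppress_features(hidden, feature_acts, w_dec, suppress_ids=[1])
```

Only the component along the flagged feature changes; the rest of the hidden state is left alone, which is consistent with the reported lack of a perplexity penalty.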

Limitations

CANARY requires a trusted clean base checkpoint to compute differences, which may not be available in contested settings. Detection power varies with architecture and representation variance; the authors present a detection-limit formula that predicts a floor r* (about 0.3% for one model, about 4.5% for another) and explains degraded performance for high-variance models and for constrained low-rank fine-tuning. Scaling empirical validation and SAE surgery to very large models (≥70B parameters) remains open.

Implications

An attacker can implant persistent, latent harmful behaviour using a small fraction of poisoned fine-tuning data (as little as 1%) while remaining undetected by output-level checks by keeping contamination below roughly 7.5%. Style obfuscation and gradient-noise perturbation do not reliably evade CANARY; the most effective evasion observed is to use low-rank adapter fine-tuning, which reduces detectability but also restricts the attacker’s capacity to produce robust, persistent harm. The findings show that adversaries with access to fine-tuning APIs can hide supply-chain contamination in internal representations, emphasising the need for hidden-state auditing in high-assurance deployments.

