
LODO shows guardrails miss agentic prompt injections

Agents
Published: Wed, Feb 18, 2026 • By Dr. Marcus Halden
New research tests malicious-prompt detectors under real distribution shift using Leave-One-Dataset-Out (LODO). It finds standard benchmarks overstate performance, with an 8.4-point AUC gap and large per-dataset drops. Many features are dataset shortcuts. Guardrails and LLM-as-judge approaches struggle on indirect agent attacks, exposing risks for deployed agents.

Large Language Model (LLM) agents now read emails, parse documents and act on tool outputs. That makes prompt injection and jailbreak detection a frontline control, not a nice-to-have. A new study argues we have been flattering ourselves with benchmarks that keep train and test data too similar, and it shows what changes when you break that habit.

What the authors tested

The team builds a benchmark of 18 datasets covering 105k samples, with 47% malicious examples. It spans harmful requests, jailbreaks, indirect prompt injections and extraction attacks. Instead of the usual random train-test split, they use Leave-One-Dataset-Out (LODO): train on 17 datasets, test on the 18th, rotate, and pool results. That setup measures true out-of-distribution generalisation, which is what matters when an agent meets a new attack style in the wild.
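To make the protocol concrete, here is a minimal sketch of a LODO loop, assuming each dataset has already been reduced to a feature matrix and label vector. The datasets dictionary, probe choice and helper names are illustrative, not the authors' released code:

    # Leave-One-Dataset-Out: hold out each dataset in turn, train on the rest,
    # and pool the held-out predictions before scoring.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    def lodo_auc(datasets):
        """datasets: dict mapping name -> (X, y) feature/label arrays."""
        pooled_scores, pooled_labels, per_dataset = [], [], {}
        for held_out in datasets:
            X_tr = np.vstack([X for n, (X, _) in datasets.items() if n != held_out])
            y_tr = np.concatenate([y for n, (_, y) in datasets.items() if n != held_out])
            X_te, y_te = datasets[held_out]
            probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            scores = probe.predict_proba(X_te)[:, 1]
            per_dataset[held_out] = roc_auc_score(y_te, scores)
            pooled_scores.append(scores)
            pooled_labels.append(y_te)
        pooled = roc_auc_score(np.concatenate(pooled_labels),
                               np.concatenate(pooled_scores))
        return pooled, per_dataset

Each held-out dataset plays the role of an unseen attack distribution; the pooled score is the out-of-distribution number you compare against a conventional random split.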

To keep the focus on evaluation rather than model capacity, they take internal activations from Llama‑3.1‑8B‑Instruct and train lightweight classifiers on top. Probes include logistic regression and a small MLP. They also use Sparse Auto-Encoder (SAE) features, then study which features carry across LODO folds and which collapse when the dataset changes.
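Activation probing of this kind typically looks like the sketch below: run the prompt through the model, take hidden states from one layer, pool over tokens, and fit a linear probe on top. The layer index and mean pooling here are assumptions for illustration, not the paper's exact choices:

    # Extract hidden-state features from Llama-3.1-8B-Instruct, then fit a probe.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    model_id = "meta-llama/Llama-3.1-8B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()

    @torch.no_grad()
    def embed(prompts, layer=-8):                      # layer choice is illustrative
        feats = []
        for p in prompts:
            ids = tok(p, return_tensors="pt", truncation=True).to(model.device)
            hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
            feats.append(hidden.mean(dim=1).squeeze(0).float().cpu())  # mean-pool tokens
        return torch.stack(feats).numpy()

    # probe = LogisticRegression(max_iter=1000).fit(embed(train_prompts), train_labels)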

What they found

First, standard practice overstates robustness. When you split within the same dataset sources, a raw-activation probe reports 0.996 AUC. Under LODO the pooled AUC drops to 0.912. That 8.4 percentage point inflation is the headline. The detail is worse: individual held-out datasets see accuracy gaps from about 1% to 25.4%. An average hides painful failures on particular distributions.
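The inflation number is simply the difference between two evaluation protocols on the same features. A sketch of the within-source baseline, reusing the hypothetical datasets dictionary and lodo_auc helper from the earlier sketch:

    # Within-source baseline: pool everything and split at random, so train and
    # test share dataset provenance. Compare against the LODO pooled score.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X = np.vstack([X for X, _ in datasets.values()])
    y = np.concatenate([y for _, y in datasets.values()])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    within = roc_auc_score(
        y_te,
        LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

    lodo, per_dataset = lodo_auc(datasets)
    print(f"within-source AUC {within:.3f}, LODO AUC {lodo:.3f}, "
          f"inflation {100 * (within - lodo):.1f} points")  # paper reports 8.4 points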

Second, the feature analysis explains why. Across LODO folds, 28% of the top SAE features are dataset-dependent shortcuts. Some are pure provenance tells. Others are context-dependent, flipping from helpful to harmful as class balance changes across datasets. In other words, many detectors key off artefacts of how a benchmark was built, not the semantics of an attack. I like this analysis because it moves beyond scores to the mechanics of failure.
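The paper flags shortcuts by checking how each feature behaves across LODO folds. The exact retention score lives in the paper and code; the sketch below uses a simple sign-consistency proxy over per-fold probe coefficients to show the idea:

    # A feature whose coefficient flips sign or collapses depending on which
    # dataset is held out is likely keying on dataset composition, not semantics.
    import numpy as np

    def retention_scores(coef_per_fold):
        """coef_per_fold: (n_folds, n_features) array of probe coefficients."""
        signs = np.sign(coef_per_fold)
        dominant = np.sign(signs.sum(axis=0))          # majority sign per feature
        agreement = (signs == dominant).mean(axis=0)   # fraction of folds that agree
        scale = np.abs(coef_per_fold).mean(axis=0)
        return agreement * scale / (scale.max() + 1e-9)

    # Low-retention features are candidate shortcuts; the paper finds roughly
    # 28% of top SAE features land in this bucket.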

Third, the simplest signal travels furthest. Raw activations generalise better than SAE features under LODO, with pooled AUC 0.912 for raw activations, 0.838 for SAE, and 0.841 for the MLP on the same 8B model. Interpretability via SAE is still useful, but the authors show that weighting explanations by LODO stability filters out dataset-specific noise and yields more trustworthy rationales for a decision.
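Weighting explanations by LODO stability is then a re-ranking step: scale each feature's contribution for a given prompt by its retention before picking the top-k. A minimal sketch under the same assumptions:

    # Downweight unstable features so dataset-specific artefacts drop out of
    # the per-prompt explanation.
    import numpy as np

    def stable_top_features(sae_acts, coefficients, retention, k=5):
        """sae_acts: (n_features,) SAE activations for one prompt."""
        contribution = sae_acts * coefficients       # unweighted attribution
        weighted = contribution * retention          # retention from the LODO folds
        return np.argsort(-np.abs(weighted))[:k]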

Finally, they test production guardrails and LLM-as-judge baselines. On indirect attacks that target agent behaviour, detection rates fall into a low band between 7% and 37%. PromptGuard 2 and LlamaGuard also cannot evaluate agentic tool-injection scenarios because of architectural limits. If your agent consumes tool outputs, your guardrail may not even be looking in the right place.

There are limits. The approach needs white-box access to activations, so it fits open-weight or self-hosted models. Running LODO means training K classifiers for K datasets, which is easy for probes but not cheap for full fine-tuning. And while the SAE analysis surfaces shortcut features, it does not account for the entire generalisation gap; ablating those features has mixed effects.

The implication for practitioners is simple. Treat within-dataset scores with suspicion. If a detector matters to your risk model, evaluate it under LODO or an equivalent cross-source protocol. Where you can, prefer activation-based probes for robustness, then use LODO-stable features to explain decisions. The open question is whether we can bring the same discipline to closed models and to guardrails that need to reason over tool outputs and agent traces. Until then, assume distribution shift will find the cracks you did not test.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Authors: Max Fomin
Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy, exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compositions rather than semantic content. We systematically compare production guardrails (PromptGuard 2, LlamaGuard) and LLM-as-judge approaches on our benchmark, finding all three fail on indirect attacks targeting agents (7-37% detection) and that PromptGuard 2 and LlamaGuard cannot evaluate agentic tool injection due to architectural limitations. Finally, we show that LODO-stable SAE features provide more reliable explanations for classifier decisions by filtering dataset artifacts. We release our evaluation framework at https://github.com/maxf-zn/prompt-mining to establish LODO as the appropriate protocol for prompt attack detection research.

🔍 ShortSpan Analysis of the Paper

Problem

This paper evaluates how well classifiers detect prompt injection, jailbreaks and related malicious prompts when confronted with genuinely new data distributions. Current practice typically trains and tests on samples drawn from the same dataset sources, which lets models exploit dataset-specific artefacts rather than learn generalisable attack signals. This overstates robustness and poses a real security risk for agentic systems that process untrusted external content such as emails, documents, tool outputs and API responses.

Approach

The authors assemble a diverse benchmark of 18 datasets totalling 105k samples (47% malicious), spanning harmful requests, jailbreaks, indirect prompt injections and extraction attacks. They propose Leave-One-Dataset-Out (LODO) evaluation: for each dataset, train on the other datasets and test on the held-out dataset to measure true out-of-distribution generalisation. They extract residual-stream activations from Llama-3.1-8B-Instruct and also use sparse auto-encoder (SAE) features. Classifiers are simple logistic regression and MLP probes to isolate evaluation effects. They analyse SAE feature coefficients across LODO folds using a retention score to identify dataset-dependent shortcut features and compare against production guardrails and LLM-as-judge baselines.

Key Findings

  • Standard evaluation overestimates OOD performance: aggregate AUC is inflated by 8.4 percentage points when using same-source splits versus LODO (0.996 vs 0.912 AUC reported for a raw-activation probe).
  • Per-dataset performance gaps are heterogeneous, ranging from about 1% to 25.4% in accuracy across held-out datasets, so average metrics hide large failures on particular distributions.
  • Many features are dataset shortcuts: 28% of the top SAE features are dataset-dependent, split into pure shortcuts that reveal dataset provenance and context-dependent shortcuts whose class signal depends on dataset composition.
  • Raw activations generalise better than SAE features under LODO: pooled AUCs reported were 0.912 for raw activations, 0.838 for SAE and 0.841 for an MLP probe on the 8B model.
  • Production guardrails and LLM-as-judge approaches perform poorly on indirect, agentic attacks; detection rates for indirect attacks lie in a low band (reported 7% to 37% detection), and PromptGuard 2 and LlamaGuard cannot assess agentic tool-injection scenarios due to architectural limits.
  • LODO-weighted SAE explanations improve interpretability by downweighting dataset-specific features: most per-prompt top features change when retention weighting is applied, reducing reliance on shortcuts.

Limitations

The method requires white-box access to activations, so it applies directly only to open-weight or self-hosted models. LODO requires training K classifiers for K datasets, which is tractable for lightweight probes but costly for full fine-tuning. The SAE analysis reveals dataset-dependent features but does not fully explain all sources of the CV-LODO gap, and ablating the identified shortcuts has only mixed per-dataset effects.

Why It Matters

LODO exposes realistic generalisation failures that standard benchmarks mask, showing that detectors may fail when deployed on unseen attack types or domains. For security, this means current guardrails can miss indirect or agentic injections and that defenders should evaluate detectors with LODO-style protocols, favour activation-based probes where feasible, and use LODO-stable features to produce more trustworthy explanations. The authors release their benchmark and code to encourage adoption of LODO as the evaluation standard for prompt-attack detection.

