LLM detectors fail across models, study finds
Agents
Organisations now stitch together AI agents from shared tools and pre‑trained components. That convenience brings a clear supply‑chain risk: a backdoor in one agent can move into workflows and remain undetected if defenders assume all models behave the same. A recent paper systematically tests that assumption, and the results are unnerving.
What the paper did
The researchers generate 1,198 execution traces from six production large language models (LLMs): GPT‑5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT‑OSS 120B, and DeepSeek Chat V3.1. They inject controlled behavioural backdoors that activate via inputs or compromised tools, extract 51 behavioural features grouped into four categories, and train two classifiers, a linear support vector machine and a random forest, on benign and backdoored traces. The study runs 36 cross‑model experiments to measure how detectors transfer across model boundaries.
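To make the setup concrete, here is a minimal sketch of training those two detector types on one model's traces. It assumes a feature matrix already built from the 51 behavioural features plus benign/backdoored labels; the helper name, the split and the hyperparameters are illustrative, not the authors' released framework.

```python
# Minimal sketch of the per-model detector training step. The feature matrix
# and labels are assumed to come from a hypothetical feature-extraction stage
# over execution traces; hyperparameters are illustrative, not the paper's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def train_detectors(features: np.ndarray, labels: np.ndarray) -> dict:
    """Train the two classifier types used in the study on one model's traces."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, stratify=labels, random_state=0
    )
    svm = LinearSVC(C=1.0, max_iter=10_000).fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    return {
        "linear_svm": (svm, svm.score(X_test, y_test)),
        "random_forest": (forest, forest.score(X_test, y_test)),
    }
```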
The headline numbers are blunt. A detector trained and tested on the same model achieves 92.7% accuracy on average. Swap models and accuracy falls to 49.2%, a gap of 43.4 percentage points — effectively random guessing.
Why detectors break
The paper traces the failure to model‑specific behavioural signatures. Structural features, such as sequence patterns, prove comparatively stable across architectures. Temporal features do not. Timing‑related signals show very high variability, with coefficients of variation above 0.8, so behaviour that looks suspicious on one model may look normal on another simply because the models produce outputs on different schedules or with different latency characteristics. The finding is practical, not philosophical: detectors learn the quirks of a runtime as much as they learn malicious patterns.
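A minimal sketch of that instability check, assuming each trace records its generating model and a timing feature such as mean step latency (both field names are hypothetical): the coefficient of variation is simply the standard deviation divided by the mean, computed per model.

```python
# Sketch of the per-model timing-variability check. A coefficient of variation
# (std / mean) above roughly 0.8 marks the kind of model-specific timing drift
# the paper reports. Trace fields "model" and "mean_step_latency" are assumed.
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean if mean else float("inf")


def timing_cv_by_model(traces: list[dict]) -> dict[str, float]:
    """Group a timing feature by generating model and report each group's CV."""
    by_model: dict[str, list[float]] = {}
    for trace in traces:
        by_model.setdefault(trace["model"], []).append(trace["mean_step_latency"])
    return {model: coefficient_of_variation(vals) for model, vals in by_model.items()}
```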
Not all fixes are theoretical. The authors show that adding model identity as an explicit feature — making detectors model aware — lifts universal accuracy to 90.6% across the six evaluated LLMs. Ensemble voting and pooled training underperform compared with a simple model‑aware approach. The paper also contributes a publicly released multi‑LLM trace dataset and a detection framework to support reproducible research.
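A minimal sketch of the model‑aware variant, assuming model identity is appended as a one‑hot block to each behavioural feature vector before a single universal detector is trained; the paper states only that identity is added as a feature, so this particular encoding is an assumption.

```python
# Sketch of making a detector model-aware: concatenate a one-hot encoding of
# the generating model onto each behavioural feature row, then train one
# universal classifier on the augmented matrix. The encoding is an assumption.
import numpy as np


def add_model_identity(features: np.ndarray, model_names: list[str],
                       known_models: list[str]) -> np.ndarray:
    """Append a one-hot model-identity block to each feature row."""
    identity = np.zeros((len(model_names), len(known_models)))
    for row, name in enumerate(model_names):
        identity[row, known_models.index(name)] = 1.0
    return np.hstack([features, identity])
```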
The work has clear limitations the authors acknowledge. Traces are from a controlled environment and may not cover all real‑world behaviours. Detection assumes you can reliably identify the generating model at inference time, and major model updates will probably force retraining. Timing features may reflect infrastructure differences as well as model internals, so teams should avoid overfitting to their testbed.
For security teams, the takeaway is straightforward and practical: single‑model detectors are a brittle defence in heterogeneous AI supply chains. Provenance, multi‑model testing and model‑aware detection materially improve your chances of catching behavioural backdoors before they exfiltrate data or disrupt automation.
Actionable checks teams can run:
- Verify model provenance for every agent and tool.
- Run cross‑LLM detector tests using traces from all deployed models.
- Add model identity as a feature and monitor temporal feature drift over time (a minimal drift check is sketched below).
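A minimal sketch of that last check, comparing a recent window of a timing feature against a stored baseline; the z‑score threshold and the baseline format are operational assumptions rather than anything the paper prescribes.

```python
# Sketch of a temporal-drift alert: flag when the recent mean of a timing
# feature moves more than `threshold` baseline standard deviations away from
# the baseline mean. Threshold and windowing policy are assumptions.
import statistics


def temporal_drift_alert(baseline: list[float], recent: list[float],
                         threshold: float = 3.0) -> bool:
    """Return True when the recent mean drifts beyond the allowed z-score."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        return statistics.mean(recent) != base_mean
    return abs(statistics.mean(recent) - base_mean) / base_std > threshold
```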
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies cross-LLM generalisation of behavioural backdoor detection within AI agent supply chains. As organisations rely on multiple large language models and shared tool libraries, backdoors in agents can propagate across models; understanding whether detectors trained on one LLM can detect backdoors in others is vital for multi‑LLM environments and supply-chain security.
Approach
The authors perform a systematic cross‑LLM evaluation using six production LLMs (GPT‑5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT‑OSS 120B, and DeepSeek Chat V3.1) and 1,198 execution traces across 36 cross‑model experiments. They extract 51 behavioural features organised into four categories and train two classifiers, a linear support vector machine and a random forest, on benign and backdoored traces. A controlled backdoor injection framework creates realistic supply‑chain‑style attacks with triggers in inputs or compromised tools, and traces are collected from 100 diverse tasks drawn from AgentBench. The study also examines detection strategies including baseline single‑model detectors, pooled training, ensemble voting, and a model‑aware detector that includes model identity as a feature. The authors release their multi‑LLM trace dataset and detection framework to support reproducible research.
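As a sketch of that evaluation grid, the following trains a detector on each model's traces and scores it on every model's traces, producing a 6×6 accuracy matrix of 36 ordered train/test pairs with the diagonal as the same‑model baseline; the per‑model trace arrays and the choice of a linear SVM for the loop are assumptions, not the released framework's API.

```python
# Sketch of the cross-model evaluation grid: one detector per training model,
# scored against every model's traces. Off-diagonal cells expose the
# generalisation gap; the diagonal is the same-model baseline. The
# traces_by_model mapping and the linear-SVM choice are assumptions.
import numpy as np
from sklearn.svm import LinearSVC

MODELS = ["GPT-5.1", "Claude Sonnet 4.5", "Grok 4.1",
          "Llama 4 Maverick", "GPT-OSS 120B", "DeepSeek Chat V3.1"]


def cross_model_matrix(traces_by_model: dict[str, tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
    """Return accuracy[i][j] for a detector trained on MODELS[i], tested on MODELS[j]."""
    accuracy = np.zeros((len(MODELS), len(MODELS)))
    for i, train_model in enumerate(MODELS):
        X_train, y_train = traces_by_model[train_model]
        detector = LinearSVC(max_iter=10_000).fit(X_train, y_train)
        for j, test_model in enumerate(MODELS):
            X_test, y_test = traces_by_model[test_model]
            accuracy[i, j] = detector.score(X_test, y_test)
    return accuracy
```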
Key Findings
- Detectors trained on a single model achieve 92.7% accuracy within the training distribution but 49.2% across different LLMs, a 43.4 percentage point generalisation gap equivalent to random guessing.
- The cross‑LLM gap is driven by model specific behavioural signatures, with temporal features showing high variability (coefficient of variation greater than 0.8) while structural features such as sequence patterns remain comparatively stable across architectures.
- Incorporating model identity as an additional feature yields a model‑aware detector with 90.6% universal accuracy across all evaluated models.
- The cross‑model evaluation spans six LLMs from five providers and 1,198 traces; same‑model accuracy ranges from 82% to 100%, and cross‑model transfer is strongest between GPT‑OSS and GPT‑5.1 at 54.0%.
- Ensemble approaches underperform relative to model‑aware detection, which provides notably better cross‑model performance and consistency.
- The study contributes an open multi‑LLM behavioural trace dataset and a detection framework to enable reproducible cross‑model security research.
Limitations
The work relies on traces generated in a controlled environment and may not capture all real‑world attack variations; adaptive adversaries could attempt to evade the proposed features. Detection assumes knowledge of the generating LLM at inference time, and major model updates may necessitate retraining. The 51 features may be complemented or replaced by alternative approaches to achieve cross‑LLM generalisation, and findings may not directly generalise to fundamentally different architectures or self‑hosted deployments. The dataset, while substantial, may not cover the full diversity of production agent behaviours, and timing features may reflect infrastructure as well as model effects.
Why It Matters
Practically, the work shows that detectors trained on a single LLM are ineffective for multi‑LLM deployments, underscoring the need for cross‑model provenance, multi‑LLM testing, and shared datasets and tools to monitor backdoors in enterprise toolchains. A model‑aware detection approach offers a practical path forward, achieving about 90% universal accuracy and enabling protection across heterogeneous AI ecosystems. The results have security implications for preventing covert manipulation or data leakage in automated pipelines across industry, highlighting the importance of cross‑model governance and supply‑chain security in AI‑powered workflows.