LLM detectors fail across models, study finds
Agents
Organisations now stitch together AI agents from shared tools and pre‑trained components. That convenience brings a clear supply‑chain risk: a backdoor in one agent can move into workflows and remain undetected if defenders assume all models behave the same. A recent paper systematically tests that assumption, and the results are unnerving.
What the paper did
The researchers generate 1,198 execution traces from six production large language models (LLMs): GPT‑5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT‑OSS 120B, and DeepSeek Chat V3.1. They inject controlled behavioural backdoors that activate via inputs or compromised tools, extract 51 behavioural features grouped into four categories, and train two classifiers, a linear support vector machine and a random forest, on benign and backdoored traces. The study runs 36 cross‑model experiments to measure how detectors transfer across model boundaries.
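To make the setup concrete, here is a minimal sketch of training those two detector types on one model's traces. It assumes a feature matrix already built from the 51 behavioural features plus benign/backdoored labels; the helper name, the split and the hyperparameters are illustrative, not the authors' released framework.

```python
# Minimal sketch of the per-model detector training step. The feature matrix
# and labels are assumed to come from a hypothetical feature-extraction stage
# over execution traces; hyperparameters are illustrative, not the paper's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def train_detectors(features: np.ndarray, labels: np.ndarray) -> dict:
    """Train the two classifier types used in the study on one model's traces."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, stratify=labels, random_state=0
    )
    svm = LinearSVC(C=1.0, max_iter=10_000).fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    return {
        "linear_svm": (svm, svm.score(X_test, y_test)),
        "random_forest": (forest, forest.score(X_test, y_test)),
    }
```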
The headline numbers are blunt. A detector trained and tested on the same model achieves 92.7% accuracy on average. Swap models and accuracy falls to 49.2%, a gap of 43.4 percentage points — effectively random guessing.
Why detectors break
The paper traces the failure to model‑specific behavioural signatures. Structural features, such as sequence patterns, prove comparatively stable across architectures. Temporal features do not. Timing‑related signals show very high variability, with coefficients of variation above 0.8, so behaviour that looks suspicious on one model may look normal on another simply because the models produce outputs on different schedules or with different latency characteristics. The finding is practical, not philosophical: detectors learn the quirks of a runtime as much as they learn malicious patterns.
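A minimal sketch of that instability check, assuming each trace records its generating model and a timing feature such as mean step latency (both field names are hypothetical): the coefficient of variation is simply the standard deviation divided by the mean, computed per model.

```python
# Sketch of the per-model timing-variability check. A coefficient of variation
# (std / mean) above roughly 0.8 marks the kind of model-specific timing drift
# the paper reports. Trace fields "model" and "mean_step_latency" are assumed.
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean if mean else float("inf")


def timing_cv_by_model(traces: list[dict]) -> dict[str, float]:
    """Group a timing feature by generating model and report each group's CV."""
    by_model: dict[str, list[float]] = {}
    for trace in traces:
        by_model.setdefault(trace["model"], []).append(trace["mean_step_latency"])
    return {model: coefficient_of_variation(vals) for model, vals in by_model.items()}
```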
Not all fixes are theoretical. The authors show that adding model identity as an explicit feature — making detectors model aware — lifts universal accuracy to 90.6% across the six evaluated LLMs. Ensemble voting and pooled training underperform compared with a simple model‑aware approach. The paper also contributes a publicly released multi‑LLM trace dataset and a detection framework to support reproducible research.
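A minimal sketch of the model‑aware variant, assuming model identity is appended as a one‑hot block to each behavioural feature vector before a single universal detector is trained; the paper states only that identity is added as a feature, so this particular encoding is an assumption.

```python
# Sketch of making a detector model-aware: concatenate a one-hot encoding of
# the generating model onto each behavioural feature row, then train one
# universal classifier on the augmented matrix. The encoding is an assumption.
import numpy as np


def add_model_identity(features: np.ndarray, model_names: list[str],
                       known_models: list[str]) -> np.ndarray:
    """Append a one-hot model-identity block to each feature row."""
    identity = np.zeros((len(model_names), len(known_models)))
    for row, name in enumerate(model_names):
        identity[row, known_models.index(name)] = 1.0
    return np.hstack([features, identity])
```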
The work has clear limitations the authors acknowledge. Traces are from a controlled environment and may not cover all real‑world behaviours. Detection assumes you can reliably identify the generating model at inference time, and major model updates will probably force retraining. Timing features may reflect infrastructure differences as well as model internals, so teams should avoid overfitting to their testbed.
For security teams, the takeaway is straightforward and practical: single‑model detectors are a brittle defence in heterogeneous AI supply chains. Provenance, multi‑model testing and model‑aware detection materially improve your chances of catching behavioural backdoors before they exfiltrate data or disrupt automation.
Actionable checks teams can run:
- Verify model provenance for every agent and tool.
- Run cross‑LLM detector tests using traces from all deployed models.
- Add model identity as a feature and monitor temporal feature drift over time (a minimal drift check is sketched below).
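A minimal sketch of that last check, comparing a recent window of a timing feature against a stored baseline; the z‑score threshold and the baseline format are operational assumptions rather than anything the paper prescribes.

```python
# Sketch of a temporal-drift alert: flag when the recent mean of a timing
# feature moves more than `threshold` baseline standard deviations away from
# the baseline mean. Threshold and windowing policy are assumptions.
import statistics


def temporal_drift_alert(baseline: list[float], recent: list[float],
                         threshold: float = 3.0) -> bool:
    """Return True when the recent mean drifts beyond the allowed z-score."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        return statistics.mean(recent) != base_mean
    return abs(statistics.mean(recent) - base_mean) / base_std > threshold
```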
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies cross-LLM generalisation of behavioural backdoor detection within AI agent supply chains. As organisations rely on multiple large language models and shared tool libraries, backdoors in agents can propagate across models; understanding whether detectors trained on one LLM can detect backdoors in others is vital for multi‑LLM environments and supply-chain security.
Approach
The authors perform a systematic cross‑LLM evaluation using six production LLMs (GPT‑5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT‑OSS 120B, and DeepSeek Chat V3.1) and 1,198 execution traces across 36 cross‑model experiments. They extract 51 behavioural features organised into four categories and train two classifiers, a linear support vector machine and a random forest, on benign and backdoored traces. A controlled backdoor injection framework creates realistic supply‑chain‑style attacks with triggers in inputs or compromised tools, and traces are collected from 100 diverse tasks drawn from AgentBench. The study also examines detection strategies including baseline single‑model detectors, pooled training, ensemble voting, and a model‑aware detector that includes model identity as a feature. The authors release their multi‑LLM trace dataset and detection framework to support reproducible research.
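As a sketch of that evaluation grid, the following trains a detector on each model's traces and scores it on every model's traces, producing a 6×6 accuracy matrix of 36 ordered train/test pairs with the diagonal as the same‑model baseline; the per‑model trace arrays and the choice of a linear SVM for the loop are assumptions, not the released framework's API.

```python
# Sketch of the cross-model evaluation grid: one detector per training model,
# scored against every model's traces. Off-diagonal cells expose the
# generalisation gap; the diagonal is the same-model baseline. The
# traces_by_model mapping and the linear-SVM choice are assumptions.
import numpy as np
from sklearn.svm import LinearSVC

MODELS = ["GPT-5.1", "Claude Sonnet 4.5", "Grok 4.1",
          "Llama 4 Maverick", "GPT-OSS 120B", "DeepSeek Chat V3.1"]


def cross_model_matrix(traces_by_model: dict[str, tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
    """Return accuracy[i][j] for a detector trained on MODELS[i], tested on MODELS[j]."""
    accuracy = np.zeros((len(MODELS), len(MODELS)))
    for i, train_model in enumerate(MODELS):
        X_train, y_train = traces_by_model[train_model]
        detector = LinearSVC(max_iter=10_000).fit(X_train, y_train)
        for j, test_model in enumerate(MODELS):
            X_test, y_test = traces_by_model[test_model]
            accuracy[i, j] = detector.score(X_test, y_test)
    return accuracy
```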
Key Findings
- Detectors trained on a single model achieve 92.7% accuracy within the training distribution but 49.2% across different LLMs, a 43.4 percentage point generalisation gap equivalent to random guessing.
- The cross‑LLM gap is driven by model specific behavioural signatures, with temporal features showing high variability (coefficient of variation greater than 0.8) while structural features such as sequence patterns remain comparatively stable across architectures.
- Incorporating model identity as an additional feature yields a model‑aware detector with 90.6% universal accuracy across all evaluated models.
- The cross‑model evaluation spans six LLMs from five providers and 1,198 traces; same‑model accuracy ranges from 82% to 100%, and cross‑model transfer is strongest between GPT‑OSS and GPT‑5.1 at 54.0%.
- Ensemble approaches underperform relative to model‑aware detection, which provides notably better cross‑model performance and consistency.
- The study contributes an open multi‑LLM behavioural trace dataset and a detection framework to enable reproducible cross‑model security research.
Limitations
The work relies on traces generated in a controlled environment and may not capture all real‑world attack variations; adaptive adversaries could attempt to evade the proposed features. Detection assumes knowledge of the generating LLM at inference time, and major model updates may necessitate retraining. The 51 features may be complemented or replaced by alternative approaches to achieve cross‑LLM generalisation, and findings may not directly generalise to fundamentally different architectures or self‑hosted deployments. The dataset, while substantial, may not cover the full diversity of production agent behaviours, and timing features may reflect infrastructure as well as model effects.
Why It Matters
Practically, the work shows that detectors trained on a single LLM are ineffective for multi‑LLM deployments, underscoring the need for cross‑model provenance, multi‑LLM testing, and shared datasets and tools to monitor backdoors in enterprise toolchains. A model‑aware detection approach offers a practical path forward, achieving about 90% universal accuracy and enabling protection across heterogeneous AI ecosystems. The results have security implications for preventing covert manipulation or data leakage in automated pipelines across industry, highlighting the importance of cross‑model governance and supply‑chain security in AI‑powered workflows.