ShortSpan.ai logo

Trojan backdoors seize control of local LLM agents

Agents
Published: Mon, Jun 01, 2026 • By Adrian Calder
Trojan backdoors seize control of local LLM agents
New research shows local LLM agents are wide open to multi-step prompt-injection backdoors. In a simulated workspace with GPT-5.4, the ClawTrojan benchmark achieves 95.5% success by planting control-like text in files that agents later execute. A proposed defence, DASGuard, traces provenance and sanitises writes, but testing is sandboxed and model-specific.

Agent security keeps shifting left. The interesting failures no longer happen in a chat window, they happen in the local workspace where Large Language Models (LLMs) read and write files, call tools, and keep state. That convenience creates a quieter problem: plant one line of control-like text today, take the wheel tomorrow. A new benchmark, ClawTrojan, quantifies this. In an OpenClaw-style local setup with GPT-5.4, multi-step trojans reach a 95.5% attack success rate; the same model shrugs off single-turn prompt-injection tests.

How the attack lands

An attacker supplies a file or tool output that looks like ordinary data but carries instructions: a comment block, a TODO, a policy snippet. The agent later reads it, stores it in memory or another file, and eventually treats it as instructions. No individual step screams malicious. The effect is persistent influence that survives sessions and resurfaces as plans, targets or external actions. The benchmark covers memory poisoning, trust laundering, skill poisoning, and fragmented or delayed chains.

Why do common defences miss it? Most filters stare at the current prompt, the immediate action, or a single capability call. They can block an obvious exfil step, but they do not remember who wrote the stray line in a local file, or when. Without provenance across steps, the system cannot connect the write that plants the payload to the read that activates it.

What the defence tries

The proposed DASGuard sits at the harness boundary. It Detects control-like spans in sensitive files using rules, embeddings and prior-finding signals; Attributes them by building a content-source graph; and Sanitises writes by stripping or shadow-committing content that lacks a trusted origin. It also blocks external actions triggered by untrusted control. In the experiments, this combination meaningfully cuts long-horizon compromise. Ablations point to provenance labels doing most of the work, with embedding and memory matching helping. On clean or borderline tasks it errs on the side of caution, generating a moderate false-block rate that can be kicked to user review.

So what? If an attacker can feed your agent’s workspace, they can steer it later. That moves provenance, workspace hygiene and sanitised commits from nice-to-have to table stakes. Caveats apply: the results come from synthetic sandboxes, specific models and limited clean-task coverage, and the defence assumes you can instrument the harness to label sources and intercept writes. The open questions are the practical ones: how far provenance survives real toolchains, what false-positive budgets teams can tolerate, and how to recover once the state is already dirty. Watch this space.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Authors: Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, and Ji-Rong Wen
LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies multi-step trojan attacks against local LLM agent harnesses, where attackers place seemingly benign text into files or tool outputs that an agent later reads, stores and executes as control content. Such planted content can persist across sessions and reappear as instructions, policies or action targets, creating a long-lived backdoor that single-step prompt-injection defences often miss. This threat matters because agentic systems increasingly read and write local workspace state, call tools and reuse memory, expanding the attack surface beyond single-turn conversations.

Approach

The authors introduce ClawTrojan, a runnable benchmark of multi-step trojan chains implemented in OpenClaw-style sandbox workspaces. The release contains 362 samples (339 positive attack samples and 23 clean or borderline controls), 1672 annotated step rows and a positive-split of 919 malicious runnable step environments. Attack families include memory poisoning, trust laundering, skill poisoning, fragmented and delayed patterns. They also propose DASGuard, a dynamic Detect-Attribute-Sanitise defence placed on the harness boundary. DASGuard builds a content-source graph to label provenance, detects control-like spans using rule, embedding and prior-finding signals R(s), E(s) and M(s), attributes spans to sources and destinations, and either blocks, sanitises or commits a shadow copy of file writes. Evaluation uses GPT-5.4 as the primary base agent and compares several baseline defences that inspect prompts, actions or capabilities.

Key Findings

  • ClawTrojan shows high success for multi-step trojans: in an OpenClaw-style simulated workspace with GPT-5.4 the benchmark attains a 95.5% attack success rate across chains, while existing single-turn prompt-injection benchmarks produce near-zero ASR on the same models.
  • Common defences that inspect single steps or current context (prompt-formatting, detector gates, action gating and capability checks) only modestly reduce chain penetration because they do not carry provenance across steps or sanitise planted workspace state.
  • DASGuard substantially reduces long-horizon compromise by carrying source labels and prior findings across steps, sanitising shadow commits and blocking external actions; ablations show removing provenance labels causes the largest loss of defence effectiveness, while removing embedding or memory matching also degrades performance.
  • ClawTrojan annotations facilitate targeted measurement: the positive split includes 95 unauthorised-disclosure, 90 external side-effect, 86 task-deviation and 68 document-falsification samples, enabling per-outcome analysis.
  • On clean or borderline tasks, DASGuard lowers ASR while maintaining a moderate false-block rate; most false blocks concern cautious handling of borderline local artefacts and can be routed to user review.

Limitations

The benchmark and experiments run in synthetic OpenClaw-style sandboxes and use specific model instances, so results reflect this threat model rather than all real-world deployments. Clean-task coverage is limited (23 negative samples) and harness capabilities are assumed: DASGuard requires content-source labelling, write observation and sanitisation hooks. Adaptive attacker evaluation and recovery from already-compromised state are left for future work.

Implications

An attacker with the ability to supply files or tool outputs to a local harness can plant persistent control content that later grants sustained influence over agent behaviour. Practical offensive outcomes include falsifying documents, deviating task plans, inducing external side effects and causing unauthorised disclosure. Because attacks can be split across turns and files, defenders that inspect only the immediate prompt or action risk missing the planting step. The work highlights that provenance, workspace integrity and sanitised commits are crucial security controls for agentic systems exposed to untrusted local content.


Related Articles

Related Research

Get the Weekly AI Security Digest

Top research and analysis delivered to your inbox every week. No spam, unsubscribe anytime.