New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email

// Analysis

AgentWatcher spots prompt injection in LLM agents

Published: Thu, Apr 02, 2026 • By Lydia Stratus

Agents

AgentWatcher monitors Large Language Model agents for prompt injection by attributing actions to small, influential context windows, then applying explicit rules with a monitor LLM. Results cut attack success, including on long contexts, with modest utility loss, but add 8–10 seconds per check. Use selectively and curate rules.

Prompt injection against Large Language Model (LLM) agents is not a thought experiment. It is what happens when an agent reads a poisoned wiki page or an email thread and then decides to email credentials to “verify access”. The research community has thrown a pile of detectors at this, but many fall over once context gets long, and too many act like black boxes. That is not helpful at 3am when you need to explain why a run was blocked.

AgentWatcher takes a more operationally friendly approach. It first attributes an agent’s proposed action to a handful of the most causally influential text spans in the history and retrieved context. That trims a messy, thousand-line context down to a compact slice that probably drove the action. It then passes that slice, the task summary and the action into a separate monitor LLM that reasons over an explicit rule set defining what does and does not count as prompt injection. The output is a binary decision and a short reasoning trace you can audit. The authors also show you can fine-tune the monitor with GRPO to improve policy adherence, and even auto-generate rules if you lack time for handcrafting.

On agent and long-context benchmarks, the results are solid. On AgentDojo, attack success dropped to at most 1% with about a 2% utility loss. Across long-context tasks, it was the only method that consistently kept attack success at or below 10%. The approach generalises across different backbone and attribution LLMs, with utility loss typically below 4%. The price is latency: around 8 to 10 seconds per detection. The paper advises selective invocation, which is the right call in production.

Where this fits in your stack

If you run agents that call tools, put AgentWatcher between the model and anything that mutates state. Gating file operations, ticketing APIs, data deletion, outbound network calls and anything that touches secrets is the priority. Trigger the monitor only when an action crosses a boundary you care about; do not run it on every chat turn. For retrieval-augmented generation, run it when the retrieved corpus includes untrusted web content, user-supplied documents or long organisational wikis, since those are common injection venues.

On the serving side, treat the monitor as a sidecar service your orchestrator calls with the attributed windows, the action and the task. Log the decision, the rules referenced and the extracted malicious instruction into your SIEM for incident review. The attribution step helps here: you store the minimal causal text rather than the entire transcript, which cuts noise and makes post-mortems faster.

For GPU clusters, the monitor itself is another LLM call, so plan capacity and timeouts. Keep it on a separate autoscaled pool so a flood of monitors does not starve your main inference path. If you enforce “must-pass” for high-risk actions, add a circuit breaker so timeouts fail closed. For secrets management, make the monitor’s allow decision a prerequisite for any token unseal or vault read, and record the rule hits alongside the access log entry.

Operational trade-offs

This works because it narrows the text the detector must reason over and makes the policy explicit. It fails if your rules are vague or your allow-list is too strict. Adaptive attackers can still pad or camouflage malicious text, so expect some false positives on benign but bossy instructions. The paper shows robustness improves as you increase the number of attributed windows, at the cost of more text through the monitor and therefore more latency. That is a knob you can tune per action class.

Practical path forward: start with a minimal rule set focused on tool-use boundaries, enable monitoring on a small set of high-risk actions, and iterate based on alerts. Use the built-in reasoning trace for analyst review and to refine rules. The authors report code is available, and the approach is transparent enough to slot into existing agent frameworks without rewiring your estate. Just do not pretend it is free: budget the extra 8 to 10 seconds where it matters, and keep pressure-testing your rules against adaptive prompts. Better an annoyed user than a 3am credentials drill.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

AgentWatcher: A Rule-based Prompt Injection Monitor

Authors: Yanting Wang, Wei Zou, Runpeng Geng, and Jinyuan Jia

Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations: (1) their effectiveness degrades significantly as context length increases, and (2) they lack explicit rules that define what constitutes prompt injection, causing detection decisions to be implicit, opaque, and difficult to reason about. In this work, we propose AgentWatcher to address the above two limitations. To address the first limitation, AgentWatcher attributes the LLM's output (e.g., the action of an agent) to a small set of causally influential context segments. By focusing detection on a relatively short text, AgentWatcher can be scalable to long contexts. To address the second limitation, we define a set of rules specifying what does and does not constitute a prompt injection, and use a monitor LLM to reason over these rules based on the attributed text, making the detection decisions more explainable. We conduct a comprehensive evaluation on tool-use agent benchmarks and long-context understanding datasets. The experimental results demonstrate that AgentWatcher can effectively detect prompt injection and maintain utility without attacks. The code is available at https://github.com/wang-yanting/AgentWatcher.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies prompt injection attacks against large language model driven agents and the limitations of current detection methods. Existing detectors lose effectiveness as context length grows and often make implicit, hard to audit decisions because they lack explicit rules that define prompt injection. These limits reduce scalability to long-context agent applications and make reasoning about detections difficult for security teams.

Approach

AgentWatcher is a two-phase, rule-grounded detection monitor. Phase one attributes a proposed agent action to a compact subset of the concatenated external contexts by analysing attention patterns from an attribution LLM. It locates high-attention sink tokens using a small sliding window and expands surrounding spans to form K non-overlapping attributed windows, whose concatenation is the focused context. Phase two feeds that attributed context, the user task summary and the agent action to a monitor LLM that applies an explicit, customisable set of rules and outputs both a binary detection decision and a reasoning trace. The monitor can be fine-tuned with a policy optimisation method (GRPO) using a BLEU-based reward that encourages correct detection and accurate extraction of the injected instruction. The design supports automatic rule generation and allows the monitor to reference rules during its reasoning. Evaluations use multiple agent and long-context benchmarks and compare against state-of-the-art detection methods.

Key Findings

Attribution to compact context segments improves scalability to long histories and avoids fragmenting injected instructions across fixed partitions.
AgentWatcher outperforms prior detectors on agent and long-context benchmarks, achieving notably low attack success rates while preserving task utility. On AgentDojo it reduced attack success to at most 1% with about a 2% utility loss; across long-context tasks it was the only method that consistently kept attack success at or below 10%.
The method generalises across backbone and attribution LLMs, often reducing attack success to near zero with utility loss typically below 4%.
GRPO fine-tuning of the monitor further lowers attack success while preserving utility; automatic rule generation yields comparable performance to human-crafted rules with trade-offs between utility and security depending on strategy.
The system is broadly insensitive to core hyperparameters such as window size and expansion lengths, while increasing the number of attributed windows improves robustness.
AgentWatcher incurs higher runtime cost than non-LLM detectors, around 8 to 10 seconds per detection, so selective invocation is recommended.

Limitations

Main constraints are added computational latency due to LLM-based reasoning, sensitivity to the quality of rules, and remaining vulnerability to adaptive attacks. The monitor can be influenced if attackers surround a malicious instruction with adaptive text, and benign instructions that resemble malicious ones can produce false positives. Practical deployment therefore requires selective triggering, rule curation, and further study of strong adaptive adversaries.

Why It Matters

AgentWatcher offers a practical, auditable detection layer for LLM agents by combining causal attribution with explicit rule-based reasoning. Its focus on short, causally influential text makes detection scalable to long-context agents and its rule-guided outputs improve interpretability and auditability for security teams. The approach complements prevention methods and can be integrated selectively to protect high-risk actions, though operators must weigh increased latency and the need to harden rule sets against adaptive attacks.

Links Original paper on arXiv

AgentWatcher spots prompt injection in LLM agents

Where this fits in your stack

Operational trade-offs

📋 Original Paper Title and Abstract

AgentWatcher: A Rule-based Prompt Injection Monitor

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Why It Matters

Related Articles

Study Exposes Prompt Injection Risks for LLM Agents

Stateful monitoring catches distributed LLM agent attacks

IterInject weaponises feedback to hijack LLM agents

Related Research

Get the weekly digest