AgentWatcher spots prompt injection in LLM agents
Agents
Prompt injection against Large Language Model (LLM) agents is not a thought experiment. It is what happens when an agent reads a poisoned wiki page or an email thread and then decides to email credentials to “verify access”. The research community has thrown a pile of detectors at this, but many fall over once context gets long, and too many act like black boxes. That is not helpful at 3am when you need to explain why a run was blocked.
AgentWatcher takes a more operationally friendly approach. It first attributes an agent’s proposed action to a handful of the most causally influential text spans in the history and retrieved context. That trims a messy, thousand-line context down to a compact slice that probably drove the action. It then passes that slice, the task summary and the action into a separate monitor LLM that reasons over an explicit rule set defining what does and does not count as prompt injection. The output is a binary decision and a short reasoning trace you can audit. The authors also show you can fine-tune the monitor with GRPO to improve policy adherence, and even auto-generate rules if you lack time for handcrafting.
On agent and long-context benchmarks, the results are solid. On AgentDojo, attack success dropped to at most 1% with about a 2% utility loss. Across long-context tasks, it was the only method that consistently kept attack success at or below 10%. The approach generalises across different backbone and attribution LLMs, with utility loss typically below 4%. The price is latency: around 8 to 10 seconds per detection. The paper advises selective invocation, which is the right call in production.
Where this fits in your stack
If you run agents that call tools, put AgentWatcher between the model and anything that mutates state. Gating file operations, ticketing APIs, data deletion, outbound network calls and anything that touches secrets is the priority. Trigger the monitor only when an action crosses a boundary you care about; do not run it on every chat turn. For retrieval-augmented generation, run it when the retrieved corpus includes untrusted web content, user-supplied documents or long organisational wikis, since those are common injection venues.
On the serving side, treat the monitor as a sidecar service your orchestrator calls with the attributed windows, the action and the task. Log the decision, the rules referenced and the extracted malicious instruction into your SIEM for incident review. The attribution step helps here: you store the minimal causal text rather than the entire transcript, which cuts noise and makes post-mortems faster.
For GPU clusters, the monitor itself is another LLM call, so plan capacity and timeouts. Keep it on a separate autoscaled pool so a flood of monitors does not starve your main inference path. If you enforce “must-pass” for high-risk actions, add a circuit breaker so timeouts fail closed. For secrets management, make the monitor’s allow decision a prerequisite for any token unseal or vault read, and record the rule hits alongside the access log entry.
Operational trade-offs
This works because it narrows the text the detector must reason over and makes the policy explicit. It fails if your rules are vague or your allow-list is too strict. Adaptive attackers can still pad or camouflage malicious text, so expect some false positives on benign but bossy instructions. The paper shows robustness improves as you increase the number of attributed windows, at the cost of more text through the monitor and therefore more latency. That is a knob you can tune per action class.
Practical path forward: start with a minimal rule set focused on tool-use boundaries, enable monitoring on a small set of high-risk actions, and iterate based on alerts. Use the built-in reasoning trace for analyst review and to refine rules. The authors report code is available, and the approach is transparent enough to slot into existing agent frameworks without rewiring your estate. Just do not pretend it is free: budget the extra 8 to 10 seconds where it matters, and keep pressure-testing your rules against adaptive prompts. Better an annoyed user than a 3am credentials drill.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
AgentWatcher: A Rule-based Prompt Injection Monitor
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies prompt injection attacks against large language model driven agents and the limitations of current detection methods. Existing detectors lose effectiveness as context length grows and often make implicit, hard to audit decisions because they lack explicit rules that define prompt injection. These limits reduce scalability to long-context agent applications and make reasoning about detections difficult for security teams.
Approach
AgentWatcher is a two-phase, rule-grounded detection monitor. Phase one attributes a proposed agent action to a compact subset of the concatenated external contexts by analysing attention patterns from an attribution LLM. It locates high-attention sink tokens using a small sliding window and expands surrounding spans to form K non-overlapping attributed windows, whose concatenation is the focused context. Phase two feeds that attributed context, the user task summary and the agent action to a monitor LLM that applies an explicit, customisable set of rules and outputs both a binary detection decision and a reasoning trace. The monitor can be fine-tuned with a policy optimisation method (GRPO) using a BLEU-based reward that encourages correct detection and accurate extraction of the injected instruction. The design supports automatic rule generation and allows the monitor to reference rules during its reasoning. Evaluations use multiple agent and long-context benchmarks and compare against state-of-the-art detection methods.
Key Findings
- Attribution to compact context segments improves scalability to long histories and avoids fragmenting injected instructions across fixed partitions.
- AgentWatcher outperforms prior detectors on agent and long-context benchmarks, achieving notably low attack success rates while preserving task utility. On AgentDojo it reduced attack success to at most 1% with about a 2% utility loss; across long-context tasks it was the only method that consistently kept attack success at or below 10%.
- The method generalises across backbone and attribution LLMs, often reducing attack success to near zero with utility loss typically below 4%.
- GRPO fine-tuning of the monitor further lowers attack success while preserving utility; automatic rule generation yields comparable performance to human-crafted rules with trade-offs between utility and security depending on strategy.
- The system is broadly insensitive to core hyperparameters such as window size and expansion lengths, while increasing the number of attributed windows improves robustness.
- AgentWatcher incurs higher runtime cost than non-LLM detectors, around 8 to 10 seconds per detection, so selective invocation is recommended.
Limitations
Main constraints are added computational latency due to LLM-based reasoning, sensitivity to the quality of rules, and remaining vulnerability to adaptive attacks. The monitor can be influenced if attackers surround a malicious instruction with adaptive text, and benign instructions that resemble malicious ones can produce false positives. Practical deployment therefore requires selective triggering, rule curation, and further study of strong adaptive adversaries.
Why It Matters
AgentWatcher offers a practical, auditable detection layer for LLM agents by combining causal attribution with explicit rule-based reasoning. Its focus on short, causally influential text makes detection scalable to long-context agents and its rule-guided outputs improve interpretability and auditability for security teams. The approach complements prevention methods and can be integrated selectively to protect high-risk actions, though operators must weigh increased latency and the need to harden rule sets against adaptive attacks.