Stateful monitoring catches distributed LLM agent attacks
Agent safety tooling has a blind spot you can drive a truck through. Most monitors score one session at a time. Attackers do not have to play by those rules. This paper shows how to spread a single harmful goal across many accounts so each transcript looks harmless, then ships a defence that actually moves the needle.
How the attack hides in plain sight
The authors build a distributed agent attack using a multi-agent scaffold called ARTEMIS. One orchestrator decomposes a cybersecurity objective into small, low-salience subtasks and delegates them to subagents with limited context. Each subagent sees a sliver of the plan and produces outputs that look like routine coding or admin tasks. No single transcript screams misuse, so a standard per-context monitor lets it through. In tests, those standard monitors caught this style of campaign only about one fifth as often as earlier single-context decomposition attacks, and often only after two or more exploit actions had already fired.
What the defence changes
The proposed monitor goes stateful and online. Every incoming request gets two treatments: a quick local safety score, plus an embedding into a semantic vector space. The system uses stream clustering to group related transcripts across accounts without reprocessing the whole history. Clusters accumulate “weak signals” from many sessions. When a cluster crosses calibrated thresholds (top-k sums and minimum counts), the system escalates to a cross-context judge, a smaller large language model (LLM) that reasons over deduplicated evidence from the cluster.
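The routing and escalation logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names, the cosine-similarity routing, the running-mean centroid update, and every threshold value are assumptions, and "top-k sum" is read here as the sum of the k highest local scores in a cluster.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Cluster:
    centroid: list
    count: int = 0
    scores: list = field(default_factory=list)

    def add(self, embedding, score):
        # Incremental mean update: history is never re-clustered.
        self.count += 1
        self.centroid = [c + (e - c) / self.count
                         for c, e in zip(self.centroid, embedding)]
        self.scores.append(score)

    def should_escalate(self, k, min_count, threshold):
        # Assumed trigger: enough sessions AND a large sum of the k highest scores.
        top_k_sum = sum(sorted(self.scores, reverse=True)[:k])
        return self.count >= min_count and top_k_sum >= threshold


class StatefulMonitor:
    """Hypothetical online monitor; all default thresholds are illustrative."""

    def __init__(self, sim_threshold=0.8, k=3, min_count=3, escalate_threshold=1.2):
        self.sim_threshold = sim_threshold
        self.k = k
        self.min_count = min_count
        self.escalate_threshold = escalate_threshold
        self.clusters = []

    def observe(self, embedding, local_score):
        """Route one request; return True if its cluster should go to the judge."""
        best, best_sim = None, -1.0
        for cluster in self.clusters:
            sim = cosine(cluster.centroid, embedding)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < self.sim_threshold:
            # Nothing similar enough: open a new cluster for this request.
            best = Cluster(centroid=list(embedding))
            self.clusters.append(best)
        best.add(embedding, local_score)
        return best.should_escalate(self.k, self.min_count, self.escalate_threshold)
```

With these illustrative settings, three related sessions scoring 0.4, 0.45 and 0.5 would each pass a per-session cut-off of, say, 0.7, yet together they push their cluster over the escalation threshold on the third request. That is the aggregate effect the paragraph above describes.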
In large-scale simulated datacentre traffic, this stateful approach Pareto-dominated a standard monitor. Under a 10x benign background, it reached around 81.1–87.2% campaign recall at a benign-session false positive rate of roughly 2.35–3.09%, versus 75.0% recall for the baseline. At 100x benign background, it managed 65.8% recall versus 55% for the baseline at comparable false positive rates. Detection also came earlier: at 100x background, the 99th-percentile time-to-detect dropped from 37.2 to 25.4 attacker turns. Latency stayed near-baseline for about 99% of requests because the expensive cross-context reasoning triggers rarely, though there is a heavier p99 tail.
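To make the metrics above concrete, here is a small sketch of how campaign recall, benign-session false positive rate, a percentile time-to-detect, and Pareto dominance could be computed. The function names, input shapes, and the nearest-rank percentile rule are assumptions for illustration; the paper's exact definitions may differ.

```python
import math


def campaign_recall(detect_turn_by_campaign):
    """Fraction of attack campaigns flagged at any point (None = never flagged)."""
    total = len(detect_turn_by_campaign)
    hits = sum(1 for turn in detect_turn_by_campaign.values() if turn is not None)
    return hits / total if total else 0.0


def benign_fpr(flagged_by_session):
    """Fraction of benign sessions that were (wrongly) flagged."""
    total = len(flagged_by_session)
    return sum(flagged_by_session.values()) / total if total else 0.0


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile, e.g. p=99 for a p99 time-to-detect."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def pareto_dominates(a, b):
    """a and b are (recall, fpr) pairs; a dominates b if it is no worse on
    both axes and strictly better on at least one."""
    (recall_a, fpr_a), (recall_b, fpr_b) = a, b
    no_worse = recall_a >= recall_b and fpr_a <= fpr_b
    strictly_better = recall_a > recall_b or fpr_a < fpr_b
    return no_worse and strictly_better


# Made-up example data, not results from the paper.
campaigns = {"c1": 5, "c2": None, "c3": 12, "c4": 30}
benign = {"s1": False, "s2": True, "s3": False, "s4": False}
detected_turns = [t for t in campaigns.values() if t is not None]

print(campaign_recall(campaigns))
print(benign_fpr(benign))
print(percentile_nearest_rank(detected_turns, 99))
```

Recall is counted per campaign while false positives are counted per benign session, matching the units quoted above. A monitor "Pareto dominates" another when it is at least as good on both axes and better on at least one.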
The team ran a two-month red-teaming exercise with more than 100,000 attacks. They found practical evasion tricks like renaming files or paraphrasing to dodge embedding-based routing, and mitigated some of them by routing to multiple related clusters, deduplication, and top-k evidence selection. A nice side-effect: the stateful view caught repeated jailbreak variants reused across accounts.
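One way to read "routing to multiple related clusters" is to attach a request to every cluster whose centroid is similar enough, rather than only the single nearest one, so a paraphrased or renamed subtask that drifts slightly in embedding space still lands next to its siblings. The sketch below assumes cosine similarity, a hypothetical `min_sim` cut-off, and a cap of `max_routes`; none of these come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def route_to_clusters(embedding, centroids, min_sim=0.5, max_routes=3):
    """Return indices of up to max_routes centroids with similarity >= min_sim,
    most similar first (ties broken by lower index)."""
    scored = [(cosine(embedding, c), i) for i, c in enumerate(centroids)]
    eligible = [(s, i) for s, i in scored if s >= min_sim]
    eligible.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in eligible[:max_routes]]


# Toy 3-dimensional centroids; real embeddings would have many more dimensions.
centroids = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
print(route_to_clusters([0.5, 0.8, 0.0], centroids))
```

The trade-off is that a lower `min_sim` makes routing harder to evade but spreads each request's score over more clusters, which is likely to raise the number of escalations.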
Limits remain. Strong obfuscation and robust jailbreaks still hurt recall, gains narrow as benign traffic scales, and cross-account aggregation raises privacy and scalability questions. But the message is clear: if you monitor agents in isolation, you will miss coordinated abuse. Group-level reasoning is the new baseline.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Stateful Online Monitoring Catches Distributed Agent Attacks
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies a blind spot in current safety monitoring for language-model agents: attackers can split a single harmful objective into many low-salience subtasks across different user accounts so that each transcript appears benign. Standard monitors score only one agent context at a time and are therefore structurally unable to detect misuse that is only visible in aggregate across accounts. The threat is concrete because agents can coordinate to perform complex cybersecurity tasks, and distributed misuse has been observed in real incidents where evidence was stitched together from multiple sessions.
Approach
The authors implement a general distributed agent attack using a multi-agent scaffold (ARTEMIS) in which an orchestrator decomposes a cybersecurity objective into innocuous-looking subtasks executed by subagents whose individual contexts are limited. To defend, they design an online stateful monitor that operates in real time: each incoming request is scored by a standard local monitor, embedded into a semantic vector space, and routed to compact online clusters of prior transcripts. When a cluster accumulates sufficient cross-session suspiciousness, the system escalates to a cross-context language-model judge that reasons over assembled evidence; such escalations are rare. The monitor uses stream clustering to avoid re-clustering all history on every request, stores deduplicated buffered contexts per cluster, and triggers escalation based on calibrated top-k sums and minimum counts. The evaluation uses simulated datacentre traffic that mixes distributed attacks into large backgrounds of benign coding and chat agent conversations, with benchmark cybersecurity tasks drawn from Cybench. In the experiments, gpt-5.2 served as subagents, gpt-oss-120b as weak supervisors, and gpt-4.1-mini as the monitor. The authors also ran an extensive two-month white-box red-teaming exercise with >100,000 attacks.
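The per-cluster evidence store mentioned above (deduplicated buffered contexts, with top-k selection before escalation) might look roughly like the following. This is a sketch under stated assumptions: the `EvidenceBuffer` class, the whitespace and case normalisation, and exact-match hashing with SHA-256 are all illustrative choices, not details confirmed by the paper.

```python
import hashlib
import heapq


def normalise(text):
    """Collapse case and whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


class EvidenceBuffer:
    """Hypothetical per-cluster buffer: one entry per distinct context,
    keeping the highest local score seen for each."""

    def __init__(self, top_k=3):
        self.top_k = top_k
        self._best = {}  # digest -> (score, text)

    def add(self, text, score):
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        previous = self._best.get(digest)
        if previous is None or score > previous[0]:
            self._best[digest] = (score, text)

    def evidence(self):
        """Return the top_k highest-scoring distinct contexts for the judge."""
        top = heapq.nlargest(self.top_k, self._best.values(), key=lambda p: p[0])
        return [text for _, text in top]


buf = EvidenceBuffer(top_k=2)
buf.add("Enumerate services on the staging host", 0.4)
buf.add("enumerate  services on the staging host", 0.6)  # duplicate after normalising
buf.add("Summarise the open ports in a table", 0.5)
buf.add("Reformat the README", 0.1)
print(buf.evidence())
```

Exact-match hashing only collapses verbatim or near-verbatim repeats. Paraphrased duplicates would need a similarity-based check instead, which is outside the scope of this sketch.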
Key Findings
- The distributed agent attack completes difficult cybersecurity tasks while concealing the harmful objective across subagents; standard per-context monitors detect these campaigns far less often and later than prior single-context decomposition attacks.
- Standard monitors caught distributed attacks only about one fifth as often as earlier agent attacks, and distributed attacks typically performed two or more exploit actions before detection.
- The online stateful monitor Pareto dominates standard monitors in simulated traffic. Under a 10x benign background it reaches, for example, 81.1–87.2% campaign recall at a 2.35–3.09% benign-session false positive rate, versus 75.0% recall for the standard monitor. At 100x background it achieves 65.8% recall versus 55% at comparable false positive rates.
- Stateful monitoring detects campaigns earlier. At 100x benign background the monitor reduced the 99th-percentile time-to-detect from 37.2 attacker turns to 25.4 while incurring fewer false positives.
- Latency impact is negligible for the majority of users: roughly 99% of requests incur near-standard-monitor latency while only a small fraction trigger expensive cross-context reasoning; however there is a heavier tail at p99.
- Red-teaming exposed failure modes (embedding evasion via renaming or paraphrase and persistent ambiguous cluster memory) that were mitigated by routing to multiple related clusters, deduplication, and top-k evidence selection. The stateful approach also improved detection of repeated jailbreak variants across accounts.
Limitations
The defence does not fully solve distributed misuse: adaptive attackers can still obfuscate tasks or use strong jailbreaks to hide evidence. Performance gains shrink as the volume of benign background traffic grows, and a latency tail remains for some requests. The evaluation uses simulated traffic and selected benchmark tasks, and deployment raises practical concerns about scalability and privacy when aggregating cross-account activity.
Implications
Offensively, attackers can operationalise multi-agent orchestrators to decompose and distribute complex exploits across many accounts so that single-session monitors fail to detect them. Weak supervisory agents can delegate to stronger subagents to recover capabilities while keeping each subagent’s view benign. Attackers may evade embedding-based routing by renaming files or paraphrasing tasks and may attempt monitor-context poisoning or reuse attack variants across accounts to exploit cross-context aggregation. These behaviours underscore that adversaries who control many accounts and can coordinate subtasks pose a material escalation in abuse risk for deployed agent platforms.