Stateful monitoring catches distributed LLM agent attacks

Published: Mon, Jun 01, 2026 • By Rowan Vale
New research shows attackers can split a harmful objective across many accounts so each agent session looks clean. The authors build such a distributed attack and propose an online, stateful monitor that clusters transcripts and escalates rarely to an LLM judge. It detects campaigns about 30% earlier with low false positives and minimal latency for most users.

Agent safety tooling has a blind spot you can drive a truck through. Most monitors score one session at a time. Attackers do not have to play by those rules. This paper shows how to spread a single harmful goal across many accounts so each transcript looks harmless, then ships a defence that actually moves the needle.

How the attack hides in plain sight

The authors build a distributed agent attack using a multi-agent scaffold called ARTEMIS. One orchestrator decomposes a cybersecurity objective into small, low-salience subtasks and delegates them to subagents with limited context. Each subagent sees a sliver of the plan and produces outputs that look like routine coding or admin tasks. No single transcript screams misuse, so a standard per-context monitor lets it through. In tests, those standard monitors caught this style of campaign only about one fifth as often as earlier single-context decomposition attacks, and often only after two or more exploit actions had already fired.

What the defence changes

The proposed monitor goes stateful and online. Every incoming request gets two treatments: a quick local safety score, plus an embedding into a semantic vector space. The system uses stream clustering to group related transcripts across accounts without reprocessing the whole history. Clusters accumulate “weak signals” from many sessions. When a cluster crosses calibrated thresholds (top-k sums and minimum counts), the system escalates to a cross-context judge, a smaller Large Language Model (LLM) that reasons over deduplicated evidence from the cluster.
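The escalation rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Cluster` class, the `should_escalate` function, and every threshold value (`k`, `sum_threshold`, `min_count`) are assumptions chosen for the example, since the article only says the thresholds are calibrated.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Running evidence for one group of related transcripts (hypothetical structure)."""
    scores: list[float] = field(default_factory=list)

    def add(self, score: float) -> None:
        # Record the local safety score of one session routed to this cluster.
        self.scores.append(score)


def should_escalate(cluster: Cluster, k: int = 3,
                    sum_threshold: float = 1.5, min_count: int = 4) -> bool:
    """Escalate only when the cluster holds enough sessions AND the
    k highest local scores together exceed the threshold.
    All default values are illustrative assumptions."""
    if len(cluster.scores) < min_count:
        return False
    top_k_sum = sum(heapq.nlargest(k, cluster.scores))
    return top_k_sum >= sum_threshold


# Five sessions, none of which looks alarming on its own.
campaign = Cluster()
for s in [0.4, 0.5, 0.45, 0.3, 0.6]:
    campaign.add(s)
print(should_escalate(campaign))
```

With these assumed values, no single score would trip a per-session threshold of, say, 0.8, yet the three strongest signals together clear the cluster threshold. The minimum-count check keeps a single noisy session from triggering the expensive judge.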

In large simulated datacentre traffic, this stateful approach Pareto dominated a standard monitor. Under a 10x benign background, it reached around 81.1–87.2% campaign recall at a benign-session false positive rate of roughly 2.35–3.09%, versus 75.0% recall for the baseline. At 100x benign background, it managed 65.8% recall versus 55% at comparable false positive rates. Detection also came earlier: at 100x background, the 99th-percentile time-to-detect dropped from 37.2 to 25.4 attacker turns. Latency stayed near-baseline for about 99% of requests because the expensive cross-context reasoning triggers rarely, though there is a heavier p99 tail.
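As a quick sanity check on the 100x-background figures, the arithmetic can be worked through directly. The helper name `relative_reduction` is an assumption for illustration, and the paper's headline "30% earlier" figure may be computed over a different aggregate than this single pair of p99 values.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Figures quoted for the 100x benign background setting.
p99_baseline_turns = 37.2
p99_stateful_turns = 25.4
reduction = relative_reduction(p99_baseline_turns, p99_stateful_turns)

# Recall gain in percentage points (65.8% stateful versus 55% baseline).
recall_gain_points = 65.8 - 55.0

print(f"p99 time-to-detect reduction: {reduction:.1%}")
print(f"recall gain at 100x background: {recall_gain_points:.1f} points")
```

The p99 time-to-detect falls by roughly 31.7%, which is in line with the "about 30% earlier" summary, and recall improves by about 10.8 percentage points.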

The team ran a two-month red-teaming exercise with more than 100,000 attacks. They found practical evasion tricks, such as renaming files or paraphrasing to dodge embedding-based routing, and mitigated some of them by routing to multiple related clusters, deduplication, and top-k evidence selection. A nice side-effect: the stateful view caught repeated jailbreak variants reused across accounts.

Limits remain. Strong obfuscation and robust jailbreaks still hurt recall, gains narrow as benign traffic scales, and cross-account aggregation raises privacy and scalability questions. But the message is clear: if you monitor agents in isolation, you will miss coordinated abuse. Group-level reasoning is the new baseline.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Stateful Online Monitoring Catches Distributed Agent Attacks

Authors: Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong, Ivan Zhang, Matan Shtepel, Steffi Chern, Alexander Robey, Eric Wong, and Hamed Hassani
Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies a blind spot in current safety monitoring for language-model agents: attackers can split a single harmful objective into many low-salience subtasks across different user accounts so that each transcript appears benign. Standard monitors score only one agent context at a time and are therefore structurally unable to detect misuse that is only visible in aggregate across accounts. The threat is concrete because agents can coordinate to perform complex cybersecurity tasks, and distributed misuse has been observed in real incidents where evidence was stitched together from multiple sessions.

Approach

The authors implement a general distributed agent attack using a multi-agent scaffold (ARTEMIS) in which an orchestrator decomposes a cybersecurity objective into innocuous-looking subtasks executed by subagents whose individual contexts are limited. To defend, they design an online stateful monitor that operates in real time: each incoming request is scored by a standard local monitor, embedded into a semantic vector space, and routed to compact online clusters of prior transcripts. When a cluster accumulates sufficient cross-session suspiciousness, the system escalates, though only rarely, to a cross-context language-model judge that reasons over the assembled evidence. The monitor uses stream clustering to avoid re-clustering all history on every request, stores deduplicated buffered contexts per cluster, and triggers escalation based on calibrated top-k sums and minimum counts. The evaluation uses simulated datacentre traffic that mixes distributed attacks into large backgrounds of benign coding and chat agent conversations, with benchmark cybersecurity tasks drawn from Cybench. In the experiments, gpt-5.2 served as the subagents, gpt-oss-120b as the weak supervisors, and gpt-4.1-mini as the monitor. The authors also ran an extensive two-month white-box red-teaming exercise with >100,000 attacks.
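The routing step, where each embedded request joins an existing cluster or starts a new one without re-clustering history, can be sketched as follows. This is an illustrative nearest-centroid scheme under stated assumptions: the `StreamClusters` class, the cosine-similarity rule, the running-mean centroid update, and the `join_threshold` value are all hypothetical, and the paper's actual stream clustering algorithm may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StreamClusters:
    """Online nearest-centroid clustering (an assumed, simplified scheme)."""

    def __init__(self, join_threshold: float = 0.8):
        self.join_threshold = join_threshold  # illustrative value
        self.centroids: list[list[float]] = []
        self.counts: list[int] = []

    def route(self, embedding: list[float]) -> int:
        """Return the index of the cluster this embedding joins,
        creating a new cluster when nothing is similar enough."""
        best_idx, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx == -1 or best_sim < self.join_threshold:
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Incremental mean update: only the matched centroid is touched.
        n = self.counts[best_idx] + 1
        old = self.centroids[best_idx]
        self.centroids[best_idx] = [c + (x - c) / n for c, x in zip(old, embedding)]
        self.counts[best_idx] = n
        return best_idx


clusters = StreamClusters()
print(clusters.route([1.0, 0.0]))   # first request starts a cluster
print(clusters.route([0.9, 0.1]))   # similar request joins it
print(clusters.route([0.0, 1.0]))   # unrelated request starts another
```

Each request costs one pass over the current centroids and one centroid update, which is what keeps the per-request overhead small compared with re-clustering the full history.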

Key Findings

  • The distributed agent attack completes difficult cybersecurity tasks while concealing the harmful objective across subagents; standard per-context monitors detect these campaigns far less often and later than prior single-context decomposition attacks.
  • Standard monitors caught distributed attacks only about one fifth as often as earlier agent attacks, and distributed attacks typically performed two or more exploit actions before detection.
  • The online stateful monitor Pareto dominates standard monitors in simulated traffic. Under a 10x benign background it reaches 81.1–87.2% campaign recall at a 2.35–3.09% benign-session false positive rate, versus 75.0% recall for the standard monitor; at 100x background it achieves 65.8% versus 55% recall at comparable false positive rates.
  • Stateful monitoring detects campaigns earlier. At 100x benign background the monitor reduced the 99th-percentile time-to-detect from 37.2 attacker turns to 25.4 while producing fewer false positives.
  • Latency impact is negligible for the majority of users: roughly 99% of requests incur near-standard-monitor latency, and only a small fraction trigger expensive cross-context reasoning. There is, however, a heavier tail at p99.
  • Red-teaming exposed failure modes (embedding evasion via renaming or paraphrase and persistent ambiguous cluster memory) that were mitigated by routing to multiple related clusters, deduplication, and top-k evidence selection. The stateful approach also improved detection of repeated jailbreak variants across accounts.
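The three mitigations named in the last finding (routing to multiple related clusters, deduplication, and top-k evidence selection) can be sketched together. Everything here is an assumption for illustration: the function names, the exact-hash deduplication after whitespace and case normalisation, and the `threshold`, `max_routes`, and `k` values are hypothetical, not the paper's design.

```python
import hashlib
import heapq


def fingerprint(text: str) -> str:
    """Hash of the text after collapsing whitespace and case (assumed dedup key)."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def related_clusters(similarities: dict[str, float],
                     threshold: float = 0.6, max_routes: int = 3) -> list[str]:
    """Route to every cluster that clears the threshold (best first),
    rather than only the single nearest one."""
    hits = [(sim, cid) for cid, sim in similarities.items() if sim >= threshold]
    return [cid for _, cid in heapq.nlargest(max_routes, hits)]


def select_evidence(items: list[tuple[float, str]], k: int = 3) -> list[str]:
    """Drop duplicate transcripts, then keep the k highest-scoring ones
    as the evidence handed to the cross-context judge."""
    best: dict[str, tuple[float, str]] = {}
    for score, text in items:
        key = fingerprint(text)
        if key not in best or score > best[key][0]:
            best[key] = (score, text)
    return [text for _, text in heapq.nlargest(k, best.values())]


print(related_clusters({"recon": 0.9, "exfil": 0.65, "chat": 0.2}))
print(select_evidence(
    [(0.5, "Scan the host"), (0.7, "scan  the HOST"),
     (0.3, "list files"), (0.6, "read config")],
    k=2,
))
```

Routing to several clusters means a renamed or paraphrased request that drifts away from its true cluster still has a chance of landing in it. Note that exact-hash deduplication only removes near-verbatim repeats; catching paraphrased duplicates would need a similarity-based check.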

Limitations

The defence does not fully solve distributed misuse: adaptive attackers can still obfuscate tasks or use strong jailbreaks to hide evidence. Performance gains shrink as benign background traffic grows very large, and a latency tail remains for some requests. The evaluation uses simulated traffic and selected benchmark tasks, and deployment raises practical concerns about scalability and privacy when aggregating cross-account activity.

Implications

Offensively, attackers can operationalise multi-agent orchestrators to decompose and distribute complex exploits across many accounts so that single-session monitors fail to detect them. Weak supervisory agents can delegate to stronger subagents to recover capabilities while keeping each subagent’s view benign. Attackers may evade embedding-based routing by renaming files or paraphrasing tasks and may attempt monitor-context poisoning or reuse attack variants across accounts to exploit cross-context aggregation. These behaviours underscore that adversaries who control many accounts and can coordinate subtasks pose a material escalation in abuse risk for deployed agent platforms.

