Zombie Agents Hijack LLM Memory Across Sessions
Agents
Many Large Language Model (LLM) agents now keep long‑term memory to handle multi‑session work. It helps with continuity and long‑horizon tasks. It also opens a door. This paper shows how an attacker can leave behind a small instruction that the agent dutifully writes to its memory, then later treats as policy. The authors call the resulting compromise a Zombie Agent, which is a pleasingly blunt name for a gnarly problem.
What the study tests
The threat model is strict: the attacker cannot touch model weights, tools, or memory directly. They can only publish web content the agent might read while doing a normal task. The attack has two phases. Infection happens when the agent browses a poisoned page and, through its usual update process, stores the payload as long‑term memory. Trigger happens in a later, unrelated session when that memory is retrieved and nudges the agent into unauthorised tool use or data exfiltration.
The work targets two common memory designs. For sliding‑window memory, where older entries fall off a finite context, the payload uses recursive renewal so the agent keeps rewriting the instruction forward to avoid eviction. For retrieval‑augmented generation (RAG) memory backed by a vector database, the payload spreads using semantic aliasing and embedding pollution so that many queries pull it back in despite relevance filtering.
Evaluation uses two commercial LLMs tuned for tool use and realistic web‑style tasks. The authors compare their approach with standard indirect prompt injections. They measure persistence over time and the ability to induce unauthorised actions while retaining benign task quality. This is where the work shines: it ties mechanism‑specific strategies to measurable outcomes, not anecdotes.
Findings and limits
The Zombie Agent methods beat baseline injections. On sliding‑window agents, recursive renewal kept the malicious content present throughout the experiment, with a 100% retention rate while baselines quickly disappeared once the window filled. On RAG agents, the approach stored roughly 2.5 times more payload copies than baselines and saturated top‑K retrievals, for example returning about 23 malicious entries at K=50. That uplift translated into higher attack success.
System tweaks helped but did not close the hole. Raw‑history memory updates delivered the highest attack success at around 77%. Mitigation‑like update modes pushed success down to the low double digits in some settings, but attacks still landed. Prompt‑based guardrails, including sandwich and instructional styles, shaved roughly 10–15 percentage points off success rates, yet left them well above 60% in tested configurations. Case studies in healthcare and e‑commerce make the risks concrete: covert leakage of patient identifiers and automated fraudulent purchases or credential theft driven by persisted payloads.
The study is careful about scope. It assumes a fixed toolset and a black‑box adversary. Results come from two models and specific memory and retrieval configurations, averaged over three runs. Defences explored are mostly prompt‑level. That is reasonable for isolating the memory effect, but it leaves open how robust the findings are under stricter system controls or different operational policies.
Why it matters is clear: once untrusted content becomes trusted memory, per‑session filtering is not enough. Memory itself becomes part of the trusted computing base. In practice, that points to provenance and signing for memory writes, sanitisation and tamper‑evident logging of memory updates, decay or forgetting policies to limit long‑lived implants, and cross‑session anomaly and policy checks on tool‑invoking behaviour. The next wave of agent security will need to watch not just what goes into the prompt today, but what quietly lingers and reappears tomorrow.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies a new persistent threat to self-evolving LLM agents that store and reuse long-term memory across sessions. Untrusted external content seen during a benign session can be written into that memory and later treated as instruction, enabling a persistent compromise the authors call a Zombie Agent: an agent that appears normal but retains a covert payload that can be triggered in later sessions to perform unauthorised tool actions or exfiltrate data.
Approach
The authors formalise a black-box, two-phase attack comprising Infection and Trigger phases. During Infection, an agent browsing attacker-controlled web content ingests a poisoned observation which the agent's normal memory-evolution function commits to long-term storage. During Trigger, a later, unrelated session retrieves the poisoned entry and the payload causes unauthorised actions. The work targets two common memory designs: sliding-window (finite FIFO context) and retrieval-augmented generation (RAG) with a vector database. Mechanism-specific persistence strategies are proposed: recursive self-replication to avoid FIFO eviction, and semantic aliasing plus embedding pollution to maximise retrieval across semantically unrelated queries. The threat model assumes a strict black-box attacker who can only publish external content and cannot modify model weights, memory directly, or toolsets. The attack is evaluated on two commercial LLMs optimised for tool use, using a baited Exposure Phase and a Trigger Phase on realistic web-style queries; metrics include Attack Success Rate, context retention for sliding windows, and injection/recall metrics for RAG. Several baseline indirect prompt-injection strategies are compared.
Key Findings
- The Zombie Agent framework significantly outperforms standard indirect prompt-injection baselines on both sliding-window and RAG agents, achieving sustained attack success across many trigger rounds.
- For sliding-window agents, recursive renewal preserved the malicious payload through truncation, yielding a 100% retention rate across the experiment where baseline injections rapidly vanished once the window filled.
- For RAG agents the method produced aggressive embedding pollution, storing roughly 2.5 times more payload copies than baselines (example counts reported) and saturating top-K retrievals (e.g., retrieving about 23 malicious entries at K=50), which substantially increased recall and attack success.
- Evolving memory update strategies reduce but do not eliminate risk: raw-history updates gave the highest attack success (~77%), while mitigation-like update modes reduced success to the low double digits in some settings, yet non-trivial attack execution remained possible.
- Prompt-based guardrails (Sandwich, Instructional, Spotlight) reduced success only modestly (a drop of roughly 10–15 percentage points), leaving attack success well above 60% in tested configurations.
- Qualitative case studies in healthcare and e‑commerce show realistic harms: covert exfiltration of patient identifiers and automated fraudulent purchases or credential theft driven by persisted payloads.
Limitations
The threat model assumes a black-box attacker limited to publishing external content and a fixed toolset. Experiments were performed on two specific commercial models and on particular memory and retrieval configurations with results averaged over three runs; generalisability to other models, tool permissions or adaptive defenders was not exhaustively evaluated. Tested defences were primarily prompt-based and may not represent all possible system-level controls.
Why It Matters
Memory evolution expands the attack surface: once malicious content is consolidated into trusted memory it can bypass per-session input filters and act as an insider threat. Practical security implications include the need to treat memory as part of the trusted computing base, attach provenance and signatures to updates, sanitise and log memory writes, apply decay or forgetting policies, and add cross-session anomaly and policy checks on tool-invoking behaviour. Without such memory-level controls, self-evolving agents remain vulnerable to long-lived, covert compromise.