ShortSpan distils the AI-security research that matters into practitioner takeaways, edited by Ben Williams (NCC Group).

Non-malleable authority locks down LLM agent memory

New research targets a blind spot in Large Language Model (LLM) agents: long-term memory that attackers can poison today and cash out later. The authors show content and lineage checks are easy to launder, then ship a non-malleable, origin-bound control that blocks laundering attacks across eight models at zero recorded utility loss.

Long-term memory in LLM agents is a gift to anyone who likes delayed gratification. Plant a seed in one session, let it sit, and wait for the agent to act on it later when the stakes are higher. Payments, settings, data pulls. The usual fun. A new paper cuts into this class of memory poisoning and, unusually for this area, backs its claims with a formal model and machine-checked proofs.

Three laundering routes

Most current defences try to rate the content itself or follow the derivation chain. The authors show both are malleable in agents. First, self-summarisation: get the agent to paraphrase your dodgy note, and the new summary looks clean while the lineage flips to trusted because the agent wrote it. Second, trusted-tool echo: cause a reliable tool to repeat attacker content, and the echo inherits the tool’s halo. Third, manufactured corroboration: create apparent agreement across sources and watch content scorers and naive lineage systems elevate it as consensus. In their benchmark, laundering pushes attack success up to 68 percent for representative baselines, with direct attacks hitting up to 84 percent in some cases.
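To make the first route concrete, here is a minimal sketch of why a lineage check that looks only at the latest derivation edge can be laundered by self-summarisation, while a check bound to the original source cannot. Everything in it is an illustrative assumption rather than the paper's code: the `Item` type, the `TRUSTED_WRITERS` set, and the `summarise` stand-in are hypothetical names.

```python
from dataclasses import dataclass

TRUSTED_WRITERS = {"user", "agent", "calendar_tool"}  # hypothetical trust list


@dataclass(frozen=True)
class Item:
    text: str
    writer: str          # who performed the most recent write
    origins: frozenset   # every principal whose content flowed into the text


def ingest(text: str, source: str) -> Item:
    """Store content as received: the source is both writer and sole origin."""
    return Item(text, writer=source, origins=frozenset({source}))


def summarise(item: Item) -> Item:
    """Stand-in for self-summarisation: the agent rewrites the note."""
    return Item("Reminder: " + item.text.lower(), writer="agent", origins=item.origins)


def lineage_trusts(item: Item) -> bool:
    """Naive lineage check: looks only at the latest derivation edge."""
    return item.writer in TRUSTED_WRITERS


def origin_trusts(item: Item) -> bool:
    """Origin-bound check: every contributing origin must be trusted."""
    return item.origins <= TRUSTED_WRITERS


poisoned = ingest("PAY INVOICE 991 TO ACCOUNT X", source="web_page")
laundered = summarise(poisoned)

print(lineage_trusts(poisoned), lineage_trusts(laundered))  # False True
print(origin_trusts(poisoned), origin_trusts(laundered))    # False False
```

The paraphrase changes both the text and the writer, which is all a content scorer or an edge-level lineage check gets to see. The origin set is the one thing the summary cannot shed.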

The buried lever: bind at write time

The formal model tracks the write-retrieve-act pipeline. The separation result is blunt: no content or lineage policy is sound under laundering; binding an item’s origin at write time is necessary; and non-malleable, origin-bound authority with Sybil-resistant, corroboration-gated elevation is sufficient. The construction, TMA-NM, tags memory with origin labels as it is written, propagates those labels non-malleably, and only allows elevation when the item is corroborated by independent trusted principals or covered by a fresh, action-bound user authorisation. Verdicts land in a tamper-evident log.
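A rough sketch of what such an act gate might look like follows. It is not the authors' implementation: the `MemoryItem` type, the `TRUSTED` set, the `may_act` signature and the threshold of two corroborators are assumptions made for illustration, and the freshness of the user authorisation is not modelled here.

```python
from dataclasses import dataclass

TRUSTED = {"user", "bank_api", "hr_system"}  # hypothetical trusted principals
MIN_CORROBORATORS = 2                         # assumed threshold


@dataclass(frozen=True)
class MemoryItem:
    text: str
    origins: frozenset  # bound at write time, never rewritten afterwards


def write(text: str, origin: str) -> MemoryItem:
    """Bind the origin label at the moment the item enters memory."""
    return MemoryItem(text, frozenset({origin}))


def derive(text: str, *parents: MemoryItem) -> MemoryItem:
    """Summaries and tool echoes inherit the union of their parents' origins."""
    origins = frozenset().union(*(p.origins for p in parents))
    return MemoryItem(text, origins)


def may_act(item, action, corroborators=frozenset(), authorised_action=None) -> bool:
    """Deterministic gate: no model call, only set checks."""
    if item.origins <= TRUSTED:
        return True
    # Only trusted principals that did not contribute to the item count,
    # so attacker-created identities add nothing.
    independent = (set(corroborators) & TRUSTED) - item.origins
    if len(independent) >= MIN_CORROBORATORS:
        return True
    # Fall back to a user authorisation bound to this exact action.
    return authorised_action == action


note = write("wire 5000 to account X", origin="web_page")
summary = derive("Reminder: wire 5000 to account X", note)
action = "pay:account X:5000"

print(may_act(summary, action))                                            # False
print(may_act(summary, action, corroborators={"sock1", "sock2", "sock3"}))  # False
print(may_act(summary, action, corroborators={"bank_api", "hr_system"}))    # True
print(may_act(summary, action, authorised_action=action))                   # True
```

The design choice worth noticing is that elevation never rewrites the label. The item stays marked as web-sourced, and the gate decides per action whether that is acceptable.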

On the numbers, TMA-NM hits zero percent attack success on both direct and laundering attacks across all models and channels in the benchmark, while keeping legitimate authorised actions at 100 percent. The act gate is a deterministic check taking about 1.3 microseconds rather than an extra model call that burns roughly two seconds, so latency is a non-issue.
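The latency gap is easier to appreciate as arithmetic. The snippet below uses only the two figures quoted above; the workload of 50 gated actions per session is a made-up number for illustration.

```python
GATE_SECONDS = 1.3e-6      # about 1.3 microseconds per act-gate check (figure quoted above)
JUDGE_SECONDS = 2.0        # roughly two seconds per extra model call (figure quoted above)
ACTIONS_PER_SESSION = 50   # hypothetical workload, not from the paper

ratio = JUDGE_SECONDS / GATE_SECONDS
gate_total = ACTIONS_PER_SESSION * GATE_SECONDS
judge_total = ACTIONS_PER_SESSION * JUDGE_SECONDS

print(f"model-call judge is about {ratio:.2e} times slower per decision")
print(f"gate:  {gate_total * 1e6:.0f} microseconds per session")
print(f"judge: {judge_total:.0f} seconds per session")
```

On those figures the deterministic check is roughly six orders of magnitude cheaper per decision.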

What would I try on an engagement after reading this? Attack the assumptions the authors admit. If write-time origin labelling is wrong or a trusted channel is compromised, the guarantees fall with it. If independent corroborators are scarce, the system falls back to user confirmation, so the social engineering surface moves to the authorisation channel. Answer bias for non-consequential outputs is out of scope, leaving room for soft steering that never touches a gated tool. Value-level capability tokens and unbounded proofs are future work, so watch for authority leakage inside structured payloads and for edge cases at scale.

Credit where it is due: this is one of the few agent memory defences that names the laundering tricks and then actually closes them in both proof and experiments. The benchmark, harness and TLA+ models are public, which means the rest of us can check, and try to break, the story.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees

Authors: Yedidel Louck
LLM agents increasingly rely on persistent long-term memory, which creates a critical vulnerability that we study here: memory poisoning. An adversary can store untrusted content in one session that later steers a consequential action, such as a payment, a setting change, or data exfiltration, in a future session. Existing defenses base a memory item's authority to act on either its content (detection or trust-scoring) or its derivation history (lineage). We show that both signals are malleable. An attacker can launder an untrusted origin through three channels specific to LLM agents: the agent's own summarization, a trusted-tool echo, and manufactured corroboration. Each makes the content look benign and breaks or flips its derivation edge to "trusted." We formalize malleability for the memory write-retrieve-act pipeline and prove a machine-checked separation theorem. No content- or lineage-based defense is sound under laundering (T1), write-time origin binding is necessary (T2), and non-malleable origin-bound authority with Sybil-resistant corroboration-gated elevation is sufficient (T3). Our construction, TMA-NM (Tamper-evident Memory Authority, Non-Malleable), instantiates non-malleable information-flow control (IFC) for LLM-agent memory. A cross-defense, cross-attack, and cross-model benchmark over eight frontier models shows that existing defenses fail exactly where the theory predicts (up to 68% laundering attack-success), while TMA-NM reaches 0% attack success on both direct and laundering attacks across all models and channels, at full legitimate utility. We release the benchmark, harness, and machine-checked TLA+ models to support reproducibility.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies memory poisoning in long‑term memory used by LLM agents. An adversary can write untrusted content in one session that later causes consequential actions in another session, such as payments, setting changes or data exfiltration. Existing defences rely on either content inspection or derivation history, but both signals are malleable: an attacker can launder an untrusted origin so content looks benign or so lineage edges appear trusted. The paper formalises these laundering channels and shows why current approaches can fail.

Approach

The authors formalise the write→retrieve→act pipeline, define three LLM‑specific laundering channels (self‑summarisation, trusted‑tool echo, manufactured corroboration) and model authority as labels bound to an item’s origin at write time. They prove a machine‑checked separation theorem showing malleability of content‑ and lineage‑based defences, and that non‑malleable, origin‑bound authority with Sybil‑resistant corroboration is sufficient. They construct TMA‑NM (Tamper‑evident Memory Authority, Non‑Malleable), a monitor that records origin labels at write time, propagates authority non‑malleably, requires corroboration from independent trusted principals or a fresh, action‑bound user authorisation to elevate untrusted items, and appends verdicts to a tamper‑evident log. They evaluate across a cross‑defence, cross‑attack benchmark spanning 12 domains and five consequential tool types, running experiments on eight frontier models from six vendors. The formal model and proofs are machine‑checked in TLA+ and the benchmark and harness are released.

Key Findings

  • Content and lineage defences are malleable: constructed laundering attacks defeat representative systems, with laundering attack success of up to 68%; direct attacks succeed up to 84% of the time against some baselines.
  • Write‑time origin binding is necessary: the model and exhaustive checks show that defences that do not bind authority at write time admit reachable violations of the security invariant.
  • TMA‑NM is sufficient in practice: across eight models and all laundering channels, TMA‑NM reached 0% attack success on both direct and laundering attacks while preserving full legitimate utility (100% authorised legitimate actions in the unified benchmark).
  • Performance and deployability: the act gate is a simple deterministic check costing about 1.3 microseconds per decision versus roughly 2000 milliseconds for a content‑detection judge, so it adds negligible latency and no extra model calls.
  • Mechanisms are complementary: origin binding prevents self‑summarisation and tool‑echo laundering, corroboration gating prevents manufactured corroboration, and the tamper‑evident log provides auditability without affecting latency.
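The analysis above does not say how the tamper-evident log is built. One common way to get that property is a hash chain, sketched here under that assumption; the `append` and `verify` helpers and the verdict fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that anchors the first entry


def _digest(prev_hash: str, verdict: dict) -> str:
    """Hash the verdict together with the previous entry's hash."""
    payload = json.dumps(verdict, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log: list, verdict: dict) -> None:
    """Add a verdict whose hash commits to everything logged before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"verdict": verdict, "prev": prev_hash, "hash": _digest(prev_hash, verdict)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["verdict"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"action": "pay", "item": "m1", "allowed": False})
append(log, {"action": "read_calendar", "item": "m2", "allowed": True})
print(verify(log))  # True

log[0]["verdict"]["allowed"] = True  # an attacker rewrites an old verdict
print(verify(log))  # False
```

A chain like this only detects edits if the latest hash is also kept somewhere the attacker cannot rewrite. Appending is a single hash per verdict, so it stays off the critical path in the way the bullet above describes.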

Limitations

The guarantee depends on correct write‑time origin labelling (assumption A1); a compromised trusted channel undermines the labels. Answer bias for non‑consequential responses is out of scope by design. Value‑level attribution (capability tokens for nested structured payloads) and a fully mechanised unbounded proof are left to future work. The availability of independent corroborators depends on the deployment; where corroboration is insufficient, the system falls back to user confirmation, which adds a cost for the user.

Implications

On the offensive side, an attacker can launder untrusted inputs into persistent agent memory by paraphrase, by causing trusted tools to echo attacker content, or by fabricating corroboration, and so trigger consequential actions later. Against malleable defences, such laundering can enable payments, data exfiltration and control‑flow hijacks across sessions. Defeating TMA‑NM would require forging origin labels, compromising two or more genuinely independent trusted principals, or breaking the user authorisation channel, all of which raise the effort and scope an attacker needs.
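Since the user authorisation channel is one of the three remaining targets, it helps to see what "fresh" and "action-bound" could mean in practice. The sketch below is one possible reading, not the paper's mechanism: an HMAC over the exact action string, a single-use nonce and a short expiry window. `KEY`, `MAX_AGE_SECONDS`, `authorise` and `check` are all hypothetical.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

KEY = secrets.token_bytes(32)  # hypothetical secret held by the authorisation channel
MAX_AGE_SECONDS = 60           # assumed freshness window
_used_nonces = set()           # nonces already redeemed


def _mac(action: str, nonce: str, issued_at: int) -> str:
    msg = f"{action}|{nonce}|{issued_at}".encode("utf-8")
    return hmac.new(KEY, msg, hashlib.sha256).hexdigest()


def authorise(action: str, now: Optional[float] = None) -> dict:
    """Issue a token that covers exactly one action string."""
    issued_at = int(time.time() if now is None else now)
    nonce = secrets.token_hex(16)
    return {"nonce": nonce, "issued_at": issued_at, "mac": _mac(action, nonce, issued_at)}


def check(action: str, token: dict, now: Optional[float] = None) -> bool:
    """Accept only a matching, unexpired, unused token."""
    now = time.time() if now is None else now
    expected = _mac(action, token["nonce"], token["issued_at"])
    if not hmac.compare_digest(expected, token["mac"]):
        return False  # issued for a different action, or forged
    if now - token["issued_at"] > MAX_AGE_SECONDS:
        return False  # stale
    if token["nonce"] in _used_nonces:
        return False  # replay
    _used_nonces.add(token["nonce"])
    return True


token = authorise("pay:acct-42:100", now=1000)
print(check("pay:acct-42:9999", token, now=1010))  # False: different amount
print(check("pay:acct-42:100", token, now=1010))   # True
print(check("pay:acct-42:100", token, now=1020))   # False: already used

stale = authorise("pay:acct-42:100", now=1000)
print(check("pay:acct-42:100", stale, now=2000))   # False: too old
```

None of this helps if the user can be talked into approving the wrong action, which is why the social engineering surface moves to this channel rather than disappearing.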
