Non-malleable authority locks down LLM agent memory
New research targets a blind spot in Large Language Model (LLM) agents: long-term memory that attackers can poison today and cash out later. The authors show content and lineage checks are easy to launder, then ship a non-malleable, origin-bound control that blocks laundering attacks across eight models at zero recorded utility loss.
Long-term memory in Large Language Model (LLM) agents is a gift to anyone who likes delayed gratification. Plant a seed in one session, let it sit, and wait for the agent to act on it later when stakes are higher. Payments, settings, data pulls. The usual fun. A new paper cuts into this class of memory poisoning and, unusually for this area, backs its claims with a formal model and machine-checked proofs.
Three laundering routes
Most current defences try to rate the content itself or follow the derivation chain. The authors show both are malleable in agents. First, self-summarisation: get the agent to paraphrase your dodgy note, and the new summary looks clean while the lineage flips to trusted because the agent wrote it. Second, trusted-tool echo: cause a reliable tool to repeat attacker content, and the echo inherits the tool’s halo. Third, manufactured corroboration: create apparent agreement across sources and watch content scorers and naive lineage systems elevate it as consensus. In their benchmark, laundering pushes attack success up to 68 percent for representative baselines, with direct attacks hitting up to 84 percent in some cases.
The buried lever: bind at write time
The formal model tracks the write to retrieve to act pipeline. The separation result is blunt: no content or lineage policy is sound under laundering; binding an item’s origin at write time is necessary; and non-malleable, origin-bound authority with Sybil-resistant corroboration-gated elevation is sufficient. The construction, TMA-NM, tags memory with origin labels as it is written, propagates those labels non-malleably, and only allows elevation when corroborated by independent trusted principals or by a fresh, action-bound user authorisation. Verdicts land in a tamper-evident log.
On the numbers, TMA-NM hits zero percent attack success on both direct and laundering attacks across all models and channels in the benchmark, while keeping legitimate authorised actions at 100 percent. The act gate is a deterministic check taking about 1.3 microseconds rather than an extra model call that burns roughly two seconds, so latency is a non-issue.
What would I try on an engagement after reading this? Attack the assumptions the authors admit. If write-time origin labelling is wrong or a trusted channel is compromised, the guarantees fall with it. If independent corroborators are scarce, the system falls back to user confirmation, so the social engineering surface moves to the authorisation channel. Answer bias for non-consequential outputs is out of scope, leaving room for soft steering that never touches a gated tool. Value-level capability tokens and unbounded proofs are future work, so watch for authority leakage inside structured payloads and for edge cases at scale.
Credit where it is due: this is one of the few agent memory defences that names the laundering tricks and then actually closes them in both proof and experiments. The benchmark, harness and TLA+ models are public, which means the rest of us can check, and try to break, the story.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies memory poisoning in long‑term memory used by LLM agents. An adversary can write untrusted content in one session that later causes consequential actions in another session, such as payments, setting changes or data exfiltration. Existing defences rely on either content inspection or derivation history, but both signals are malleable: an attacker can launder an untrusted origin so content looks benign or so lineage edges appear trusted. The paper formalises these laundering channels and shows why current approaches can fail.
Approach
The authors formalise the write→retrieve→act pipeline, define three LLM‑specific laundering channels (self‑summarisation, trusted‑tool echo, manufactured corroboration) and model authority as labels bound to an item’s origin at write time. They prove a machine‑checked separation theorem showing malleability of content‑ and lineage‑based defences, and that non‑malleable, origin‑bound authority with Sybil‑resistant corroboration is sufficient. They construct TMA‑NM (Tamper‑evident Memory Authority, Non‑Malleable), a monitor that records origin labels at write time, propagates authority non‑malleably, requires corroboration from independent trusted principals or a fresh, action‑bound user authorisation to elevate untrusted items, and appends verdicts to a tamper‑evident log. They evaluate across a cross‑defence, cross‑attack benchmark spanning 12 domains and five consequential tool types, running experiments on eight frontier models from six vendors. The formal model and proofs are machine‑checked in TLA+ and the benchmark and harness are released.
Key Findings
- Content and lineage defences are malleable: a constructed laundering attack defeats representative systems, producing laundering attack success up to 68% and direct‑attack success up to 84% for some baselines.
- Write‑time origin binding is necessary: the model and exhaustive checks show that defences that do not bind authority at write time admit reachable violations of the security invariant.
- TMA‑NM is sufficient in practice: across eight models and all laundering channels, TMA‑NM reached 0% attack success on both direct and laundering attacks while preserving full legitimate utility (100% authorised legitimate actions in the unified benchmark).
- Performance and deployability: the act gate is a simple deterministic check costing about 1.3 microseconds per decision versus roughly 2000 milliseconds for a content‑detection judge, so it adds negligible latency and no extra model calls.
- Mechanisms are complementary: origin binding prevents self‑summarisation and tool‑echo laundering, corroboration gating prevents manufactured corroboration, and the tamper‑evident log provides auditability without affecting latency.
Limitations
The guarantee depends on correct write‑time origin labelling (assumption A1); a compromised trusted channel undermines labels. Answer bias for non‑consequential responses is not prevented by design. Value‑level attribution (capability tokens for nested structured payloads) and a fully mechanised unbounded proof are left to future work. Availability of independent corroborators is deployment dependent and raises a user confirmation cost when insufficient corroboration exists.
Implications
Offensive implications are that an attacker can launder untrusted inputs into persistent agent memory by paraphrase, by causing trusted tools to echo attacker content, or by fabricating corroboration to trigger consequential actions later. With malleable defences, such laundering can enable payments, data exfiltration and control‑flow hijacks across sessions. Defeating TMA‑NM would require forging origin labels, compromising two or more genuinely independent trusted principals, or breaking the user authorisation channel, which raises the attacker effort and scope required for successful attacks.