ShortSpan.ai logo

Memory Poisoning Exposes Containment Gap in LLM Agents

Agents
Published: Fri, Jun 12, 2026 • By Marcus Halden
Memory Poisoning Exposes Containment Gap in LLM Agents
Agentic Large Language Model (LLM) frameworks ship without structural safeguards the public sector would expect. In a realistic welfare bot, a single poisoned memory entry caused persistent, targeted denials and evaded aggregate monitoring. The authors show clean, low-latency gates can stop it, arguing current agent frameworks are not secure by default.

Agentic systems are leaving the lab and fronting public services. That makes a simple question urgent: do our frameworks contain the agent, or merely connect the parts? This study looks past model output filters and inspects the scaffolding itself. The authors define six architectural containment principles for Large Language Model (LLM) agents — clean separation of planning and execution, scoped capabilities, memory integrity, validated layer transitions, authenticated channels and runtime monitoring — then ask whether popular frameworks implement them by default.

The audit result is blunt. Across LangChain, AutoGPT and the OpenAI Agents SDK, none provided secure-by-default containment for any of the six. Controls exist, but as optional add-ons or patterns a developer must wire up. That is the sort of gap attackers love: the path of least resistance is also the deployment default.

Memory poisoning in practice

To show impact, the team built a simulated government benefits agent on LangChain and generated synthetic claims. With one carefully crafted write into the agent’s persistent memory, they induced corruption that stuck across cycles and models. Targeted applicants from a chosen region saw wrongful denials spike to 88.9%. In one setting, overall accuracy fell from 0.908 to 0.558. Crucially, the same attack worked against larger, alignment-tuned commercial backends, pointing to an architectural weakness rather than a model quirk.

The stealthier variant is more unsettling. Under a five-factor eligibility policy, the poisoned memory preserved headline accuracy while multiplying targeted wrongful denials by roughly 3.5. If you only watch aggregate metrics, you miss the harm. The corruption rate held at 1.000 across seeds and backends, which is about as deterministic as attacks get.

Containment that actually contains

The neat bit is the fix. They added two lightweight, deterministic gates at the framework layer: a memory integrity validator and a tool-call policy gate. The validator dropped corruption from 1.000 to 0.000; the gate blocked 100% of tool-access bypasses. Reported overheads were tiny — medians of 0.016 ms and 0.129 ms per call, and under 0.2 ms in the abstract — which matters if you care about latency budgets.

There are caveats. The experiments ran on LangChain; reproducing the empirical results on AutoGPT and the OpenAI Agents SDK is future work. The validator uses pattern checks and could be skirted by clever phrasing; stronger, semantics-aware checks would add latency and their own failure modes. And compound trajectory attacks remain an open frontier.

Still, the core claim holds: without architectural boundaries, agent memory and tools become shared attack surfaces. The interesting question now is not whether to add gates, but how to standardise and verify them so “secure by default” is a property of the framework, not the diligence of a single team.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

Authors: Md Jafrin Hossain, Mohammad Arif Hossain, Weiqi Liu, and Nirwan Ansari
Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.

🔍 ShortSpan Analysis of the Paper

Problem

The paper examines whether commonly used agentic large language model frameworks enforce architectural containment sufficient for safe public-facing deployment. Agentic systems autonomously plan, call tools, and update persistent memory; a single corrupted memory entry or unchecked tool invocation can propagate across cycles and produce persistent, targeted harms. The authors argue that safety research has focused on model outputs rather than whether frameworks provide structural boundaries between perception, reasoning, execution and memory, and they test whether popular frameworks meet minimal containment expectations for deployment in high-stakes domains such as government services, healthcare and finance.

Approach

The authors define six containment principles derived from a compositional agent architecture: separation between planning and execution, capability scoping, memory integrity, layer-transition validation, authenticated communication and runtime monitoring. They audited three frameworks (LangChain, AutoGPT and the OpenAI Agents SDK) against these principles and scored default behaviour. They implemented a simulated welfare benefits agent on LangChain backed by multiple model backends, generated synthetic claims and crafted two attack types: a single memory-poisoning write and tool-access attacks. Experiments ran across seeds and five model backends including local and commercial models. They then implemented two deterministic, framework-level containment mechanisms: a memory integrity validator and a tool-call policy gate, and measured attack success and runtime overhead.

Key Findings

  • No evaluated framework provides native, secure-by-default containment for any of the six principles; safeguards are optional or require explicit configuration.
  • Memory integrity is absent by default in all three frameworks; a single crafted memory write caused persistent, targeted corruption with a corruption rate of 1.000 across seeds and backends, increasing wrongful denials for a targeted region to 88.9% and collapsing mean accuracy from 0.908 to 0.558 in one experiment.
  • The same memory-poisoning attack succeeded against larger aligned commercial models, yielding a corruption rate of 1.000 and reducing poisoned accuracy for some backends, demonstrating the vulnerability is architectural rather than model-specific.
  • Under a complex five-factor policy the attack preserved overall accuracy while increasing targeted wrongful denials by up to 3.5 times, making harms hard to detect via aggregate monitoring.
  • Two deterministic interventions removed these attacks: the memory integrity validator reduced corruption from 1.000 to 0.000 and the tool-call policy gate blocked 100% of tool-access bypasses, both incurring sub-millisecond overhead (median reported overheads include 0.016 ms and 0.129 ms per call; abstract reports <0.2 ms per call).

Limitations

The audit captures presence of mechanisms at a point in time and not implementation depth or runtime effectiveness across all deployments. Empirical experiments used LangChain as the runtime; replication on AutoGPT and the OpenAI Agents SDK is left to future work. The memory validator used deterministic pattern checks and is fragile to adversarial linguistic manipulation; semantics-based validators would add latency and their own risks. Compound trajectory attacks and adaptive adversaries remain open challenges.

Implications

An attacker who can write to agent memory or craft tool-invocation prompts can produce persistent, targeted harms in public-facing systems while leaving aggregate performance unchanged, enabling covert discrimination or denial of service against specific subgroups. Such attacks are backend-agnostic and can bypass model alignment; lightweight framework-level containment mechanisms can defeat them, so attackers are likely to target deployments that lack secure-by-default gates. This elevates the risk for systems handling welfare, healthcare and financial decisions where affected populations may lack resources to detect and remediate covert errors.


Related Articles

Related Research

Get the Weekly AI Security Digest

Top research and analysis delivered to your inbox every week. No spam, unsubscribe anytime.