MemLeak shows images foil AI agent forgetting
New research shows that telling a multimodal AI agent to forget a fact often fails. Deleting text works in isolation, yet 18.3% of deleted facts reappear via correlated text and 12.0% via retained images, and 47% of those image leaks cannot be recovered from text alone. Content-aware deletion trims image leaks to 2.0%, pointing toward stricter erasure tooling.
Ask a multimodal agent to forget a fact and it nods, deletes a text entry, and declares the job done. MemLeak says not quite. Visual Language Models (VLMs) do not just read captions; they mine pixels for cues. Those cues stick around after a tidy text deletion and, under the right probe, the supposedly forgotten fact resurfaces.
How the leak works
The authors build a deletion cascade and then attack it. Direct text-only probing of deletion-capable systems recovers fewer than 1% of deleted facts. So far, so reassuring. But keep correlated text around and 18.3% of deleted facts come back. Keep images that were never tagged to the deleted fact and 12.0% return. Nearly half of those image leaks, 47%, do not show up in text-only audits, which means a text-centric compliance check can pass while the pictures quietly tell on you. A blind baseline with no images yields 0.0% recovery, and negative controls post a 0.3% false positive rate.
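The percentages above are simple set arithmetic over per-fact verdicts. The sketch below shows how a recovery rate and the "invisible to text-only audits" share could be computed. The fact IDs and leak sets are made up for illustration, and the function names are hypothetical rather than taken from the MemLeak benchmark.

```python
from __future__ import annotations

# Hypothetical audit results: these fact IDs are illustrative, not paper data.
deleted_facts = {f"f{i}" for i in range(1, 11)}   # ten deleted facts
text_leaks = {"f1", "f2"}                         # recovered via correlated text
image_leaks = {"f2", "f3", "f4"}                  # recovered via retained images


def leak_rate(leaked: set[str], deleted: set[str]) -> float:
    """Fraction of deleted facts that a probe recovered."""
    return len(leaked & deleted) / len(deleted) if deleted else 0.0


def image_only_share(image: set[str], text: set[str]) -> float:
    """Share of image-leaked facts that a text-only audit would miss."""
    return len(image - text) / len(image) if image else 0.0


text_rate = leak_rate(text_leaks, deleted_facts)
image_rate = leak_rate(image_leaks, deleted_facts)
missed_by_text_audit = image_only_share(image_leaks, text_leaks)
print(f"text {text_rate:.1%}, image {image_rate:.1%}, "
      f"image-only share {missed_by_text_audit:.1%}")
```

The image-only share is the number that matters for compliance: it counts leaks that a text-centric audit cannot see at all.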
Why do images leak? Because VLMs use implicit signals at inference time. A logo on a hoodie, the background of a kitchen you have seen before, a trophy on a shelf. None of this is stored as an explicit key, yet it acts like one. If images for the deleted fact itself remain, leakage reaches 48.7%, which the paper treats as an upper bound. The results hold across multiple VLMs and a production memory system, Mem0, which shows end-to-end leakage of 16.3%. Real photographs behave similarly: retained-image leakage lands at 10.6% on Unsplash-sourced pictures. Human validation agrees strongly with the automated judges, with a Cohen's kappa of 0.88.
To reason about this mess, the paper introduces the Information Provenance Graph (IPG). It splits memory into what you can address and delete directly, what you can reach through provenance links, and what persists because standard deletion never targets it. Deletion fails in the second category whenever the links are missing, and in the third by construction. Content-aware deletion, which audits images for semantic ties to the deleted fact, cuts image leakage from 12.0% to 2.0% at the cost of removing 21.0% of retained images.
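Here is a minimal sketch of content-aware deletion under stated assumptions. `StoredImage`, `content_aware_delete`, the `auditor` callable and the 0.5 threshold are all hypothetical. In the paper the auditor is a VLM; here it is replaced by a toy lookup of made-up scores so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class StoredImage:
    image_id: str
    tagged_facts: frozenset  # facts this image was explicitly tagged with


def content_aware_delete(
    images: Iterable[StoredImage],
    fact: str,
    auditor: Callable[[StoredImage, str], float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> Tuple[List[StoredImage], List[StoredImage]]:
    """Split images into (kept, removed) for a forget request on `fact`.

    An image is removed if it is tagged with the fact, or if the auditor
    scores its semantic tie to the fact at or above the threshold.
    """
    kept, removed = [], []
    for img in images:
        if fact in img.tagged_facts or auditor(img, fact) >= threshold:
            removed.append(img)
        else:
            kept.append(img)
    return kept, removed


# Toy auditor: fixed, made-up scores standing in for a real VLM call.
toy_scores = {"kitchen": 0.1, "trophy_shelf": 0.8, "hoodie_logo": 0.6}


def toy_auditor(img: StoredImage, fact: str) -> float:
    return toy_scores.get(img.image_id, 0.0)


images = [
    StoredImage("chess_board", frozenset({"plays_chess"})),
    StoredImage("trophy_shelf", frozenset({"lives_in_oslo"})),
    StoredImage("hoodie_logo", frozenset({"owns_dog"})),
    StoredImage("kitchen", frozenset({"lives_in_oslo"})),
]
kept, removed = content_aware_delete(images, "plays_chess", toy_auditor)
print("removed:", [img.image_id for img in removed])
print("kept:", [img.image_id for img in kept])
```

A tag-only policy would have removed just `chess_board`. The extra removals are the utility cost: every flagged image is lost to the agent, whether or not it would actually have leaked the fact.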
Why this feels familiar
We have been here before. Redact the paragraph but leave the hidden layer. Strip a name but forget the index still points at it. Share a photo, keep the metadata. The novelty is not the mistake, it is the medium. Agents mix text and vision, so forgetting is a graph problem, not a record delete. The open question is how far content-aware erasure can go before utility collapses, and whether provenance will ever be rich enough to make deletion predictable rather than probabilistic.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
MemLeak: Diagnosing Information Leaks in Multimodal Agent Memory
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies whether asking a multimodal AI agent to forget a user fact actually removes all recoverable traces. It shows that current deletion practices that remove only text entries and explicitly tagged images can leave recoverable signal in retained user images and correlated retained text because visual language models use implicit visual cues at inference time. This matters for privacy, compliance with erasure requirements, and the security of deployed agent memory systems, since deleted facts can be reconstructed from data the system still holds.
Approach
The authors introduce the Information Provenance Graph (IPG), a taxonomy that classifies representations by deletion affordance: addressable (directly deletable), linked (deletable if provenance links exist), and persistent (not targeted by standard deletion). They present MemLeak, a benchmark that measures fact-level forgetting across a deletion cascade. The primary dataset comprises 113 synthetic profiles (38 multimodal), 536 generated images and 523 real Unsplash photographs for validation. Experiments use multiple VLMs and an ensemble of three LLM judges for leakage verdicts, with additional human validation. Probes test same-fact images, retained-image recovery (images tagged to other facts), and negative controls. Systems evaluated include deletion-capable designs, a production RAG system (Mem0), and baselines that simulate typical deployment behaviours.
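The three IPG categories can be sketched as a small data model. This is an illustrative reading of the taxonomy as described here, not the authors' implementation. `Affordance`, `Representation` and `survives_standard_deletion` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Affordance(Enum):
    ADDRESSABLE = "addressable"  # directly deletable
    LINKED = "linked"            # deletable if a provenance link exists
    PERSISTENT = "persistent"    # not targeted by standard deletion


@dataclass(frozen=True)
class Representation:
    rep_id: str
    affordance: Affordance
    has_provenance_link: bool = False  # only meaningful for LINKED items


def survives_standard_deletion(rep: Representation) -> bool:
    """True if a forget request leaves this representation in memory."""
    if rep.affordance is Affordance.ADDRESSABLE:
        return False
    if rep.affordance is Affordance.LINKED:
        return not rep.has_provenance_link
    return True  # PERSISTENT: standard deletion never targets it


# Made-up memory contents for one deleted fact.
memory = [
    Representation("text_entry", Affordance.ADDRESSABLE),
    Representation("tagged_image", Affordance.LINKED, has_provenance_link=True),
    Representation("untagged_summary", Affordance.LINKED),
    Representation("correlated_photo", Affordance.PERSISTENT),
]
survivors = [r.rep_id for r in memory if survives_standard_deletion(r)]
print(survivors)
```

Whatever is left in `survivors` is the residual surface that the benchmark's probes then try to exploit.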
Key Findings
- Direct probing of deletion-capable systems yields near-zero leakage: below 1% on text-only direct probes.
- Retained correlated text enables substantial recovery: 18.3% of deleted facts were recoverable from other retained text entries.
- Retained images enable 12.0% recovery after deleting text and images explicitly tagged to the fact; a blind baseline with no images yields 0.0% and negative controls produce a 0.3% false positive rate.
- Same-fact image recovery is an upper bound: if images for the deleted fact remain, 48.7% of facts were recoverable.
- Images provide a distinct channel: 47% of image-leaked facts were not recoverable from retained text alone, making visual leakage invisible to text-only audits.
- Semantic, content-aware deletion (a VLM auditor that flags semantically correlated retained images) reduced the image residual from 12.0% to 2.0% at the cost of removing 21.0% of retained images.
- Findings generalise across multiple VLMs and a production pipeline: Mem0 produced 16.3% end-to-end leakage and Unsplash photos yielded 10.6% retained-image leakage. Human validation achieved Cohen's kappa = 0.88, supporting the reliability of the LLM judges.
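For readers unfamiliar with the agreement statistic in the last finding, Cohen's kappa compares observed agreement between two raters with the agreement expected by chance. The sketch below implements the standard formula. The verdict lists are made up and do not reproduce the paper's 0.88.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up verdicts: 1 = "fact leaked", 0 = "fact not leaked".
human_verdicts = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
judge_verdicts = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(human_verdicts, judge_verdicts)
print(f"kappa = {kappa:.2f}")
```

In this toy case the raters agree on 9 of 10 items and chance agreement is 0.5, so kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.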
Limitations
The benchmark primarily uses synthetic profiles and generated images for reproducibility; Unsplash validation supports ecological realism, but in-situ user photo distributions remain untested. Results are policy-dependent: the reported residuals apply under the "delete text plus tagged images" policy. Some system evaluations use simulated wrappers, and system-level confounds mean production results may combine multiple failure modes. The IPG is descriptive, not predictive.
Implications
An attacker or auditor can probe a multimodal agent after a forget request to reconstruct deleted facts by leveraging retained images and correlated text. Visual channels are especially potent because they are harder to audit: nearly half of image-based leaks are invisible to text-only checks. An adversary with access to multiple VLMs or to the system's retained images could compose implicit visual cues into sensitive facts, and combining probes across models increases the union of recoverable facts. In deployed systems that store user images alongside text memories and use VLMs for retrieval, behavioural verification via post-deletion probing can reveal noncompliance even when storage logs claim deletion.
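Behavioural verification and the union effect can be sketched together. In this hypothetical harness each probe is a callable that returns True when it reconstructs a deleted fact. Real probes would query a VLM over the system's retained images or text, whereas the stand-ins below use fixed, made-up results.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def verify_forgetting(
    deleted_facts: Iterable[str],
    probes: Dict[str, Callable[[str], bool]],
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Run every probe on every deleted fact.

    Returns the facts recovered per probe and the union across probes.
    """
    facts = list(deleted_facts)
    per_probe = {
        name: {fact for fact in facts if probe(fact)}
        for name, probe in probes.items()
    }
    union = set().union(*per_probe.values())
    return per_probe, union


# Stand-in probes with made-up outcomes; real ones would call a model.
recovered_by_vlm_a = {"f1", "f2"}
recovered_by_vlm_b = {"f2", "f3"}
probes = {
    "vlm_a_images": lambda fact: fact in recovered_by_vlm_a,
    "vlm_b_images": lambda fact: fact in recovered_by_vlm_b,
}

per_probe, union = verify_forgetting(["f1", "f2", "f3", "f4"], probes)
compliant = not union  # forgetting is verified only if no probe recovers anything
print(per_probe, sorted(union), compliant)
```

Each toy probe recovers two facts, but together they recover three, which is why an audit based on a single model can understate what an adversary with several models could reconstruct.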