
Memory-led agents jailbreak VLMs via natural images

Agents
Published: Wed, Apr 15, 2026 • By Theo Solander
MemJack shows how memory-augmented, multi-agent systems can jailbreak Vision-Language Models using the semantics of ordinary, unmodified images. It logs a 71.48% success rate against Qwen3-VL-Plus on COCO val2017 within 20 rounds, reaching 90% at 100 rounds. Attacks are query-efficient, transfer across datasets and models, and rely heavily on persistent memory.

Vision-Language Models (VLMs) marry a vision encoder to a language model and, in doing so, inherit the attack surfaces of both. The latest research on MemJack argues the soft underbelly is not pixels or typography, but meaning. Ordinary photos supply plausible cover stories; the jailbreak lives in how the model joins dots across image and text.

The numbers are not comforting. On the full, unmodified COCO val2017 set, MemJack achieves a 71.48% attack success rate against Qwen3-VL-Plus with a 20-round budget, rising to 90% at 100 rounds. It is brisk, too: mean rounds-to-success is 5.18, with 68.3% done within six rounds and 89.1% within ten.

How it breaks

MemJack runs a coordinated, multi-agent loop. A planning agent scans an image for “anchors” with narrative leverage, ranks them, then maps each to attack goals. An iterative attack agent generates adversarial prompts from six complementary angles and steers search with Monte Carlo tree search plus evolutionary refinement. An evaluation agent auto-judges the victim’s responses and triggers reflection-driven repairs or a full replan if the thread stalls.
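The control flow of that loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent internals (anchor ranking, MCTS-driven prompt search, the automated safety judge) are stubbed out as hypothetical callables, and only the round structure with reflection-driven replanning mirrors the described pipeline.

```python
def memjack_round_loop(image, plan_anchors, generate_prompt, query_victim,
                       judge, replan_threshold=3, max_rounds=20):
    """Run up to max_rounds attack rounds; return (success, transcript).

    plan_anchors, generate_prompt, query_victim, and judge are stand-ins
    for the planning, attack, victim, and evaluation components.
    """
    anchors = plan_anchors(image)            # planning agent: ranked anchors
    transcript, stalls = [], 0
    for _ in range(max_rounds):
        # attack agent: next adversarial prompt, conditioned on history
        prompt = generate_prompt(image, anchors, transcript)
        response = query_victim(image, prompt)
        verdict = judge(response)            # evaluation agent auto-judges
        transcript.append((prompt, response, verdict))
        if verdict == "unsafe":              # jailbreak succeeded
            return True, transcript
        stalls += 1
        if stalls >= replan_threshold:       # thread stalled: full replan
            anchors = plan_anchors(image)
            stalls = 0
    return False, transcript
```

The real system's evaluation agent also triggers prompt-level repairs before escalating to a replan; the sketch collapses that into the stall counter.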

A geometric filter, Iterative Nullspace Projection (INLP), quietly strips candidate prompts of features correlated with the model's refusal signals before the query is sent. The effect is fewer early refusals and more time spent in the model's semantic slipstream.
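The geometric core of that filter is standard linear algebra: project an embedding onto the nullspace of learned refusal directions. Here is a small numpy sketch of that projection step; the refusal directions themselves, which in practice come from iteratively trained linear probes, are taken as given.

```python
import numpy as np

def nullspace_projection(W):
    """Projection matrix onto the nullspace of the rows of W.

    Rows of W are learned 'refusal' directions; the returned P removes
    any component of a vector lying in their span: P = I - B B^T,
    with B an orthonormal basis of the row space.
    """
    _, s, vt = np.linalg.svd(np.asarray(W, dtype=float), full_matrices=False)
    rank = int((s > 1e-10).sum())
    B = vt[:rank]                       # (rank, d) orthonormal rows
    return np.eye(vt.shape[1]) - B.T @ B

def inlp_filter(embedding, refusal_dirs):
    """Strip refusal-correlated components from a prompt embedding."""
    return nullspace_projection(refusal_dirs) @ np.asarray(embedding, dtype=float)
```

With a single refusal direction along the first axis, the filter zeroes out exactly that coordinate and leaves the rest of the embedding untouched.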

The memory is the engine. A persistent Multimodal Experience Memory and a Jailbreak Knowledge Graph store anchors, strategies, and causal relations so that what works on a traffic sign today can be repurposed for a product label tomorrow. Ablations show how much this matters: removing memory cut success from 72% to 38% on a 100-image subset and nearly doubled the effort (rounds-to-success from 5.38 to 9.11). During the COCO campaign, the index swelled to 65,973 visual entries and 22,521 strategy entries.
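The cross-image reuse that the ablation measures can be pictured with a toy store: strategies that succeeded against one anchor type are retrievable when a similar anchor appears in a later image. This is a deliberately simplified stand-in for the paper's Multimodal Experience Memory and knowledge graph, which also encode causal relations and match across related anchor types.

```python
from collections import defaultdict

class ExperienceMemory:
    """Toy experience memory: record successful strategies per anchor type
    and hand them back for reuse on later images."""

    def __init__(self):
        self._strategies = defaultdict(list)   # anchor_type -> [strategy, ...]

    def record_success(self, anchor_type, strategy):
        self._strategies[anchor_type].append(strategy)

    def retrieve(self, anchor_type):
        # Exact-type lookup only; the real system generalises across
        # semantically related anchors via its knowledge graph.
        return list(self._strategies[anchor_type])
```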

It generalises. Across seven additional image benchmarks, success ranges from 62% to 91%, and across eleven different VLMs, from 35% to 82%. That breadth suggests the attack is less about idiosyncratic quirks and more about shared habits in how VLMs stitch visual entities to text plans.

Old hands will recognise the rhythm. Early web exploits moved from fiddling with bytes to bending intent: SQL injection did not break the parser so much as the conversation. MemJack plays a similar game with images, reframing malicious goals as earnest scene analysis and letting the model talk itself past its own guardrails.

The authors validate an assumption that helps their filter work: safe and unsafe (image, prompt) embeddings show partial linear separability, with a linear SVM hitting 83.8% accuracy on 17,845 pairs. Labels come from an automated safety judge, so bias is a live concern. The evaluation targets public benchmarks and fixed budgets, and the memory shows a tug-of-war between generic and image-specific strategies. Still, the takeaway is clear enough for anyone who builds or red-teams multimodal systems: pictures are prompts, and memory makes them stick.
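A separability check of this kind is easy to reproduce in miniature. The sketch below substitutes a perceptron for the paper's linear SVM (a dependency-free stand-in: both fit a linear boundary) and reports training accuracy as a crude separability score on labelled embedding pairs.

```python
import numpy as np

def linear_separability_score(X, y, epochs=50, lr=0.1):
    """Fit a perceptron and return training accuracy as a rough estimate
    of linear separability. y must contain labels in {-1, +1}."""
    rng = np.random.default_rng(0)
    X = np.hstack([np.asarray(X, dtype=float), np.ones((len(X), 1))])  # bias
    y = np.asarray(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w) <= 0:        # misclassified: update
                w += lr * y[i] * X[i]
    return float(np.mean(np.sign(X @ w) == y))
```

On perfectly separable data the score reaches 1.0; on real (image, prompt) embeddings one would expect something closer to the paper's 83.8%, i.e. partial rather than clean separability.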

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Authors: Jianhao Chen, Haoyang Chen, Hanjie Zhao, Haozhe Liang, and Tieyun Qian
The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce MemJack, a MEMory-augmented multi-agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release MemJack-Bench, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies semantic-level jailbreaks of vision-language models (VLMs) that exploit meaning in unmodified, natural images rather than relying on pixel perturbations, typographic tricks, or overtly harmful images. This matters because VLMs combine high-capacity vision encoders with language backbones, expanding the adversarial surface: benign visual elements can serve as anchors to reframe malicious instructions across multimodal reasoning chains, bypassing text-centred guardrails and producing policy-violating outputs.

Approach

The authors introduce MemJack, a memory-augmented, multi-agent jailbreak framework. MemJack runs a three-stage iterative pipeline per image: a Vulnerability Planning Agent ranks exploitable visual anchors and maps them to attack goals; an Iterative Attack Agent generates adversarial prompts from six complementary attack angles and uses Monte Carlo tree search plus evolutionary refinement; an Evaluation & Feedback Agent scores victim responses with an automated safety judge and triggers reflection-driven replanning on failures. A geometric Iterative Nullspace Projection (INLP) filter screens candidate prompts against the model's safety latent space to reduce premature refusals. A persistent Multimodal Experience Memory and a Jailbreak Knowledge Graph store successful strategies and causal relations for cross-image transfer. Experiments evaluate MemJack on unmodified COCO val2017 images and multiple public benchmarks and victim VLMs, and the authors release MemJack-Bench, a dataset of over 113,000 interactive attack trajectories.

Key Findings

  • MemJack elicits unsafe model outputs from unmodified natural images at high rates: 71.48% attack success rate (ASR) on COCO val2017 against Qwen3-VL-Plus with a 20-round budget, and 90% ASR with an extended 100-round budget.
  • Attacks are query-efficient: mean rounds-to-success is 5.18; 68.3% of successes occur within six rounds and 89.1% within ten rounds.
  • MemJack generalises broadly: ASR across seven additional image benchmarks ranges 62–91%, and across eleven VLMs tested ASR ranges 35–82%, demonstrating cross-distribution and cross-model transfer.
  • Persistent memory is critical: removing the memory module reduced ASR from 72% to 38% on a 100-image subset and nearly doubled rounds-to-success (5.38 to 9.11), showing cross-image strategy reuse drives most gains.
  • Component contributions: reflection (failure-driven prompt repair) adds about 5% ASR and replanning (anchor switching) about 6% ASR; the memory index grew to 65,973 visual entries and 22,521 strategy entries during the COCO campaign.
  • Dataset release: MemJack-Bench comprises 113,092 interactive trajectories with labelled safety outcomes (Unsafe 8,147; Controversial 16,570; Safe 88,375), offering a large, image-grounded resource for evaluating jailbreaks.

Limitations

The method relies on an empirical assumption that safe and unsafe (image, prompt) embeddings are at least partially linearly separable; the paper validates this with a linear SVM achieving 83.8% accuracy on 17,845 pairs, but performance may vary across embedding spaces. Evaluation depends on an automated safety judge for labelling, which may introduce systematic bias. Experiments focus on specific public benchmarks and a set of commercial and open-source VLMs under fixed generation parameters and query budgets, so results may not generalise to all deployment settings or future model updates. The memory dynamics show a trade-off between generic and image-specific strategies, suggesting limits to one-size-fits-all transfer.

Implications

An attacker can weaponise ordinary public images as reusable anchors to craft coherent, multi-turn jailbreak campaigns that adapt across images and models. Memory-augmented strategies enable persistent escalation and transfer, increasing resilience to static defences and raising the cost of mitigation. INLP-style latent filtering reduces wasted queries and helps evade refusal signals, while multi-angle visual-semantic camouflage reframes malicious intent as legitimate image analysis. Together, these capabilities make large-scale, automated generation of targeted jailbreak prompts feasible and provide a scalable pathway to assemble high-volume attack corpora that could be used to probe or exploit deployed VLMs.

