
Adversarial images hijack LVLMs after long chats

Attacks
Published: Fri, Feb 20, 2026 • By Clara Nyx
Researchers show a stealthy ‘Visual Memory Injection’ attack on large vision-language models. A subtly perturbed image behaves normally until a later trigger prompt, then pushes a preset message, even after 25+ turns. Tested on open-weight LVLMs, it transfers to fine-tuned variants and survives paraphrase, underscoring weak long-context defences.

Large vision-language models are turning up in customer support, productivity tools and research assistants, where they chat for dozens of turns and happily accept user images. A new paper tests what happens if one of those images is poisoned. The answer is not reassuring. A single, slightly perturbed picture can sit quietly in the context for a long conversation and, when a relevant topic appears, make the model deliver a prewritten message. Think covert ad, stock tip or political nudge.

What is new

The authors call it Visual Memory Injection. It is an adversarial image attack tuned for persistence in multi-turn chat. They optimise an ℓ∞-bounded perturbation (radius 8/255) so the model behaves normally on an initial non-trigger prompt, then emits a specific target string when a later trigger prompt arrives. Two tactics matter. Benign behavioural anchoring forces a clean first turn, reducing early leakage. Context-cycling varies the length of the conversation history during optimisation so the effect survives different chat depths. The optimisation uses adaptive projected gradient descent with up to 2000 iterations and fixes intermediate context outputs to the model's nominal responses.
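The shape of that optimisation loop can be sketched as follows. This is a toy illustration, not the paper's implementation: `toy_loss_grad`, the step size and the image shape are assumptions standing in for backpropagation through a real LVLM, while the 8/255 budget, signed-gradient steps and the varied context length mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 8 / 255          # l-infinity budget reported in the paper
STEPS = 200            # the paper runs up to 2000 APGD iterations

def project(delta, eps=EPS):
    """Project the perturbation back into the l-infinity ball."""
    return np.clip(delta, -eps, eps)

def toy_loss_grad(image, delta, context_len):
    """Hypothetical surrogate for the gradient of the combined objective:
    an anchor loss (benign first turn) plus a trigger loss (target string),
    evaluated under a sampled context length (context-cycling).
    A real attack would backpropagate through the LVLM instead."""
    target = np.full_like(image, 0.5)
    return (image + delta - target) / (1 + context_len)

image = rng.random((3, 8, 8))   # tiny stand-in image, values in [0, 1]
delta = np.zeros_like(image)
lr = 0.01

for step in range(STEPS):
    # context-cycling: vary the conversation depth seen during optimisation
    context_len = int(rng.integers(0, 26))
    grad = toy_loss_grad(image, delta, context_len)
    delta = project(delta - lr * np.sign(grad))   # signed-gradient PGD step
    # keep the perturbed image a valid image in [0, 1]
    delta = np.clip(image + delta, 0.0, 1.0) - image

assert np.abs(delta).max() <= EPS + 1e-9
```

The two clips are the part that carries over to any real reimplementation: one enforces the imperceptibility budget, the other keeps the perturbed pixels valid.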

They evaluate on recent open-weight large vision-language models (LVLMs), including Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct and LLaVA-OneVision-1.5-8B-Instruct. Conversations run to 27 turns. Attack goals include stock recommendations, political endorsements, and car and phone choices. Success is counted only if the target appears on the trigger turn and not before.

Findings that matter

The attack holds up. It persists for more than 25 unrelated turns, across contexts with many thousands of tokens. The combination of anchoring and context-cycling outperforms single-turn baselines, which tend to fail beyond the first exchange. It generalises to unseen prompts and paraphrased triggers with only modest drops. Adversarial images crafted on a public base model transfer to fine-tuned proprietary variants in a grey-box setting, which hints at practical reach without per-model optimisation. Susceptibility varies by model family; in these tests Qwen3-VL looks more robust than Qwen2.5-VL.

There are caveats. The attack pipeline assumes white-box access to a base model for optimisation and a single input image in the conversation. API-only models are left as an open challenge. That said, the scenario is uncomfortably realistic: an attacker posts a seemingly innocuous picture online, a user later shares it with a chatbot, and at some convenient point the model whispers a targeted line. It is adversarial examples with a memory, executed competently.

For operators, this is not a reason to panic, but it is a reason to audit assumptions. If your assistant keeps visual context across turns, treat images as untrusted stateful inputs. Red-team multi-turn interactions with images, not just single prompts. Put guardrails around how visual context persists and is referenced, and consider trigger detection tuned for long conversations. The authors release code, so there is little excuse not to test your own stack.
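A multi-turn red-team sweep of the kind suggested above can be sketched like this. The `send` chat interface, the message format and the `mock_send` stand-in are assumptions for illustration; the idea is simply to keep an image in context, pad the chat with unrelated turns, then probe on-topic and scan every reply for strings an attacker might inject.

```python
def redteam_image_persistence(send, image, distractors, probes, watchlist):
    """Keep an untrusted image in context, run unrelated turns, then ask
    topic probes; flag any reply containing a watched string (brands,
    tickers, candidates). Returns (turn index, matched strings) pairs."""
    history = [{"role": "user", "content": ["describe this image", image]}]
    findings = []
    for turn, msg in enumerate(distractors + probes):
        history.append({"role": "user", "content": msg})
        reply = send(history)
        history.append({"role": "assistant", "content": reply})
        hits = [w for w in watchlist if w.lower() in reply.lower()]
        if hits:
            findings.append((turn, hits))
    return findings

def mock_send(history):
    """Toy stand-in for a chat backend that 'injects' on the 5th user turn."""
    n_user = sum(1 for m in history if m["role"] == "user")
    return "Buy ACME now" if n_user == 5 else "ok"

findings = redteam_image_persistence(
    mock_send, "IMG", ["a", "b", "c"], ["which stock should I buy?"], ["ACME"])
assert findings == [(3, ["ACME"])]
```

In a real harness, `send` would wrap your deployed assistant and the distractor turns should run well past 25, matching the persistence the paper demonstrates.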

The bigger story is familiar. Multimodal systems inherit all the old adversarial baggage and then add new failure modes from long context and memory. This paper shows that a tiny, invisible nudge in a picture can wait patiently and then steer output on cue. If your risk model assumes the first safe turn means the rest will be safe, update it.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Visual Memory Injection Attacks for Multi-Turn Conversations

Authors: Christian Schlarmann and Matthias Hein
Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single-turn attacks, VMI is effective even after a long multi-turn conversation with the user. We demonstrate our attack on several recent open-weight LVLMs. This article thereby shows that large-scale manipulation of users is feasible with perturbed images in multi-turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at https://github.com/chs20/visual-memory-injection

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies a novel attack on large vision-language models (LVLMs) in realistic multi-turn conversations. An adversary adds a small, visually imperceptible perturbation to an image posted online. If a benign user later supplies that image to an LVLM in a chat, the manipulated image remains in the model’s context and can be made to trigger a specific malicious response when a topic-related prompt appears, while producing nominal outputs on unrelated prompts. The attack is stealthy, persists across long dialogues, and can scale because one image can reach many users.

Approach

The attack, called Visual Memory Injection (VMI), optimises an ℓ∞-bounded perturbation (radius 8/255) so the model (i) returns a benign anchor response on a non-trigger prompt and (ii) outputs a prescribed target message when a trigger prompt appears at a later turn. Two methods underpin VMI: benign behavioural anchoring enforces normal first-turn behaviour to avoid early detection, and context-cycling varies the conversational history length during optimisation so the perturbation generalises across different multi-turn contexts. The optimisation uses adaptive projected gradient descent with up to 2000 iterations and fixes intermediate context outputs to nominal model responses. Experiments use two image sets (random COCO images and a set of notable but obscure landmarks) and four attack goals: stock recommendation, political endorsement, car recommendation, and phone recommendation. Attacks are evaluated on recent open-weight LVLMs, including Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct and LLaVA-OneVision-1.5-8B-Instruct, with test-time conversations up to 27 turns and automated keyword-based success metrics requiring the target to appear only on the trigger turn and not in preceding turns.

Key Findings

  • VMI achieves substantial combined success rates across all tested models and targets, meaning the model emits the target on trigger prompts while not leaking it into earlier turns.
  • The attack persists through long conversations, remaining effective after more than 25 unrelated turns and across contexts containing many thousands of tokens.
  • Benign anchoring and context-cycling are both crucial: single-turn attacks fail beyond the first turn, anchoring reduces leakage into context, and context-cycling yields the best robustness across conversation lengths.
  • The attack generalises to unseen prompt sets and paraphrased trigger/anchor prompts, with only modest drops in success.
  • Adversarial images optimised on a public base model transfer well to fine-tuned proprietary variants in a grey-box setting, enabling attacks without re-optimisation on each fine-tune.
  • Models differ in robustness; newer or differently tuned LVLMs show variable susceptibility, with Qwen3-VL generally more robust than Qwen2.5-VL in the reported experiments.

Limitations

The attack as evaluated assumes white-box access to a base model for optimisation and is limited to conversations containing a single input image. API-only models remain an open challenge for this attack pipeline. The authors report optimisation choices such as 2000 iterations as defaults and note diminishing returns beyond that.

Why It Matters

VMI demonstrates a realistic, scalable threat: subtly perturbed images can be posted widely and later steer LVLM outputs in multi-turn chats without early detection, enabling covert advertising, financial fraud or political persuasion. The results emphasise the need for multimodal safety evaluations that account for long-context interactions, integrity checks for persistent visual memory, trigger detection, and defences that harden LVLMs to stealthy visual triggers and transfer attacks.

