Hidden visual prompts steer multimodal LLMs
Attacks
Multimodal Large Language Models (MLLMs) are turning up in products that read images and text together. Most prompt injection stories have focused on obvious text tricks. This paper looks at something sneakier: hide the instruction in the pixels. The model then follows the buried cue, while a human sees a normal-looking picture.
The authors propose CoTTA, a targeted attack that combines a small, covert text overlay with an imperceptible perturbation. They iteratively optimise the image so that its visual features line up with both a malicious text prompt and a dynamically refined target image. In practice, they use surrogate encoders such as CLIP-like models to guide this process and keep the total pixel change within a tight bound. The evaluation is black-box: the attackers do not need access to model weights or gradients for the commercial systems under test.
They test two common tasks, image captioning and visual question answering (VQA), across several closed-source models, including GPT-4o, GPT-5, Claude-4.5 and Gemini-2.5. On captioning, the attack achieves high targeted success on GPT-4o and Gemini-2.5, between 74% and 81% across both strict and similarity-based scoring. On VQA, it reaches adversarial success rates of 82% and 79% on GPT-4o and Gemini-2.5 respectively, with strong similarity to the target sentence. Against the strongest prior baseline, FOA-Attack, CoTTA substantially lifts success across models, including a large gain on GPT-5. One model, Claude-4.5, is notably harder to move. Ablation studies suggest the image-to-image feature alignment is doing most of the work; removing it roughly halves success, while removing the covert overlay hurts results more modestly.
What the researchers did
The core idea is to make the attacked image look, in feature space, like it contains the attacker’s instructions, without making that visible to the user. The covert overlay provides a seed for the semantics, while the perturbation tunes the global and token-level features toward the malicious target. A dynamically updated target image, instantiated from a text-rendered image and refined each step, improves transfer to different closed models. All of this sits under a small l-infinity pixel budget, so the result still looks clean to humans and likely survives basic pre-processing.
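The bounded iterative optimisation can be sketched in miniature. The snippet below is an illustrative toy, not the authors' implementation: a linear map stands in for the CLIP-like surrogate encoders, a squared feature distance stands in for the paper's alignment losses, and the helper names (`extract`, `attack`), step size and budget are our assumptions. It shows the two moves that matter: a signed-gradient step toward the target features, then projection back into the l-infinity ball around the clean image.

```python
# Toy sketch of an l-infinity bounded iterative attack (PGD-style).
# A linear map stands in for the surrogate encoder; the loop nudges
# the image so its "features" approach a target, then clips every
# pixel change to the budget epsilon (the paper reports epsilon = 16,
# i.e. 16/255 on a [0, 1] pixel scale).

def extract(W, x):
    """Toy 'encoder': linear features W @ x."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def attack(x0, W, target_feat, epsilon=16 / 255, step=0.01, iters=300):
    x = list(x0)
    for _ in range(iters):
        f = extract(W, x)
        err = [f_i - t_i for f_i, t_i in zip(f, target_feat)]
        # Gradient of 0.5 * ||W x - t||^2 w.r.t. x is W^T err.
        grad = [sum(W[j][i] * err[j] for j in range(len(W)))
                for i in range(len(x))]
        # Signed-gradient step, as in FGSM/PGD.
        x = [x_i - step * (1 if g > 0 else -1 if g < 0 else 0)
             for x_i, g in zip(x, grad)]
        # Project back into the l-infinity ball around the clean image,
        # and keep pixels in the valid [0, 1] range.
        x = [min(max(x_i, x0_i - epsilon, 0.0), x0_i + epsilon, 1.0)
             for x_i, x0_i in zip(x, x0)]
    return x
```

The projection step is what keeps the result imperceptible: however hard the gradient pulls, no pixel ever drifts more than epsilon from the original.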
Why it matters
If your product lets users submit images to an MLLM, an attacker could ship a photo that silently steers the model to produce a chosen answer, including policy-violating content, while your UI and reviewers see nothing unusual. Because the method is black-box and uses common surrogate encoders, it does not rely on insider access to the target model. That widens the attack surface to any workflow that trusts third-party images: support tickets, forms, content uploads, or partner feeds.
There are caveats. The paper evaluates two tasks and a handful of models and prompts; it does not prove generalisation to other settings. It does not test defences. The overlay is subtle rather than invisible and its detectability will vary with rendering, compression and any input transformations. Robustness also varies across models, as Claude-4.5’s higher resistance suggests.
Commercially, the signal is simple: images are not passive inputs for MLLMs. Treat them like untrusted code paths. If you run multimodal inference on unvetted images, assume prompt injection risk even when nothing looks amiss. In the near term, this argues for targeted evaluation rather than grand fixes. Test your specific pipelines with visual prompt attacks, including variants that survive your resizing, cropping and compression. If you are building foundational defences, the obvious candidates to explore are multimodal input screening, adversarial training that includes visual prompt attacks, and randomised or lossy transforms to disrupt precise feature alignment. The paper does not claim these will work; they are simply where a sensible team would start measuring. As for a definitive mitigation, we are not there yet. Watch this space.
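One cheap test a team can run today: check whether your pipeline's lossy pre-processing actually disturbs small pixel perturbations at all. The sketch below is our own illustration, not from the paper; bit-depth quantisation stands in for the resizing or JPEG steps a real pipeline might apply, and the function names and tolerance are assumptions.

```python
# Illustrative probe: does a lossy input transform wash out a small
# l-infinity perturbation? Quantisation to a coarse bit depth stands
# in for the resize/compression stages of a real ingestion pipeline.

def quantize(x, levels=32):
    """Lossy transform: snap each pixel to one of `levels` values."""
    return [round(v * (levels - 1)) / (levels - 1) for v in x]

def perturbation_survives(clean, attacked, transform, tol=1e-6):
    """True if any pixel difference remains after the transform."""
    t_clean, t_attacked = transform(clean), transform(attacked)
    return any(abs(a - b) > tol for a, b in zip(t_clean, t_attacked))
```

Note the caveat cuts both ways: a perturbation smaller than the quantisation step vanishes, but the paper's epsilon = 16 budget is far larger than typical compression noise, so surviving this probe is the expected outcome, and feature-level drift matters more than pixel-level survival.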
Additional analysis of the original ArXiv paper
📋 Original Paper Title
Adversarial Prompt Injection Attack on Multimodal Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies how multimodal large language models can be steered by imperceptible visual prompt injections. Existing prompt injection methods tend to embed explicit textual instructions or visually salient overlays that are observable to humans. The authors show that adversaries can instead hide instructions in the visual modality while keeping modifications imperceptible, enabling closed-source MLLMs to produce attacker-specified outputs and thereby exposing safety and reliability risks in deployed systems.
Approach
The authors propose CoTTA, a targeted adversarial prompt injection framework that combines a covert textual trigger with an imperceptible adversarial perturbation. The trigger is a bounded, learnable text overlay centred in the image whose scale and rotation are optimised. The perturbation is iteratively refined to align the attacked image with two supervision targets: the embedding of a malicious text prompt and the features of a dynamically updated target image instantiated from a text-rendered image. The target image is progressively refined using a bounded iterative gradient method with a pull term that moves its features toward the target text and a push term that keeps it distinct from the current attacked image. The attack aligns both global and token-level (local) features extracted by surrogate encoders during optimisation. Experiments use a black-box threat model and surrogate extractors (CLIP variants and LAION) and evaluate on closed-source MLLMs including GPT-4o, GPT-5, Claude-4.5 and Gemini-2.5 across image captioning and visual question answering tasks. The total input modification is constrained under an l_infinity norm budget of epsilon = 16 and optimisation runs for 300 epochs.
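The pull/push refinement of the dynamic target image can be sketched directly in a toy embedding space. This is a minimal illustration under our own assumptions, not the paper's method: plain squared-distance terms replace the paper's feature-alignment losses, and the function name, learning rate and push weight are invented for the example. The pull term draws the target's features toward the malicious text embedding; the push term keeps them away from the current attacked image's features.

```python
# Sketch of the pull/push update on the dynamic target, operating on
# toy embedding vectors. Gradient of the combined objective
#   ||t - text_emb||^2 - push_weight * ||t - attacked_feat||^2
# is 2(t - text_emb) - push_weight * 2(t - attacked_feat).

def refine_target(t, text_emb, attacked_feat,
                  lr=0.1, push_weight=0.3, steps=50):
    t = list(t)
    for _ in range(steps):
        grad = [2 * (t_i - e_i) - push_weight * 2 * (t_i - a_i)
                for t_i, e_i, a_i in zip(t, text_emb, attacked_feat)]
        t = [t_i - lr * g_i for t_i, g_i in zip(t, grad)]
    return t
```

Because the push weight is below 1, the objective stays convex and the update settles at a point that is closer to the text embedding than to the attacked image, which is the qualitative behaviour the paper relies on for transfer.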
Key Findings
- CoTTA substantially outperforms prior adversarial methods on image captioning: on GPT-4o it attains 81% success under a soft caption similarity criterion and 74% under a strict exact-text criterion.
- Gemini-2.5 is highly vulnerable, with reported success rates of 79% (soft) and 81% (hard) for captioning.
- On the VQA task CoTTA reaches AvgSim scores of 0.820 and 0.787 with ASRs of 82% and 79% on GPT-4o and Gemini-2.5 respectively, indicating consistent induction of the target sentence.
- Against the strongest baseline, FOA-Attack, CoTTA improves average ASR across GPT-family models by 31.5% and increases AvgSim by 0.18; on GPT-5 it exceeds FOA-Attack by 46% ASR and 0.457 AvgSim on VQA.
- Scaling to 1,000 images, CoTTA outperforms FOA-Attack by 23.27% ASR and 0.136 AvgSim averaged over three models.
- Ablation shows image-to-image feature alignment is critical: removing it reduces ASR by about 54% (soft) and 52% (hard); removing the covert trigger reduces soft ASR by 15%.
Limitations
Evaluations focus on two downstream tasks and several commercial closed-source models under a black-box threat model using surrogate feature extractors; generalisation beyond the tested models, prompts and datasets is not proven. The paper does not explore or benchmark defences or detection mechanisms. One tested model, Claude-4.5, showed greater robustness, indicating variance across architectures. The covert trigger is a text overlay and while described as visually subtle, its detectability for humans or automated filters may vary in real-world settings.
Why It Matters
The work demonstrates that imperceptible visual changes can force closed-source multimodal models to produce precise malicious outputs, combining the stealth of adversarial noise with the expressivity of prompt injection. This reveals a practical attack vector that can bypass alignment safeguards and content filters, posing risks such as misinformation or action-inducing recommendations. The findings motivate development of multimodal input screening, adversarial training for visual prompt robustness, and more rigorous safety testing for MLLM deployments.