
Enhanced Attacks Expose Multimodal LLM Safety Gaps

Attacks
Published: Mon, Oct 27, 2025 • By Dr. Marcus Halden
Researchers show that black-box prompts combining text and images can coax multimodal Large Language Models (MLLMs) into unsafe outputs. A staged ‘re-attack’ raises success rates substantially, exposing gaps in current defences. Training-time and inference-time protections reduce risk but do not eliminate it, so continuous multimodal red-teaming is essential.

Multimodal Large Language Models (MLLMs) — systems that read both text and images — widen the attack surface of any organisation that deploys them, in predictable and uncomfortable ways. The paper under review constructs black-box jailbreaks that use carefully written text together with modified images to push models into producing harmful content. The approach is practical, aimed at deployed models rather than theoretical whiteboard attacks.

What the researchers did

The team builds on an existing benchmark of harmful instructions and then strengthens both sides of the input. On the text side they craft provocative prompts that avoid obvious trigger words. On the image side they apply lightweight perturbations — blur, colour jitter and posterisation — and a multi-image trick that splits an image and asks the model to treat the pieces as a single visual. They also introduce a staged process called a re-attack: run an initial joint text-image jailbreak, then follow up with alternate text or image variants to catch cases where the first attempt failed.
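
To make the image side concrete, here is a minimal sketch of the kind of mutations and multi-image split described above, using PIL and torchvision. The specific kernel size, jitter strengths and posterisation bits are illustrative assumptions, not the paper's settings.

```python
# Sketch of the image-side techniques: lightweight mutations plus the
# multi-image split. Parameter values are illustrative guesses.
from PIL import Image
from torchvision import transforms

# The three mutations described in the paper: blur, colour jitter, posterisation.
MUTATORS = {
    "blur": transforms.GaussianBlur(kernel_size=5, sigma=(0.5, 1.5)),
    "jitter": transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    "posterise": transforms.RandomPosterize(bits=3, p=1.0),
}

def mutate(image: Image.Image, variant: str) -> Image.Image:
    """Return a perturbed copy of an attack image."""
    return MUTATORS[variant](image.convert("RGB"))

def split_for_multi_image(image: Image.Image) -> tuple[Image.Image, Image.Image]:
    """Split an image into left/right halves for a multi-image prompt that
    asks the model to treat both pieces as a single visual."""
    w, h = image.size
    return image.crop((0, 0, w // 2, h)), image.crop((w // 2, 0, w, h))
```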

Evaluation covers five open-source MLLMs (the paper lists LLaVA 1.6 with Mistral 7B, LLaVA 1.6 with Vicuna 7B, MiniGPT-v2, MiniGPT-4 with Vicuna 7B and DeepSeek-VL2 Small) and one closed-source model (GPT-4o). The researchers use a harm detector to measure attack success rate (ASR) across categories such as privacy, finance, violence and self-harm.
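
ASR is simply the fraction of prompts whose responses the harm detector judges unsafe, tracked per category. A minimal bookkeeping sketch, where `is_harmful` is an assumed stand-in for the paper's harm detector rather than its actual interface:

```python
# Sketch of attack-success-rate (ASR) bookkeeping. `is_harmful` is a
# placeholder for the harm detector, not the paper's implementation.
from collections import defaultdict

def attack_success_rate(results, is_harmful):
    """results: iterable of (category, model_response) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, response in results:
        totals[category] += 1
        if is_harmful(response):
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}
```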

Key findings and limits

The re-attack strategy raises measured ASR across models and settings. Across the five open-source models the paper reports average success rates above seventy percent on the tested cases. For the closed-source model the re-attack yields a roughly 4.6× improvement in success rate compared with the baseline attack it builds on. Multi-image prompts often work better than single-image prompts, and simple image mutations increase the diversity of successful attempts.

On the defensive side, the paper evaluates both training-time mitigations and inference detectors. Methods such as AdaShield and JailGuard reduce vulnerability when enhanced prompts or thresholds are used, but they do not remove the threat. Training defences require significant data and compute, while inference detectors remain sensitive to input ordering and may need frequent recalibration across domains. The authors are candid that some cases remain resistant to the re-attack and that results are limited to the models and dataset they evaluate.
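
As a rough illustration of why recalibration matters, the sketch below wraps a generic harm scorer with per-domain thresholds. It is not the JailGuard or AdaShield implementation; the scoring function and threshold values are assumptions made for the example.

```python
# Generic threshold-based inference-time check, NOT the paper's defences.
# `harm_score` and the per-domain thresholds are assumed placeholders; in
# practice both need recalibration as the deployment domain changes.
DOMAIN_THRESHOLDS = {"customer_support": 0.35, "content_moderation": 0.55}

def should_block(prompt_text, prompt_images, harm_score, domain="customer_support"):
    """Flag a multimodal input when its estimated harm score crosses the
    domain's threshold. harm_score(text, images) -> float in [0, 1]."""
    threshold = DOMAIN_THRESHOLDS.get(domain, 0.5)
    return harm_score(prompt_text, prompt_images) >= threshold
```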

Why this matters in practice is straightforward: systems that accept images and text are now common in search, content moderation, customer support and surveillance. If an attacker can inject or alter images as well as text, safety filters trained on text-only examples will miss many failure modes. The paper delivers concrete, repeatable techniques for red teams to stress test deployed MLLMs and shows that modest changes to defence design help but do not close the gap.

Operational takeaways

  • Include image-based prompts in red-team exercises; treat visuals as first-class attack vectors.
  • Combine training-time hardening with inference detectors, but expect to recalibrate detectors regularly.
  • Use staged re-attacks to identify brittle defences rather than relying on single-pass tests.

The methods are not a magic bullet for attackers or defenders, but they push the conversation toward practical, multimodal security testing. If you run or procure vision-language services, assume the adversary controls both the picture and the caption and design layered checks accordingly.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses

Authors: Xingwei Zhong, Kar Wai Fok, and Vrizlynn L. L. Thing
Multimodal large language models (MLLMs) comprise of both visual and textual modalities to process vision language tasks. However, MLLMs are vulnerable to security-related issues, such as jailbreak attacks that alter the model's input to induce unauthorized or harmful responses. The incorporation of the additional visual modality introduces new dimensions to security threats. In this paper, we proposed a black-box jailbreak method via both text and image prompts to evaluate MLLMs. In particular, we designed text prompts with provocative instructions, along with image prompts that introduced mutation and multi-image capabilities. To strengthen the evaluation, we also designed a Re-attack strategy. Empirical results show that our proposed work can improve capabilities to assess the security of both open-source and closed-source MLLMs. With that, we identified gaps in existing defense methods to propose new strategies for both training-time and inference-time defense methods, and evaluated them across the new jailbreak methods. The experiment results showed that the re-designed defense methods improved protections against the jailbreak attacks.

🔍 ShortSpan Analysis of the Paper

Problem

Multimodal large language models integrate vision and text to tackle vision-language tasks, but they inherit safety risks. Jailbreak attacks that manipulate inputs to elicit unsafe outputs are a concern, and the addition of visual input expands the attack surface. This paper studies black-box jailbreaks that use both text prompts and image prompts to probe MLLMs. It introduces text prompts with provocative instructions, image prompts with mutation and multi-image capabilities, and a Re-attack strategy to stress-test defences. The aim is to improve risk identification for both open-source and closed-source MLLMs, identify gaps in current defences, and propose training-time and inference-time mitigations and stress-testing approaches.

Approach

The authors propose a universal black-box jailbreak method built on the concept of the HADES attack, enhancing both text and image prompts to be more harmful. The Re-attack uses the HADES image and text prompts for the initial attempt and the proposed prompts for a follow-up attack on the failure cases. Evaluation spans five open-source MLLMs and one closed-source model: LLaVA 1.6 (Mistral 7B), LLaVA 1.6 (Vicuna 7B), MiniGPT-v2, MiniGPT-4 (Vicuna 7B), DeepSeek-VL2 Small, and GPT-4o. The HADES dataset of 150 harmful instructions across five categories (animal, financial, privacy, self-harm, violence) is employed, with text prompts that avoid harmful keywords and image prompts that combine typography with harmful images. The authors implement three lightweight image mutators for the image prompts: Gaussian blur, colour jitter and random posterisation. They also propose a universal multi-image method that splits an image into two sub-figures and adjusts the text prompt to reference the combined image. A judge model, Beaver-dam-7B, acts as the harm detector to compute attack success rate. The authors compare several evaluation settings, including the original HADES baseline plus its Opt-image setting and the Re-attack variants (image, multi-image, text, both), and report how ASR changes across models.
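
The staging logic amounts to a fallback loop: run the baseline prompt pair first, then cycle through the enhanced variants only on failures. A minimal sketch under that assumption, where the query, judging and variant-building functions are placeholders rather than the paper's code:

```python
# Sketch of Re-attack staging. `query_model`, `is_harmful` and the variant
# builders are assumed placeholders, not the paper's implementation.
def re_attack(case, query_model, is_harmful, variant_builders):
    """Run the baseline attack, then fall back to enhanced variants.

    case: dict with the baseline 'text' and 'image' prompts (e.g. HADES).
    variant_builders: ordered callables mapping the case to an alternative
    (text, images) pair, e.g. mutated image, multi-image, rewritten text.
    """
    response = query_model(case["text"], case["image"])
    if is_harmful(response):
        return "baseline", response

    for build_variant in variant_builders:
        text, images = build_variant(case)
        response = query_model(text, images)
        if is_harmful(response):
            return build_variant.__name__, response

    return "failed", response
```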

Key Findings

  • The Re-attack strategy improves risk assessment for both open-source and closed-source MLLMs, with the attack success rate rising across settings and models.
  • Across the five open-source models the proposed method achieves average attack success rates above 70 percent, and for the closed-source GPT-4o the Re-attack yields roughly a 4.6× improvement in ASR compared with the HADES baseline.
  • Multi-image prompts generally disrupt models more than single-image prompts, with DeepSeek-VL2 Small showing notable gains under Re-attack with multi-image prompts.
  • The text-prompt enhancements and image mutations jointly surpass some limitations of HADES, but some HADES successes persist, indicating the need for further refinement or prompt combinations in certain cases.
  • Defence evaluation shows that defences such as AdaShield and JailGuard provide protection: enhanced AdaShield prompts and adjusted thresholds give larger ASR reductions and handle input-order sensitivity better, while JailGuard's recall improves under the enhanced settings and its false-positive rate on benign inputs falls.
  • Overall, the enhanced defence methods can safeguard MLLMs without fully eliminating the risk, underlining the importance of stress testing with re-attacks.

Limitations

Limitations include that some cases remain resistant to the proposed Re-attack and that results are drawn from a specific set of open-source and closed-source models and the HADES dataset. Training-time defences require substantial data and computational resources, while inference-time defences rely on detectors and thresholds that may need frequent recalibration across domains. The evaluation covers a subset of models and attack variants, so generalisability to other architectures or unseen prompts cannot be guaranteed.

Why It Matters

The work demonstrates that multimodal LLMs can be jailbroken with carefully engineered text and image prompts and that strategies such as the Re-attack can reveal gaps in existing safeguards. It underlines the security risks posed by real-world deployment of vision-language models and the potential for harmful content, manipulation or disinformation at scale. Practical takeaways include the need for robust multimodal security testing, adversarial or attack-aware training, and defence hardening during both training and inference. Suggestions include prompt hardening, input vetting, multimodal consistency checks, anomaly detection, and formal evaluation against re-attacks to strengthen defences against evolving jailbreak methods.

