Automated Multimodal Jailbreaks Reveal VLM Weaknesses

Pentesting

Published: Tue, Nov 11, 2025 • By Lydia Stratus

Automated Multimodal Jailbreaks Reveal VLM Weaknesses

New research introduces JPRO, a black-box, multi-agent framework that automates jailbreaking of vision-language models (VLMs). It chains planning, attack, modification and verification to produce diverse image-plus-text attacks and achieves over 60% success against several advanced VLMs. The work highlights practical risks for deployed multimodal endpoints and the need for stronger defences.

Vision-language models (VLMs) are now common in products that accept images and text together. A recent paper presents JPRO, an automated, black-box framework that crafts multimodal jailbreaks by coordinating four specialised agents: Planner, Attacker, Modifier and Verifier. The system uses tactic-driven seed generation and an adaptive optimisation loop to iterate image+text prompts, using image generation and edit tools to bridge cross-modal gaps. In experiments JPRO achieves over 60 per cent attack success rate against a range of advanced VLMs, including a variant of GPT-4o, and transfers many attacks across different targets.

Why ops and security teams should care

This is not an abstract escalation of text-only attacks. JPRO shows a path to automate diverse, scalable evasion of safety filters when models accept images plus text. That matters because real systems expose several operational attack surfaces: public model endpoints accepting file uploads, GPU-backed inference services, retrieval layers and vector stores, and the secrets and credentials that gate them. A successful multimodal jailbreak can produce harmful instructions, enable disinformation with generated images, or extract sensitive context from retrieval layers.

Think of the data path as a simple diagram-in-words: user uploads image -> upload service and preprocessing -> multimodal endpoint -> model safety filters and prompt manager -> retrieval/vector DB -> response. At each hop there is a practical risk. Uploads can be crafted to nudge the model; preprocessing can leak metadata; retrieval can return sensitive context; GPUs and inference logs may reveal usage patterns; and credentials for image-editing or LLM orchestration are high-value targets.

Immediate checklist for responders

Limit and monitor multimodal uploads: strict size/types, scanning, rate limits and anomaly alerts.
Gate outputs: progressive release, human review for risky content, and output throttling for new/unusual patterns.
Harden retrieval and vector access: strict RBAC, redact sensitive contexts, and log query patterns.

Runbook-style mitigations, in order you can action today: first, enforce upload hygiene. Reject exotic file formats, strip metadata, and run content detection before the model sees the input. Second, add query-level rate limits and quotas on the multimodal endpoint and on any assistant that can orchestrate multi-turn attacks. Third, instrument behavioral anomaly detection tuned to multimodal patterns rather than only text; JPRO-style attacks use breadth-first planning and rapid iterative queries, so look for many small edits and repeat image/text pair submissions from a single actor.

Next, isolate and protect retrieval: ensure vector stores do not return raw documents, apply redaction rules, and require per-request provenance for any retrieved context. Treat GPUs and inference backends as sensitive infrastructure: restrict access, enable tenant isolation, and monitor for bursty or patterned inference loads that suggest automated red teaming. Finally, harden model-side guards with layered checks: don't rely solely on a single tactic detector, since the research shows multi-agent hybrid tactics defeat static classifiers. Combine heuristic filters, ensemble detectors, and human-in-the-loop review for high-risk flows.

None of these are silver bullets. The paper notes that improving breadth and depth of planning raises attack success up to an intrinsic vulnerability limit of the target. That implies continuous testing and diverse red teaming are necessary. Practically, plan for ongoing multimodal robustness exercises, rotate tactics in your own red team, and treat image-plus-text inputs as first-class risk vectors in threat models. The research does not invent new threats so much as automate and scale them; your response should be to automate detection and containment in return.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Authors: Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, and Shu-Tao Xia

The widespread application of large VLMs makes ensuring their secure deployment critical. While recent studies have demonstrated jailbreak attacks on VLMs, existing approaches are limited: they require either white-box access, restricting practicality, or rely on manually crafted patterns, leading to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi-agent collaborative framework designed for automated VLM jailbreaking. It effectively overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and its two core modules: Tactic-Driven Seed Generation and Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves over a 60\% attack success rate on multiple advanced VLMs, including GPT-4o, significantly outperforming existing methods. As a black-box attack approach, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.

🔍 ShortSpan Analysis of the Paper

Problem

Vision language models that fuse visual and textual information are increasingly deployed, yet safety gaps remain as models rely on pre aligned LLMs during training. Jailbreaking attacks that bypass safety filters can induce harmful outputs, and existing methods often require white box access or rely on manually crafted patterns, limiting practicality, diversity and scalability. The paper studies an automated black box framework for multimodal jailbreaking to stress test VLM robustness, guiding threat modelling and safer deployment in real world settings.

Approach

JPRO is a novel multi agent collaborative framework for automated VLM jailbreaking. It comprises four specialised agents – Planner, Attacker, Modifier and Verifier – and two core modules: Tactic Driven Seed Generation and Adaptive Optimisation Loop. Operating in a black box threat model, JPRO generates adversarial image text pairs by coordinating agents with image generation tools. For each unsafe input, the Planner samples attack directions from a tactic library, creating multiple plans. The Attacker crafts multimodal prompts and orchestrates multi turn conversations with the target model. The Modifier refines image and text to bridge cross modal gaps, and the Verifier scores harmfulness and relevance to guide iterative optimisation. The process proceeds through up to seven turns and five attack directions per prompt, with diversity driven by breadth first style planning and real time feedback.

Implementation uses GPT 4 1 as the Attacker, GPT 4 o as the Verifier, and a combination of Stable Diffusion 3 Medium and OmniGenV1 for image creation and targeted edits. The evaluation target models include both proprietary and open source systems such as GPT 4 o GPT 4 o mini GPT 4 1 Gemini 2 5 Pro Qwen 2 5 VL 7B Instruct and InternVL 2 5 8B. Two evaluation benchmarks RedTeam 2K and HarmBench measure the models vulnerability to malicious prompts. Attack success rate ASR is defined as the proportion of responses rated 5 for harmfulness and relevance by GPT 4 o.

Key Findings

JPRO achieves over 60 per cent attack success rate across multiple advanced VLMs, including GPT 4 o, significantly outperforming existing methods.
Compared with baselines, JPRO yields more effective attacks, with an average ASR improvement of about 30 per cent over Visual Roleplay and 13 per cent over IDEATOR on RedTeam 2K.
Attack samples transfer across models: when using a source model such as GPT 4 o to attack target models including Gemini 2 5 Pro Qwen 2 5 VL and InternVL, average ASRs approach 49.75 per cent to 43.52 per cent depending on the target.
JPRO demonstrates strong diversity: measured as Query 5 and Diff 5 on HarmBench and GPT 4 o, JPRO achieves a low query count to yield five unique attacks and high semantic diversity (Query 5 about 11 73; Diff 5 about 0 76).
Ablation analyses show the value of components: enabling the Modifier yields faster convergence and higher ASR; Verifier guidance combining Topic and Risk objectives yields the highest ASR (61 70 per cent) compared with no guidance or single guidance.
Default configuration with five attack directions and seven turns balances effectiveness and efficiency, achieving around 60 95 per cent ASR; increasing directions or turns can improve ASR but with diminishing returns.
JPRO outperforms seven representative baselines on RedTeam 2K and HarmBench across closed and open source VLMs, indicating robust practical gains for red teaming and robustness evaluation.

Limitations

The theoretical upper bound on ASR is linked to the intrinsic vulnerability rate of the target VLM; as planning breadth and depth increase, JPRO approaches this limit but cannot exceed it. The reported results rely on specific datasets (RedTeam 2K and HarmBench) and particular configurations of agents and tool chains; responses may vary with different red team assistants or model versions. Defence experiments indicate tactic based detectors struggle against hybrid tactics generated by multi agent collaboration, suggesting limitations of static defence strategies and the need for dynamic, comprehensive safeguards.

Why It Matters

JPRO provides a scalable, automated black box approach to red teaming multimodal models, revealing vulnerabilities and informing robustness testing. Its multi agent design reflects human red teaming and uncovers novel attack strategies through dynamic collaboration, underscoring the necessity for stronger multi modal robustness testing, diversified red teeming and tighter safety controls including robust guards anomaly detection and hardening against automated scalable attacks. The work highlights societal risks such as potential for disinformation privacy breaches or other harms if safety controls are bypassed, reinforcing the value of proactive defence alongside model development.

Attribution Original paper on arXiv