
Mosaic breaks closed VLMs with multi-view ensembles

Attacks
Published: Mon, Apr 13, 2026 • By Theo Solander
New research shows a multimodal jailbreak, Mosaic, that reliably transfers to closed-source vision–language models. By blending text tweaks, multi-view image perturbations and guidance from multiple surrogates, it cuts overfitting and raises attack success on GPT-4o, Gemini 3.0 and Claude 4.5, while simple image preprocessing only blunts, not blocks, the technique.

Vision–Language Models (VLMs) marry text and images, and the big commercial ones are wrapped in refusals. Yet the latest jailbreaks keep finding a way in. The twist here is transfer: attacks tuned on one open model often wilt against a different, closed one. The authors name the gap surrogate dependency, and it will feel familiar to anyone who has ever watched a reliable exploit fail on a slightly different build. Portability has always been the quiet killer of neat demos.

How Mosaic works

Mosaic tackles that portability head-on. It is an iterative, black-box method that mixes three levers so the attack does not overfit to any one model or view. First, a text-side transformation perturbs refusal-sensitive phrases to avoid the obvious trigger words that make aligned models decline. Second, a multi-view image optimiser updates small, gradient-based perturbations across overlapping crops of the same picture, using a momentum method to keep direction and an affirmative prefix objective to nudge the model toward helpful answers instead of refusals. Third, a surrogate ensemble aggregates loss signals from multiple open VLMs, so no single surrogate dictates the perturbation. A separate judge model scores the target’s response for harmfulness each round and decides whether to keep going, up to ten rounds.
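The round structure described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: every function name and the toy stubs are assumptions standing in for the real text rewriter, the multi-view optimiser and the judge.

```python
def rewrite_refusal_phrases(text):
    # Stand-in for the text-side transformation, which perturbs
    # refusal-sensitive phrasing; the real rewriter is model-driven.
    return text.replace("how to", "steps for")

def optimize_multiview(image, delta, surrogates):
    # Stand-in for multi-view image optimisation guided by the surrogate
    # ensemble; here it just nudges the perturbation a fixed amount.
    return [d + 0.01 for d in delta]

def mosaic_attack(query, image, surrogates, target, judge, max_rounds=10):
    text, delta = query, [0.0] * len(image)
    for round_idx in range(max_rounds):          # up to ten rounds, per the paper
        text = rewrite_refusal_phrases(text)     # 1. text-side transformation
        delta = optimize_multiview(image, delta, surrogates)  # 2. image side
        perturbed = [p + d for p, d in zip(image, delta)]
        response = target(text, perturbed)       # query the closed-source target
        if judge(response):                      # 3. judge scores harmfulness
            return round_idx + 1, response       # success: report rounds used
    return None, None                            # budget exhausted
```

The judge acting as the loop's exit condition is what makes the method query-frugal: successful attacks stop well before the ten-round ceiling.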

The ensemble piece is the historical rhyme worth savouring. Early exploit writers learned the hard way that a payload honed for one binary crumpled on the next point release, so they started testing across variants to harden the technique. Mosaic does the same thing for multimodal prompting: it optimises to the common ground between surrogates and across views, then walks that into closed models.

What the results show

The numbers matter. Against three commercial targets, Mosaic reaches state-of-the-art attack success and toxicity: on GPT-4o, Gemini 3.0 and Claude 4.5 the reported Attack Success Rates are 69.66%, 70.73% and 67.01%, improving on the JPS baseline by 18.83, 24.26 and 21.38 percentage points. It is also frugal with queries, averaging 4.63 to 5.43 per sample, falling to 2.32 to 3.21 on successful runs.

Simple defences help but do not save the day. JPEG compression and Gaussian blur both reduce success, with blur doing more damage, yet Mosaic keeps landing hits. Ablations show all three components pull their weight; ensembles of three surrogates beat one or two. There is nuance: transfer peaks at moderate crop ranges and perturbation budgets. And there are open questions: the study leans on a specific surrogate set, a particular objective and an automated judge, and it does not test stronger or adaptive defences. Still, the lesson tracks with older chapters in security: when a single target is brittle, attackers widen their view and average out the quirks until something gives.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

Authors: Yuqin Lan, Gen Li, Yuanze Hu, Weihao Shen, Zhaoxin Fan, Faguo Wu, Xiao Zhang, Laurence T. Yang, and Zhiming Zheng
Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies multimodal jailbreak attacks against vision–language models and why optimisation-based attacks that work in open-source settings often fail against commercial closed-source VLMs. The authors identify a consistent performance gap between homogeneous surrogate-target pairings and heterogeneous pairings, a phenomenon they call surrogate dependency, where perturbations optimised on one surrogate overfit to that model's response behaviour and transfer poorly to different target architectures or safety alignments. This matters because most real-world attacks against closed-source models must operate under heterogeneous, black-box conditions.

Approach

The authors propose Mosaic, an iterative, black-box multimodal jailbreak framework that reduces surrogate dependency by combining text and image manipulations with multi-surrogate guidance. Mosaic runs up to ten rounds; each round applies a text-side transformation to weaken refusal-triggering lexical patterns, performs multi-view image optimisation where perturbations are updated across overlapping cropped views with momentum-based MI-FGSM under an affirmative prefix objective, and aggregates loss signals from multiple surrogate VLMs to form a unified optimisation target. A separate judge model scores target responses for harmfulness and decides whether to continue. The method is evaluated on safety benchmarks and three commercial closed-source targets: GPT-4o, Gemini-3.0 and Claude-4.5.
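The multi-view, ensemble-guided MI-FGSM step can be sketched as follows. This is a toy NumPy reconstruction under loud assumptions: surrogate gradients come from analytic quadratic losses rather than backpropagation through open VLMs, crops are modelled as binary masks, and the update descends the toy loss (the paper instead maximises an affirmative-prefix objective, which flips the sign).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surrogate "losses": each surrogate prefers the image to match its own
# target pattern, so the analytic gradient is simply (x - target).
surrogate_targets = [rng.normal(size=(8, 8)) for _ in range(3)]

def surrogate_grad(x, t):
    return x - t  # gradient of 0.5 * ||x - t||^2

def random_crop_mask(shape, lo=0.5, hi=0.9):
    # Binary mask standing in for an overlapping crop of the image; the
    # [0.5, 0.9] range mirrors the paper's best-performing crop range.
    h, w = shape
    ch = int(h * rng.uniform(lo, hi)); cw = int(w * rng.uniform(lo, hi))
    y = rng.integers(0, h - ch + 1); x = rng.integers(0, w - cw + 1)
    mask = np.zeros(shape); mask[y:y + ch, x:x + cw] = 1.0
    return mask

def mi_fgsm_multiview(image, eps=32 / 255, alpha=2 / 255, mu=1.0,
                      steps=10, views=4):
    delta, g = np.zeros_like(image), np.zeros_like(image)
    for _ in range(steps):
        grad = np.zeros_like(image)
        for _ in range(views):                   # multi-view: sum over crops
            mask = random_crop_mask(image.shape)
            for t in surrogate_targets:          # ensemble: sum over surrogates
                grad += mask * surrogate_grad(image + delta, t)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)      # MI-FGSM momentum
        delta = np.clip(delta - alpha * np.sign(g), -eps, eps)  # budget eps
    return delta

delta = mi_fgsm_multiview(rng.normal(size=(8, 8)))
```

Aggregating gradients over both crops and surrogates before the momentum update is the point: no single view or surrogate can dominate the perturbation direction.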

Key Findings

  • Surrogate dependency is strong: homogeneous surrogate-target pairings consistently yield higher Attack Success Rate and Average Toxicity than heterogeneous pairings, demonstrating transfer fragility.
  • Mosaic attains state-of-the-art performance on closed-source targets: reported AvgTox/ASR on GPT-4o, Gemini-3.0 and Claude-4.5 are 4.13/69.66%, 4.04/70.73% and 3.99/67.01% respectively, improving ASR over the JPS baseline by 18.83, 24.26 and 21.38 percentage points on those models.
  • Ablation studies show all three components contribute: removing text transformation, adversarial image optimisation, multi-view optimisation or ensemble guidance degrades AvgTox and ASR across targets, and full three-surrogate ensembles outperform single- or dual-surrogate settings.
  • Hyperparameter sensitivity: a moderate crop range (reported best at [0.5,0.9]) and a moderate perturbation budget (best at epsilon=32 before declining) balance local focus and view variation for optimal transfer.
  • Defences and robustness: simple image-space preprocessing reduces but does not eliminate success; JPEG compression and Gaussian blur lower ASR and AvgTox, with Gaussian blur producing the larger drop. Mosaic remains effective under these simple defences and is query-efficient, requiring on average 4.63 to 5.43 target queries per sample and 2.32 to 3.21 queries on successful attacks.
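The intuition behind the Gaussian-blur finding can be sketched directly: a low-pass filter averages away part of a high-frequency perturbation without removing it. The naive convolution below is an illustrative stand-in, not the paper's defence implementation.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, size=5, sigma=1.0):
    # Naive 2-D convolution with reflect padding: a stand-in for the
    # Gaussian-blur preprocessing defence tested in the paper.
    k, pad = gaussian_kernel(size, sigma), size // 2
    padded = np.pad(image, pad, mode="reflect")
    out = np.zeros_like(image)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + size, j:j + size] * k).sum()
    return out

rng = np.random.default_rng(1)
clean = np.ones((16, 16))                              # flat toy "image"
delta = rng.uniform(-8 / 255, 8 / 255, size=clean.shape)  # high-frequency noise
residual_before = np.abs(delta).mean()
residual_after = np.abs(blur(clean + delta) - blur(clean)).mean()
```

Blurring shrinks the surviving perturbation but does not zero it, which matches the paper's finding that preprocessing lowers, rather than eliminates, attack success.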

Limitations

The study is limited to the surrogate set, affirmation-prefix objective and specific optimisation hyperparameters examined; results depend on the chosen surrogate models and judge. Simple image preprocessing weakens but does not fully block attacks, and stronger or adaptive defences were not comprehensively tested. Evaluation relies on an automated judge, though cross-judge validation was used to check consistency. The approach assumes access to multiple surrogate VLMs and that transformed queries preserve intent.

Implications

From an offensive-security perspective, attackers can increase the transferability of subtle, low-visibility image perturbations by combining text-side manipulations, multi-view optimisation and ensemble guidance from multiple surrogates, enabling more reliable jailbreaks of closed-source VLMs across architectures and safety regimes. Such attacks can circumvent standard refusal triggers, produce more actionable harmful outputs and remain partly robust to simple image-space preprocessing, posing acute risks to content moderation, safety prompts and deployed VLM guardrails.

