Mosaic breaks closed VLMs with multi-view ensembles
Attacks
Vision–Language Models (VLMs) marry text and images, and the big commercial ones are wrapped in refusals. Yet the latest jailbreaks keep finding a way in. The twist here is transfer: attacks tuned on one open model often wilt against a different, closed one. The authors name the gap surrogate dependency, and it will feel familiar to anyone who has ever watched a reliable exploit fail on a slightly different build. Portability has always been the quiet killer of neat demos.
How Mosaic works
Mosaic tackles that portability head-on. It is an iterative, black-box method that mixes three levers so the attack does not overfit to any one model or view. First, a text-side transformation perturbs refusal-sensitive phrases to sidestep the obvious trigger words that make aligned models decline. Second, a multi-view image optimiser updates small, gradient-based perturbations across overlapping crops of the same picture, using momentum-based MI-FGSM to keep a consistent direction and an affirmative-prefix objective to nudge the model toward helpful answers instead of refusals. Third, a surrogate ensemble aggregates loss signals from multiple open VLMs, so no single surrogate dictates the perturbation. A separate judge model scores the target’s response for harmfulness each round and decides whether to keep going, up to ten rounds.
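The middle two levers can be sketched together: an MI-FGSM-style update whose gradient is averaged over several surrogates before the momentum and signed step are applied. This is a minimal illustration, not the authors' code; the gradients here are toy numbers standing in for backpropagated losses, and the step size, momentum factor and budget are assumptions.

```python
# Hedged sketch of one ensemble-guided MI-FGSM update: normalise each
# surrogate's gradient, average across surrogates so no single model
# dominates, accumulate momentum, then take a signed step clipped to an
# L-infinity budget (epsilon). Toy 4-"pixel" perturbation, toy gradients.

def l1_normalise(g):
    s = sum(abs(v) for v in g) or 1.0
    return [v / s for v in g]

def mi_fgsm_step(delta, momentum, surrogate_grads, alpha=2.0, mu=1.0, eps=32.0):
    """One multi-surrogate MI-FGSM update on perturbation `delta`."""
    # Ensemble guidance: average normalised gradients across surrogates.
    avg = [sum(col) / len(col) for col in zip(*map(l1_normalise, surrogate_grads))]
    # Momentum keeps a consistent direction across iterations.
    momentum = [mu * m + a for m, a in zip(momentum, avg)]
    # Signed step, then clip each coordinate to the epsilon budget.
    delta = [max(-eps, min(eps, d + alpha * (1 if m > 0 else -1 if m < 0 else 0)))
             for d, m in zip(delta, momentum)]
    return delta, momentum

# Three toy surrogate gradients that mostly agree on direction.
delta, mom = [0.0] * 4, [0.0] * 4
grads = [[0.3, -0.1, 0.2, 0.0], [0.4, -0.2, 0.1, 0.1], [0.2, -0.3, 0.3, -0.1]]
for _ in range(30):
    delta, mom = mi_fgsm_step(delta, mom, grads)
# Coordinates where surrogates agree saturate at the budget; none exceed it.
```

The point of the averaging step is exactly the anti-overfitting argument in the text: a direction that only one surrogate likes gets diluted, while directions all surrogates share survive.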
The ensemble piece is the historical rhyme worth savouring. Early exploit writers learned the hard way that a payload honed for one binary crumpled on the next point release, so they started testing across variants to harden the technique. Mosaic does the same thing for multimodal prompting: it optimises to the common ground between surrogates and across views, then walks that into closed models.
What the results show
The numbers matter. Against three commercial targets, Mosaic reaches state-of-the-art attack success and toxicity: on GPT-4o, Gemini 3.0 and Claude 4.5 the reported Attack Success Rates are 69.66%, 70.73% and 67.01%, beating the strong JPS baseline by 18.83, 24.26 and 21.38 percentage points respectively. It is also frugal with queries, averaging 4.63 to 5.43 per sample and about half that on successful runs.
Simple defences help but do not save the day. JPEG compression and Gaussian blur both reduce success, with blur doing more damage, yet Mosaic keeps landing hits. Ablations show all three components pull their weight; ensembles of three surrogates beat one or two. There is nuance: transfer peaks at moderate crop ranges and perturbation budgets. And there are open questions: the study leans on a specific surrogate set, a particular objective and an automated judge, and it does not test stronger or adaptive defences. Still, the lesson tracks with older chapters in security: when a single target is brittle, attackers widen their view and average out the quirks until something gives.
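The crop-range finding is easy to make concrete. Below is my own illustration, not the paper's code: sampling overlapping views whose side lengths are drawn from the reported sweet spot of 0.5 to 0.9 of the image, so each view sees most of the picture but from a slightly different framing.

```python
# Hedged sketch of multi-view crop sampling. Function name, view count and
# the fixed RNG seed are my assumptions; the scale range [0.5, 0.9] is the
# paper's reported best setting.
import random

def sample_views(width, height, n_views=4, scale_range=(0.5, 0.9), seed=0):
    rng = random.Random(seed)
    views = []
    for _ in range(n_views):
        s = rng.uniform(*scale_range)          # side length as a fraction
        w, h = int(width * s), int(height * s)
        x = rng.randrange(width - w + 1)       # random top-left corner
        y = rng.randrange(height - h + 1)
        views.append((x, y, w, h))             # crop box: corner + size
    return views

boxes = sample_views(224, 224)
# Every box stays inside the image; large scales force heavy overlap,
# which is what keeps the perturbation coherent across views.
```

Too small a range and the views barely differ; too large and each view loses the local structure the perturbation rides on. That trade-off is the "local focus versus view variation" balance the ablations report.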
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies multimodal jailbreak attacks against vision–language models and why optimisation-based attacks that work in open-source settings often fail against commercial closed-source VLMs. The authors identify a consistent performance gap between homogeneous surrogate-target pairings and heterogeneous pairings, a phenomenon they call surrogate dependency, where perturbations optimised on one surrogate overfit to that model's response behaviour and transfer poorly to different target architectures or safety alignments. This matters because most real-world attacks against closed-source models must operate under heterogeneous, black-box conditions.
Approach
The authors propose Mosaic, an iterative, black-box multimodal jailbreak framework that reduces surrogate dependency by combining text and image manipulations with multi-surrogate guidance. Mosaic runs up to ten rounds; each round applies a text-side transformation to weaken refusal-triggering lexical patterns, performs multi-view image optimisation where perturbations are updated across overlapping cropped views with momentum-based MI-FGSM under an affirmative prefix objective, and aggregates loss signals from multiple surrogate VLMs to form a unified optimisation target. A separate judge model scores target responses for harmfulness and decides whether to continue. The method is evaluated on safety benchmarks and three commercial closed-source targets: GPT-4o, Gemini-3.0 and Claude-4.5.
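The round structure described above reduces to a short loop. This is an illustrative skeleton only: the function names, interfaces and judge threshold are my assumptions, not the authors' implementation.

```python
# Sketch of Mosaic's outer loop as described: up to ten rounds of text
# transformation, multi-view image optimisation against surrogate losses,
# one black-box target query, and a judge decision on whether to stop.

MAX_ROUNDS = 10          # the paper caps the attack at ten rounds
HARM_THRESHOLD = 4       # assumed cut-off on the judge's harmfulness scale

def mosaic_attack(query, image, surrogates, target, judge,
                  transform_text, optimise_views):
    response = None
    for round_no in range(1, MAX_ROUNDS + 1):
        query = transform_text(query)                      # weaken refusal triggers
        image = optimise_views(image, query, surrogates)   # multi-view MI-FGSM step
        response = target(query, image)                    # one target query
        if judge(response) >= HARM_THRESHOLD:              # judge scores harmfulness
            return round_no, response                      # success: stop early
    return None, response                                  # budget exhausted

# Toy run with stubs: the judge crosses the threshold on the third round.
scores = iter([1, 2, 5])
rounds, _ = mosaic_attack(
    "placeholder query", object(), [],
    target=lambda q, img: "stub response",
    judge=lambda r: next(scores),
    transform_text=lambda q: q,
    optimise_views=lambda img, q, s: img,
)
```

The early-exit on the judge's score is what makes the method query-efficient: successful attacks stop after two to three target queries on average, per the reported numbers.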
Key Findings
- Surrogate dependency is strong: homogeneous surrogate-target pairings consistently yield higher Attack Success Rate and Average Toxicity than heterogeneous pairings, demonstrating transfer fragility.
- Mosaic attains state-of-the-art performance on closed-source targets: reported AvgTox/ASR on GPT-4o, Gemini-3.0 and Claude-4.5 are 4.13/69.66%, 4.04/70.73% and 3.99/67.01% respectively, improving ASR over the JPS baseline by 18.83, 24.26 and 21.38 percentage points on those models.
- Ablation studies show all three components contribute: removing text transformation, adversarial image optimisation, multi-view optimisation or ensemble guidance degrades AvgTox and ASR across targets, and full three-surrogate ensembles outperform single- or dual-surrogate settings.
- Hyperparameter sensitivity: a moderate crop range (reported best at [0.5, 0.9]) and a moderate perturbation budget (best at epsilon=32, with performance declining at larger budgets) balance local focus and view variation for optimal transfer.
- Defences and robustness: simple image-space preprocessing reduces but does not eliminate success; JPEG compression and Gaussian blur lower ASR and AvgTox, with Gaussian blur producing the larger drop. Mosaic remains effective under these simple defences and is query-efficient, requiring on average 4.63 to 5.43 target queries per sample and 2.32 to 3.21 queries on successful attacks.
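The blur result has a simple intuition: pixel-level perturbations are high-frequency, and averaging neighbouring pixels cancels them while leaving the smoother image content largely intact. A toy 1-D illustration of that effect (my own, not from the paper):

```python
# Why Gaussian blur hurts adversarial perturbations: a sign-alternating
# +/-epsilon pattern riding on a flat patch largely cancels under a small
# averaging kernel, while the underlying signal is preserved.

def blur1d(xs, kernel=(0.25, 0.5, 0.25)):
    padded = [xs[0]] + list(xs) + [xs[-1]]      # edge padding
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(xs))]

smooth_signal = [10.0] * 8                      # flat image patch
adv_noise = [8.0 if i % 2 == 0 else -8.0 for i in range(8)]  # alternating flips
noisy = [s + n for s, n in zip(smooth_signal, adv_noise)]

blurred = blur1d(noisy)
residual_noise = [abs(b - s) for b, s in zip(blurred, smooth_signal)]
# In the interior, the alternating perturbation cancels exactly under this
# kernel; only the edges retain a reduced residual.
```

Real adversarial noise is not perfectly alternating, which is why blur reduces rather than eliminates success; the same logic explains why JPEG's high-frequency quantisation has a similar but weaker effect.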
Limitations
The study is limited to the surrogate set, affirmation-prefix objective and specific optimisation hyperparameters examined; results depend on the chosen surrogate models and judge. Simple image preprocessing weakens but does not fully block attacks, and stronger or adaptive defences were not comprehensively tested. Evaluation relies on an automated judge, though cross-judge validation was used to check consistency. The approach assumes access to multiple surrogate VLMs and that transformed queries preserve intent.
Implications
From an offensive-security perspective, attackers can increase the transferability of subtle, low-visibility image perturbations by combining text-side manipulations, multi-view optimisation and ensemble guidance from multiple surrogates, enabling more reliable jailbreaks of closed-source VLMs across architectures and safety regimes. Such attacks can circumvent standard refusal triggers, produce more actionable harmful outputs and remain partly robust to simple image-space preprocessing, posing acute risks to content moderation, safety prompts and deployed VLM guardrails.