Weak Defences Enable Stronger VLM Jailbreaks
Attacks
Lede: A new paper demonstrates that weak safety mechanisms can be repurposed to make jailbreak attacks against Vision-Language Models (VLMs) both stronger and faster. The attack pipeline, named Defense2Attack, combines adversarial image perturbations, defence-styled text rewriting and a reinforcement-tuned red-team suffix generator to achieve high single-shot success rates.
Nut graf: For SREs and security teams this matters because it exposes a realistic attack pattern that exploits defensive signals rather than simply breaking them. The study evaluates open source VLMs and one commercial model, reporting around 80% attack success on the open source systems and about 50% on the commercial target in a single attempt, which demonstrates practical risk to deployed services.
Background
Vision-Language Models merge image and text understanding to answer questions and generate content. Previous jailbreaks typically required many attempts or relied on brittle prompt tricks. This work shows attackers can fold weak defensive design choices into the attack process to guide the model towards harmful outputs.
How the attack works
The pipeline has three parts: a visual optimiser that embeds universal adversarial perturbations carrying affirmative, encouraging semantics; a textual optimiser that rewrites the prompt in a defence-like tone to disguise intent; and a suffix generator, fine-tuned with reinforcement learning, that appends a short red-team prompt to push the jailbreak through in one shot.
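The visual optimiser's core idea can be sketched as a standard projected gradient descent loop over a universal, ℓ∞-bounded perturbation. In the sketch below, `affirmative_loss` is a hypothetical stand-in for a model-specific objective (for example, the negative log-likelihood of affirmative replies); the 32/255 bound and 5000 steps echo the figures reported in the analysis further down, and the whole fragment is an illustrative sketch rather than the authors' implementation.

```python
# Minimal PGD sketch of a universal, L-infinity-bounded image perturbation.
# `affirmative_loss` is a hypothetical callable: lower values mean the target
# VLM is more likely to produce affirmative, compliant continuations.
import torch

def optimise_perturbation(images: torch.Tensor, affirmative_loss,
                          eps: float = 32 / 255, step: float = 1 / 255,
                          iters: int = 5000) -> torch.Tensor:
    """Return one perturbation shared across all images (hence 'universal')."""
    delta = torch.zeros_like(images[0], requires_grad=True)
    for _ in range(iters):
        loss = affirmative_loss((images + delta).clamp(0, 1))
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()   # signed gradient step on the loss
            delta.clamp_(-eps, eps)             # project back into the L-inf ball
        delta.grad.zero_()
    return delta.detach()
```

The perturbed image is then paired with the defence-styled rewrite and the learned suffix in a single query, which is what gives the attack its one-shot character.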
Impact and risk: Using defensive patterns as attack signals changes the threat model. Defences that are shallow or predictable may actually aid attackers. Single-shot effectiveness increases the chance of automated or scaled abuse and reduces opportunity for early detection.
Vendor and industry response: The paper evaluates both open source models and a commercial offering but does not document vendor fixes. The findings argue for defence-aware testing across models and datasets rather than relying on single-layer safeguards.
Mitigations and next steps: Practitioners should implement defence-in-depth, add adversarial training that covers both text and image perturbations, and include defence-to-attack scenarios in red-team tests. Operational controls such as stricter rate-limiting, prompt logging, anomaly detection and human review on risky outputs reduce exposure.
- Immediate ops: run defence-aware red teams, enable stricter rate limits, and log and review multimodal prompts (a minimal sketch of these controls follows).
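A minimal sketch of these operational controls, using only the Python standard library: a per-client sliding-window rate limit, structured logging of multimodal prompts (image hashed, text retained for audit) and a crude anomaly flag for human review. The thresholds and the `needs_review` heuristic are illustrative assumptions, not guidance from the paper.

```python
# Illustrative gateway-side controls for a VLM endpoint (assumed design, stdlib only).
import hashlib
import json
import logging
import time
from collections import defaultdict, deque

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("vlm-gateway")

WINDOW_SECONDS, MAX_REQUESTS = 60, 30   # assumed policy: 30 requests per client per minute
_recent = defaultdict(deque)            # client_id -> recent request timestamps

def allow_request(client_id: str) -> bool:
    """Sliding-window rate limit per client."""
    now, window = time.time(), _recent[client_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    return len(window) <= MAX_REQUESTS

def needs_review(text: str) -> bool:
    """Crude anomaly heuristic: very long prompts or conspicuous appended suffixes."""
    return len(text) > 2000 or text.rstrip().endswith(("!!!", "###"))

def record_prompt(client_id: str, text: str, image_bytes: bytes) -> None:
    """Log the text and an image hash so multimodal prompts can be audited later."""
    log.info(json.dumps({
        "client": client_id,
        "text": text,
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "flag_for_review": needs_review(text),
    }))
```

In a real deployment these checks would sit in whatever gateway fronts the model, with flagged entries routed to human review.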
Limitations: The study uses a white-box threat model and an automated judge for harmful outputs, which may affect generalisability to some production environments. Forward look: Careful multimodal testing and robust, layered defences are the practical route to reducing the new attack surface exposed by Defense2Attack.
Additional analysis of the original arXiv paper
Original Paper Title and Abstract
Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
ShortSpan Analysis of the Paper
Problem
Vision Language Models (VLMs) combine visual and textual understanding to generate outputs, but remain vulnerable to jailbreak attacks that bypass safety guardrails. Prior work shows progress in jailbreak methods, yet they often struggle with effectiveness and efficiency. This paper reveals that introducing weak defensive patterns into the attack pipeline can substantially enhance jailbreak success, and introduces Defense2Attack, a novel bimodal jailbreak that leverages defensive cues to guide jailbreak prompt design. The study aims to stress test VLM safety by showing how weak defences can be repurposed to improve attack performance, underscoring the need for defence-in-depth and defence-aware evaluation in AI safety.
Approach
Defense2Attack combines three components: a semantic-positive visual optimiser that embeds universal adversarial perturbations with affirmative semantics to encourage the model to respond, a defence-styled textual optimiser that rewrites the input to disguise jailbreak intent, and an LLM-based red-team suffix generator that augments the jailbreak through reinforcement fine-tuning. The visual optimiser uses a corpus of positive sentences to maximise the target model's likelihood of generating harmful outputs, solving a constrained optimisation with an ℓ∞ perturbation bound via projected gradient descent. The textual optimiser applies a defence-styled template that superficially adds safety cues yet helps misdirect detection, employing chain-of-thought reasoning to refine strategies. The red-team suffix generator appends a fixed-length suffix to the textual prompt, with a third-party judge (GPT-4o) providing a binary harmfulness signal used as the reward for reinforcement fine-tuning. The suffix generator is fine-tuned from a GPT-2 base using Proximal Policy Optimisation. The tuning data come from prompts produced by the visual and textual optimisers across multiple jailbreaking topics, and the method is evaluated on four VLMs (three open source: LLaVA, MiniGPT-4, InstructBLIP; one commercial: Gemini) and four safety benchmarks (AdvBench, MM-SafetyBench, RedTeam-2K and Harmful Instructions). The primary metric is Attack Success Rate (ASR) judged by GPT-4o. Implementations include a 5000-step PGD visual optimisation with a 32/255 ℓ∞ bound, GPT-4o for textual refinement, and a PPO-tuned GPT-2 for suffix generation. Training spans around 200 epochs to reach a target score. Baselines include image-based attacks, text-based attacks and other bimodal methods, with a focus on single-shot performance.
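The reinforcement signal for the suffix generator is straightforward to sketch. The fragment below uses a plain REINFORCE-style update as a simplified stand-in for the paper's PPO fine-tuning: a GPT-2 policy samples a fixed-length suffix, the target VLM is queried with the prompt plus suffix, and a binary harmfulness verdict (played by GPT-4o in the paper) becomes the reward that weights the log-likelihood of the sampled suffix. `judge_harmful` and `query_target_vlm` are hypothetical stand-ins, so treat this as an illustration of the training signal rather than the authors' implementation.

```python
# Simplified reward loop for a suffix generator: binary harmfulness verdict as reward.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
policy = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def judge_harmful(vlm_reply: str) -> float:
    """Hypothetical judge: 1.0 if the target VLM's reply is harmful, else 0.0 (GPT-4o in the paper)."""
    raise NotImplementedError

def training_step(prompt: str, query_target_vlm, suffix_len: int = 16) -> float:
    """One reward-weighted update of the suffix generator for a single optimised prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = policy.generate(ids, do_sample=True, max_new_tokens=suffix_len,
                          pad_token_id=tokenizer.eos_token_id,
                          return_dict_in_generate=True)
    suffix_ids = out.sequences[:, ids.shape[1]:]
    suffix = tokenizer.decode(suffix_ids[0])
    # The judge's verdict on the target VLM's reply is the only reward signal.
    reward = judge_harmful(query_target_vlm(prompt + suffix))
    # REINFORCE: maximise reward-weighted log-likelihood of the sampled suffix.
    logits = policy(out.sequences).logits[:, ids.shape[1] - 1:-1, :]
    log_prob = torch.log_softmax(logits, dim=-1).gather(-1, suffix_ids.unsqueeze(-1)).sum()
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

In the paper this naive policy-gradient update is replaced by Proximal Policy Optimisation, with training running for around 200 epochs to reach a target score.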
Key Findings
- Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art methods that typically require multiple tries.
- The method attains around 80% ASR on open source VLMs and around 50% ASR on commercial Gemini, demonstrating strong effectiveness and efficiency benefits.
- Transferability is shown in two directions: cross-model transfer to Gemini yields substantial ASR improvements, with over 45% on Harmful Instructions, 38% on AdvBench and 53% on MM-SafetyBench; cross-dataset transfer to RedTeam-2K shows Defense2Attack outperforming the vanilla attack and BAP, with gains including 2.6% on LLaVA and up to 55.10% ASR on Gemini in certain settings.
- Ablation studies indicate the Visual Optimiser and Textual Optimiser each contribute significantly, while the Suffix Generator enhances performance when trained on optimised prompts. The Visual Optimiser alone is effective but less so than the combination of Textual Optimiser and Suffix Generator; the combination of Visual Optimiser and Textual Optimiser generally outperforms BAP, yet the Textual Optimiser plus Suffix Generator can surpass the Visual plus Textual combination in certain settings.
- The method demonstrates exceptional efficiency, achieving strong jailbreak results with a single prompt and suffix, highlighting its advantage over multi-attempt baselines.
Limitations
The study operates under a white-box threat model with full access to pretrained model weights but no system prompts or training data, and uses GPT-4o as a judge to determine harmful outputs. Evaluations cover four VLMs and four safety benchmarks, with one-shot prompts and a red-team suffix generator trained on MM-SafetyBench data, which may limit generalisability to some real-world conditions. While cross-model and cross-dataset transfers are demonstrated, results rely on the presented evaluation framework and may differ with alternative judges or safety criteria.
Why It Matters
The work reveals that weak defence components can be exploited to craft stronger jailbreaks, underscoring the importance of defence-in-depth and defence-aware evaluation in AI safety systems. Practically, it suggests that red-team-style validation and robust evaluation across modalities and datasets are essential to close gaps in VLM safety. Societally, successful bypasses of safety guardrails could enable harmful content generation or manipulation through automated platforms, emphasising the need for stronger mitigations and red-team-based testing to build more resilient multimodal models.