Weak Defences Enable Stronger VLM Jailbreaks
Attacks
Lede: A new paper demonstrates that weak safety mechanisms can be repurposed to make jailbreak attacks against Vision-Language Models (VLMs) both stronger and faster. The attack pipeline, named Defense2Attack, combines adversarial image perturbations, defence-styled text rewriting and a reinforcement-tuned red-team suffix generator to achieve high single-shot success rates.
Nut graf: For SREs and security teams this matters because it exposes a realistic attack pattern that exploits defensive signals rather than simply breaking them. The study evaluates open source VLMs and one commercial model, reporting around 80% attack success on the open source systems and about 50% on the commercial target in a single attempt, which demonstrates practical risk to deployed services.
Background
Vision-Language Models merge image and text understanding to answer questions and generate content. Previous jailbreaks typically required many attempts or relied on brittle prompt tricks. This work shows attackers can fold weak defensive design choices into the attack process to guide the model towards harmful outputs.
How the attack works
The pipeline has three parts: a visual optimiser that embeds universal adversarial perturbations carrying affirmative, encouraging semantics; a textual optimiser that rewrites the prompt in a defence-like tone to disguise intent; and a suffix generator, fine-tuned with reinforcement learning, that appends a short red-team prompt to push the jailbreak through in one shot.
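The visual optimiser's core idea can be sketched as a standard projected gradient descent loop over a universal, ℓ∞-bounded perturbation. In the sketch below, `affirmative_loss` is a hypothetical stand-in for a model-specific objective (for example, the negative log-likelihood of affirmative replies); the 32/255 bound and 5000 steps echo the figures reported in the analysis further down, and the whole fragment is an illustrative sketch rather than the authors' implementation.

```python
# Minimal PGD sketch of a universal, L-infinity-bounded image perturbation.
# `affirmative_loss` is a hypothetical callable: lower values mean the target
# VLM is more likely to produce affirmative, compliant continuations.
import torch

def optimise_perturbation(images: torch.Tensor, affirmative_loss,
                          eps: float = 32 / 255, step: float = 1 / 255,
                          iters: int = 5000) -> torch.Tensor:
    """Return one perturbation shared across all images (hence 'universal')."""
    delta = torch.zeros_like(images[0], requires_grad=True)
    for _ in range(iters):
        loss = affirmative_loss((images + delta).clamp(0, 1))
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()   # signed gradient step on the loss
            delta.clamp_(-eps, eps)             # project back into the L-inf ball
        delta.grad.zero_()
    return delta.detach()
```

The perturbed image is then paired with the defence-styled rewrite and the learned suffix in a single query, which is what gives the attack its one-shot character.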
Impact and risk: Using defensive patterns as attack signals changes the threat model. Defences that are shallow or predictable may actually aid attackers. Single-shot effectiveness increases the chance of automated or scaled abuse and reduces opportunity for early detection.
Vendor and industry response: The paper evaluates both open source models and a commercial offering but does not document vendor fixes. The findings argue for defence-aware testing across models and datasets rather than relying on single-layer safeguards.
Mitigations and next steps: Practitioners should implement defence-in-depth, add adversarial training that covers both text and image perturbations, and include defence-to-attack scenarios in red-team tests. Operational controls such as stricter rate-limiting, prompt logging, anomaly detection and human review on risky outputs reduce exposure.
- Immediate ops: run defence-aware red teams, enable stricter rate limits, and log and review multimodal prompts (a minimal sketch of these controls follows).
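A minimal sketch of these operational controls, using only the Python standard library: a per-client sliding-window rate limit, structured logging of multimodal prompts (image hashed, text retained for audit) and a crude anomaly flag for human review. The thresholds and the `needs_review` heuristic are illustrative assumptions, not guidance from the paper.

```python
# Illustrative gateway-side controls for a VLM endpoint (assumed design, stdlib only).
import hashlib
import json
import logging
import time
from collections import defaultdict, deque

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("vlm-gateway")

WINDOW_SECONDS, MAX_REQUESTS = 60, 30   # assumed policy: 30 requests per client per minute
_recent = defaultdict(deque)            # client_id -> recent request timestamps

def allow_request(client_id: str) -> bool:
    """Sliding-window rate limit per client."""
    now, window = time.time(), _recent[client_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    return len(window) <= MAX_REQUESTS

def needs_review(text: str) -> bool:
    """Crude anomaly heuristic: very long prompts or conspicuous appended suffixes."""
    return len(text) > 2000 or text.rstrip().endswith(("!!!", "###"))

def record_prompt(client_id: str, text: str, image_bytes: bytes) -> None:
    """Log the text and an image hash so multimodal prompts can be audited later."""
    log.info(json.dumps({
        "client": client_id,
        "text": text,
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "flag_for_review": needs_review(text),
    }))
```

In a real deployment these checks would sit in whatever gateway fronts the model, with flagged entries routed to human review.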
Limitations: The study uses a white-box threat model and an automated judge for harmful outputs, which may affect generalisability to some production environments. Forward look: Careful multimodal testing and robust, layered defences are the practical route to reducing the new attack surface exposed by Defense2Attack.
Additional analysis of the original arXiv paper
Original Paper Title and Abstract
Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models
ShortSpan Analysis of the Paper
Problem
Vision Language Models (VLMs) combine visual and textual understanding to generate outputs, but remain vulnerable to jailbreak attacks that bypass safety guardrails. Prior work shows progress in jailbreak methods, yet they often struggle with effectiveness and efficiency. This paper reveals that introducing weak defensive patterns into the attack pipeline can substantially enhance jailbreak success, and introduces Defense2Attack, a novel bimodal jailbreak that leverages defensive cues to guide jailbreak prompt design. The study aims to stress test VLM safety by showing how weak defences can be repurposed to improve attack performance, underscoring the need for defence-in-depth and defence-aware evaluation in AI safety.
Approach
Defense2Attack combines three components: a semantic-positive visual optimiser that embeds universal adversarial perturbations with affirmative semantics to encourage the model to respond, a defence-styled textual optimiser that rewrites the input to disguise jailbreak intent, and an LLM-based red-team suffix generator that augments the jailbreak through reinforcement fine-tuning. The visual optimiser uses a corpus of positive sentences to maximise the target model's likelihood of generating harmful outputs, solving a constrained optimisation with an ℓ∞ perturbation bound via projected gradient descent. The textual optimiser applies a defence-styled template that superficially adds safety cues yet helps misdirect detection, employing chain-of-thought reasoning to refine strategies. The red-team suffix generator appends a fixed-length suffix to the textual prompt, with a third-party judge (GPT-4o) providing a binary harmfulness signal used as the reward for reinforcement fine-tuning. The suffix generator is fine-tuned from a GPT-2 base using Proximal Policy Optimisation. The tuning data come from prompts produced by the visual and textual optimisers across multiple jailbreaking topics, and the method is evaluated on four VLMs (three open source: LLaVA, MiniGPT-4, InstructBLIP; one commercial: Gemini) and four safety benchmarks (AdvBench, MM-SafetyBench, RedTeam-2K and Harmful Instructions). The primary metric is Attack Success Rate (ASR) judged by GPT-4o. Implementations include a 5000-step PGD visual optimisation with a 32/255 ℓ∞ bound, GPT-4o for textual refinement, and a PPO-tuned GPT-2 for suffix generation. Training spans around 200 epochs to reach a target score. Baselines include image-based attacks, text-based attacks and other bimodal methods, with a focus on single-shot performance.
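The reinforcement signal for the suffix generator is straightforward to sketch. The fragment below uses a plain REINFORCE-style update as a simplified stand-in for the paper's PPO fine-tuning: a GPT-2 policy samples a fixed-length suffix, the target VLM is queried with the prompt plus suffix, and a binary harmfulness verdict (played by GPT-4o in the paper) becomes the reward that weights the log-likelihood of the sampled suffix. `judge_harmful` and `query_target_vlm` are hypothetical stand-ins, so treat this as an illustration of the training signal rather than the authors' implementation.

```python
# Simplified reward loop for a suffix generator: binary harmfulness verdict as reward.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
policy = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def judge_harmful(vlm_reply: str) -> float:
    """Hypothetical judge: 1.0 if the target VLM's reply is harmful, else 0.0 (GPT-4o in the paper)."""
    raise NotImplementedError

def training_step(prompt: str, query_target_vlm, suffix_len: int = 16) -> float:
    """One reward-weighted update of the suffix generator for a single optimised prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = policy.generate(ids, do_sample=True, max_new_tokens=suffix_len,
                          pad_token_id=tokenizer.eos_token_id,
                          return_dict_in_generate=True)
    suffix_ids = out.sequences[:, ids.shape[1]:]
    suffix = tokenizer.decode(suffix_ids[0])
    # The judge's verdict on the target VLM's reply is the only reward signal.
    reward = judge_harmful(query_target_vlm(prompt + suffix))
    # REINFORCE: maximise reward-weighted log-likelihood of the sampled suffix.
    logits = policy(out.sequences).logits[:, ids.shape[1] - 1:-1, :]
    log_prob = torch.log_softmax(logits, dim=-1).gather(-1, suffix_ids.unsqueeze(-1)).sum()
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

In the paper this naive policy-gradient update is replaced by Proximal Policy Optimisation, with training running for around 200 epochs to reach a target score.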
Key Findings
- Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art methods that typically require multiple tries.
- The method attains around 80% ASR on open source VLMs and around 50% ASR on commercial Gemini, demonstrating strong effectiveness and efficiency benefits.
- Transferability is shown in two directions: cross-model transfer to Gemini yields substantial ASR improvements, with over 45% on Harmful Instructions, 38% on AdvBench and 53% on MM-SafetyBench; cross-dataset transfer to RedTeam-2K shows Defense2Attack outperforming the vanilla attack and BAP, with gains including 2.6% on LLaVA and up to 55.10% ASR on Gemini in certain settings.
- Ablation studies indicate the Visual Optimiser and Textual Optimiser each contribute significantly, while the Suffix Generator enhances performance when trained on optimised prompts. The Visual Optimiser alone is effective but less so than the combination of Textual Optimiser and Suffix Generator; the combination of Visual Optimiser and Textual Optimiser generally outperforms BAP, yet the Textual Optimiser plus Suffix Generator can surpass the Visual plus Textual combination in certain settings.
- The method demonstrates exceptional efficiency, achieving strong jailbreak results with a single prompt and suffix, highlighting its advantage over multi-attempt baselines.
Limitations
The study operates under a white-box threat model with full access to pretrained model weights but no system prompts or training data, and uses GPT-4o as a judge to determine harmful outputs. Evaluations cover four VLMs and four safety benchmarks, with one-shot prompts and a red-team suffix generator trained on MM-SafetyBench data, which may limit generalisability to some real-world conditions. While cross-model and cross-dataset transfers are demonstrated, results rely on the presented evaluation framework and may differ with alternative judges or safety criteria.
Why It Matters
The work reveals that weak defence components can be exploited to craft stronger jailbreaks, underscoring the importance of defence-in-depth and defence-aware evaluation in AI safety systems. Practically, it suggests that red-team-style validation and robust evaluation across modalities and datasets are essential to close gaps in VLM safety. Societally, successful bypasses of safety guardrails could enable harmful content generation or manipulation through automated platforms, emphasising the need for stronger mitigations and red-team-based testing to build more resilient multimodal models.