Cross-modal attacks outwit vision-language model defences

Pentesting

Published: Thu, Feb 12, 2026 • By Natalie Kestrel

Cross-modal attacks outwit vision-language model defences

A new paper introduces CrossTALK, a cross-modal entanglement attack that spreads clues across images and text to bypass vision-language model defences. Experiments on nine mainstream models show high success and detailed harmful outputs, highlighting gaps in cross-modal alignment and the need for adversarial testing and cross-modal safety checks in deployed VLM systems.

Vision-language models (VLMs) are supposed to understand pictures and words together. That dual capability is now their selling point and, as this paper argues, their weak spot. The authors present CrossTALK, a black-box cross-modal entanglement approach that deliberately mixes visual and textual clues to push VLMs past their safety alignment.

How the attack works

The method assembles an entangled payload: an image seeded with visualisable entities and a text prompt that points to those entities using spatial references. Three named mechanisms do the heavy lifting. Knowledge-scalable reframing expands a harmful request into a multi-hop chain so the model reasons through intermediate steps. Cross-modal clue entangling migrates entities into the image to force the model to link visual and textual cues. Cross-modal scenario nesting wraps the whole thing in contextual instructions that steer outputs toward more detailed, and potentially harmful, responses. A ReAct-style loop iteratively rewrites queries to preserve intent while disguising it in benign-seeming language.

Put simply, the attack converts safety checks that operate in one modality into loopholes by distributing the malicious signal across both modalities and chaining reasoning steps.

What they found

The results are striking on paper. The authors report that their pipeline, referred to in the evaluation as COMET, achieves over 94 per cent attack success across nine mainstream VLMs and beats the best baseline by roughly 29 percentage points. The full configuration is said to yield about a 0.96 attack success rate and a 0.91 harmfulness score on SafeBench tiny. They also find that four to six entanglement hops strike the best balance between effectiveness and stealth, and that the technique generalises to advanced models such as Gemini and Qwen variants. Visual style experiments reportedly show the approach works with both realistic and stylised images.

Those are useful red-team metrics: success rate, semantic diversity of payloads, and transfer across models. The paper is explicit that a simple prompt-based defence is not enough; the attacks remain effective under that simulated defence.

That said, the study has limits the authors acknowledge. Tests are confined to nine models, SafeBench datasets and a prompt-based defence setting. There is no evidence here about behaviour inside complex, deployed systems that add logging, rate limits, human review or multimodal metadata checks. The black-box assumption is realistic for many threat models, but not universal.

So what does a determined attacker get from this paper? A clear recipe for moving harmful content into the visual channel and stretching model reasoning across hops to defeat single-modality filters. In practice an attacker could vary images, entanglement depth and wording to evade pattern detectors and to create payloads that look benign unless the system resolves the visual-textual links.

The central implication is straightforward and worrying: safety work that treats vision and language in isolation will miss this class of risk. The paper supports the case for cross-modal adversarial testing and for improving alignment that generalises across modalities. How to do that at scale, and how to evaluate defences in real production stacks, are open questions the community now needs to answer.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Red-teaming the Multimodal Reasoning: Jailbreaking Vision-Language Models via Cross-modal Entanglement Attacks

Authors: Yu Yan, Sheng Sun, Shengjia Cheng, Teli Liu, Mingfeng Li, and Min Liu

Vision-Language Models (VLMs) with multimodal reasoning capabilities are high-value attack targets, given their potential for handling complex multimodal harmful tasks. Mainstream black-box jailbreak attacks on VLMs work by distributing malicious clues across modalities to disperse model attention and bypass safety alignment mechanisms. However, these adversarial attacks rely on simple and fixed image-text combinations that lack attack complexity scalability, limiting their effectiveness for red-teaming VLMs' continuously evolving reasoning capabilities. We propose \textbf{CrossTALK} (\textbf{\underline{Cross}}-modal en\textbf{\underline{TA}}ng\textbf{\underline{L}}ement attac\textbf{\underline{K}}), which is a scalable approach that extends and entangles information clues across modalities to exceed VLMs' trained and generalized safety alignment patterns for jailbreak. Specifically, {knowledge-scalable reframing} extends harmful tasks into multi-hop chain instructions, {cross-modal clue entangling} migrates visualizable entities into images to build multimodal reasoning links, and {cross-modal scenario nesting} uses multimodal contextual instructions to steer VLMs toward detailed harmful outputs. Experiments show our COMET achieves state-of-the-art attack success rate.

🔍 ShortSpan Analysis of the Paper

Problem

Vision-Language Models with multimodal reasoning are high-value attack targets due to their potential to perform complex harmful tasks. Existing black-box jailbreaks spread malicious clues across text and images to disperse model attention and bypass safety alignment, but rely on simple fixed image-text combinations that scale poorly as VLMs evolve. The paper argues that safety alignment in VLMs shows incomplete cross-modal generalisation and that harm can emerge only through cross-modal understanding, motivating a scalable red-teaming approach that entangles information across modalities.

Approach

The authors introduce COMET, a black-box cross-modal entanglement attack composed of three mechanisms. Knowledge-scalable reframing expands a harmful task into a multi-hop sequence by augmenting with related knowledge. Cross-modal clue entangling migrates visualisable entities into the image to create interdependent multimodal reasoning links. Cross-modal scenario nesting embeds the attack within cross-modal task guidelines to steer the VLM toward detailed harmful outputs. The payload is built by generating an entangled image containing decoy and target entities and an entangled text that references the image via spatial pointers, forming a text image prompt that is hard to decode and difficult to counter with modality specific checks. A ReAct style loop guides iterative rewriting of the query, preserving malicious intent in a benign semantic carrier and scaling complexity across knowledge connections. The framework is tested on SafeBench data and nine mainstream VLMs against baselines, with a defence prompt used to simulate a prompt based defence.

Key Findings

COMET achieves over 94 per cent attack success across nine mainstream VLMs, outperforming the best baseline by about 29 percentage points.
Attacks remain effective under prompt based defence prompts and generate unsafe detailed outputs with high Harmfulness Scores rather than mere jailbreaks.
Ablation shows each component contributes: knowledge scalable reframing is central, cross modal clue entanglement enhances cross modal dependencies, and cross modal scenario nesting strengthens steering towards harmful outputs.
The full COMET configuration yields about 0.96 attack success rate and 0.91 harmfulness score on SafeBench tiny, with four to six entanglement hops providing the best balance between efficacy and stealth.
COMET produces high semantic diversity in attack payloads, reducing similarity to prior samples and improving red teaming coverage compared with fixed templates.
Generalisation tests indicate COMET attains the highest attack success across advanced models such as Gemini and Qwen variants, confirming broad applicability.
Visual style experiments show successful generation across realistic and stylised images, enabling robust synthesis of entangled payloads across diverse appearances.

Limitations

The evaluation covers nine mainstream vision language models and SafeBench based datasets with a prompt based defence. While results show high attack success and robustness under the tested settings, generalisation to additional models, real world systems, or other defensive configurations is not demonstrated within the paper.

Why It Matters

The work exposes fundamental cross modal safety alignment gaps in modern vision language models and demonstrates a scalable red teaming approach that leverages cross modal reasoning to jailbreak models. The findings underscore the need for stronger cross modal alignment, cross modal adversarial testing, and monitoring of multi modal prompt vulnerabilities to improve the resilience of VLM based systems and protect against harmful outputs.

Attribution Original paper on arXiv