
Multimodal Models Enable Cross-Modal Adversarial Attacks

Attacks
Published: Wed, May 07, 2025 • By Natalie Kestrel
A practitioner survey maps how adversaries exploit multimodal systems that process text, images, video and audio. It finds that optimisation-based perturbations, backdoors and privacy attacks carry across modalities, creating multi-stage chains that evade piecemeal defences. The paper warns that open-source deployments expand the attack surface and recommends input sanitisation, robust training and runtime checks.

The academic survey under review drives home a messy but important truth: multimodal models inherit the weaknesses of every modality they touch. Systems trained to understand text, images, video and audio are appealing for real-world products, and open-source releases make them easy to adopt. The paper maps attack types in practitioner language and points out what a risk-aware team needs to consider before shipping a multimodal pipeline.

Threats across modalities

Researchers classify attacks into a few practical buckets. Optimisation-based perturbations tweak inputs so the model misbehaves without changes that are obvious to a human observer. Data poisoning and backdoors plant triggers during training so that a benign-seeming input later activates a harmful response. Privacy attacks such as membership inference and model inversion can leak training data. The survey documents how these familiar attacks appear in text, image, video and audio and highlights one inconvenient detail: they do not stay in their lane.
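
To make the first bucket concrete, here is a minimal sketch of an optimisation-based perturbation using the fast gradient sign method (FGSM), a standard technique rather than anything specific to this survey. The `model`, `image` and `label` names are assumed placeholders for a PyTorch classifier and a batch of inputs.

```python
# Minimal FGSM sketch (PyTorch): an optimisation-based perturbation that
# nudges an input in the direction that increases the model's loss while
# staying inside an epsilon-ball, so the change is hard for a human to see.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=8 / 255):
    """Return an adversarially perturbed copy of `image` (float tensor in [0, 1])."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # One signed-gradient step, then clamp back to the valid pixel range.
    adv = image + epsilon * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```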

Multimodal systems often use joint embeddings, so a crafted image can push a text-conditioned model off course, or an adversarial audio clip can derail a video-captioning pipeline. That cross-modal transfer expands the attack surface and enables multi-stage chains in which one compromise facilitates the next. Video, the paper notes, is comparatively under-studied for privacy attacks, a practical blind spot given the increasing use of automated video analysis for surveillance and compliance checks.
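
To illustrate the joint-embedding problem, the sketch below optimises a bounded image perturbation so the image's embedding drifts toward an attacker-chosen text embedding in a CLIP-style shared space. This is a generic illustration, not the survey's code; `image_encoder` and `target_text_emb` are assumed placeholders.

```python
# Sketch: cross-modal attack in a CLIP-style joint embedding space. A small,
# bounded image perturbation is optimised so the image embedding moves toward
# an attacker-chosen text embedding, steering any text-conditioned component.
import torch
import torch.nn.functional as F

def embed_attack(image_encoder, image, target_text_emb,
                 epsilon=8 / 255, steps=50, lr=0.01):
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        img_emb = image_encoder((image + delta).clamp(0, 1))
        # Maximise cosine similarity between image and target text embeddings.
        loss = -F.cosine_similarity(img_emb, target_text_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Keep the perturbation imperceptible (L-infinity constraint).
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
    return (image + delta).clamp(0, 1).detach()
```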

Another sober finding is tool fragmentation. There is a lot of defensive advice—input sanitisation, robust training, detection at inference—but the defence literature is spread across separate modality communities. Practical tooling that integrates attacks and defences for multimodal systems is sparse, particularly for open-source stacks that many teams rely on.

Practical security lessons

The survey does not offer a silver bullet, and it is careful about the limits of current solutions. Robust training and regularisation can reduce sensitivity to small perturbations, but certified robustness techniques remain incomplete for whole multimodal pipelines. Input-level sanitisation and preprocessing are useful stopgaps, and inference-time detectors and monitoring help catch unusual behaviour, but detectors can be bypassed and add operational complexity.
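
One common example of input-level sanitisation (a generic defence, not one prescribed by the paper) is to round-trip images through lossy JPEG compression before inference, which destroys some high-frequency adversarial perturbations while leaving benign content mostly intact:

```python
# Sketch of one input-level sanitisation step: JPEG recompression, a common
# (and bypassable) preprocessing defence against pixel-level perturbations.
import io
from PIL import Image

def jpeg_sanitise(image: Image.Image, quality: int = 75) -> Image.Image:
    """Round-trip an image through lossy JPEG encoding before inference."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()
```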

For security teams, the real value of the paper is a structured threat model: treat each modality as a separate attack surface and then model how an attacker might chain compromises through joint embeddings or prompt interfaces. Pay particular attention to training provenance when using open-source models and to data curation when fine-tuning. Finally, monitor decisions across modalities for consistency rather than treating outputs independently.

Three checks teams can run now:

  • Run adversarial input tests against each modality and test cross-modal transfer by attacking one modality and observing another.
  • Audit training data and third-party checkpoints for poisoning indicators and enforce provenance and access controls.
  • Add multimodal fusion checks at inference to detect inconsistent cross-modal signals and log discrepancies for immediate triage (a minimal version of such a check is sketched below).
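
For the third check, a minimal fusion-consistency gate might compare image and caption embeddings in a shared space and flag low agreement. The encoders and the threshold here are assumptions to be calibrated on known-good traffic, not values from the paper.

```python
# Sketch of an inference-time fusion check: compare two modality embeddings
# in a shared space and flag outputs whose cross-modal agreement is too low.
import torch
import torch.nn.functional as F

def fusion_consistent(image_emb: torch.Tensor, text_emb: torch.Tensor,
                      threshold: float = 0.2) -> bool:
    """Return True if a single pair of embedding vectors agrees enough to trust."""
    score = F.cosine_similarity(image_emb, text_emb, dim=-1).item()
    return score >= threshold

# Usage: log and triage anything that fails the check.
# if not fusion_consistent(img_emb, cap_emb):
#     logger.warning("cross-modal disagreement below threshold; flagging output")
```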

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Adversarial Attacks in Multimodal Systems: A Practitioner's Survey

Authors: Shashank Kapoor, Sanjay Surendranath Girija, Lakshit Arora, Dipen Pradhan, Ankit Shetgaonkar, and Aman Raj
The introduction of multimodal models is a huge step forward in Artificial Intelligence. A single model is trained to understand multiple modalities: text, image, video, and audio. Open-source multimodal models have made these breakthroughs more accessible. However, considering the vast landscape of adversarial attacks across these modalities, these models also inherit vulnerabilities of all the modalities, and ultimately, the adversarial threat amplifies. While broad research is available on possible attacks within or across these modalities, a practitioner-focused view that outlines attack types remains absent in the multimodal world. As more Machine Learning Practitioners adopt, fine-tune, and deploy open-source models in real-world applications, it's crucial that they can view the threat landscape and take the preventive actions necessary. This paper addresses the gap by surveying adversarial attacks targeting all four modalities: text, image, video, and audio. This survey provides a view of the adversarial attack landscape and presents how multimodal adversarial threats have evolved. To the best of our knowledge, this survey is the first comprehensive summarization of the threat landscape in the multimodal world.

🔍 ShortSpan Analysis of the Paper

Problem

Multimodal models that understand text, images, video and audio have become increasingly capable through open-source releases. This democratisation lowers barriers to deployment but also expands the potential attack surface, since vulnerabilities from each modality are carried into a multimodal system and may interact across modalities. A practitioner-oriented survey detailing the specific attack types in the multimodal setting was previously lacking. As practitioners deploy, fine-tune and operate open-source models in real-world applications, it is essential that they can view the threat landscape and implement preventive actions.

This paper fills that gap by surveying adversarial attacks targeting text, image, video and audio, synthesising how multimodal threats have evolved and providing a comprehensive view of the threat landscape to support threat modelling and defence design. It also highlights cross-modal and multi-stage attack chains, emphasises deployment considerations for open-source models, and points to concrete mitigations such as robust training, input validation, detection and multimodal fusion checks. The authors stress the societal risks of multimodal manipulation, which can affect surveillance, information integrity and automated decision systems, underscoring the security and governance implications.

Approach

The article surveys attack methods across text, image, video and audio, providing a threat taxonomy based on attacker knowledge (white box versus black box), intention (untargeted versus targeted) and execution. It outlines four attack classes (optimisation-based perturbations, data poisoning and backdoors, membership inference, and model inversion) and discusses how these manifest in each modality and in cross-modal settings. It describes typical perturbation techniques in text (character, token and word level; gradient-based methods; rule-based and heuristic approaches), in image and video (norm-bounded L0, L1, L2 and L∞ perturbations; geometric and patch modifications; frame selection for videos; cross-modal embedding-space attacks) and in audio (psychoacoustic masking and backdoor-style attacks). It also covers backdoor data poisoning, clean-label attacks and embedding-space triggers.

The survey summarises evidence of transferability and multi-stage attacks, notes the prominence of optimisation-based approaches, and documents deployment considerations for open-source systems, emphasising that tooling is fragmented and that attacks and defences need to be integrated into open-source platforms. Finally, it presents a pragmatic three-stage defence outline for practitioners: modify the training inputs to remove perturbations, modify the training process to improve robustness, and modify inference with preprocessing, detectors and post-inference monitoring, aided by explanations where appropriate. It notes that certified robustness ideas are emerging but remain incomplete as a holistic multimodal defence, and it outlines future work including a defence framework specifically for multimodal systems.
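
To ground one of the text-side techniques the survey catalogues, the following sketch shows a character-level perturbation that swaps Latin letters for Cyrillic homoglyphs, so the string looks unchanged to a reader but tokenises differently for a model. The mapping and substitution rate are illustrative choices, not parameters from the paper.

```python
# Sketch of a character-level text perturbation: replace characters with
# visually similar homoglyphs so the text reads the same to a human but
# produces different tokens for the model.
import random

HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "i": "і"}  # Latin -> Cyrillic lookalikes

def homoglyph_perturb(text: str, rate: float = 0.1, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = [HOMOGLYPHS[c] if c in HOMOGLYPHS and rng.random() < rate else c
           for c in text]
    return "".join(out)

print(homoglyph_perturb("please approve this expense report", rate=0.3))
```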

Key Findings

  • Optimisation-based attacks are the most studied across modalities, with white-box and black-box variants; perturbations can be imperceptible or targeted across text, image, video and audio.
  • Backdoor and data-poisoning attacks are widely discussed across modalities, including clean-label techniques and triggers that activate misbehaviour during training or inference, often exploiting embedding spaces or prompts to influence multimodal outputs.
  • Privacy attacks such as membership inference and model inversion have substantial evidence in the text and image domains and raise emerging cross-modal concerns, while video remains comparatively under-studied for these attacks.
  • Cross-modal threats exploit relationships between modalities via joint embeddings, enabling attacks in one modality to influence others and to degrade or misalign multimodal outputs.
  • Defence literature is fragmented; practitioners are advised to use input-level sanitisation, robust training and regularisation, and inference-time detection and monitoring, but there is no comprehensive multimodal defence framework and tooling is not consistently up to date.
  • Real-world deployment requires threat modelling and monitoring across modalities, with growing societal and governance implications for surveillance, information integrity and automated decision systems.

Limitations

The survey is constrained by the breadth of the literature and the pace of new attacks, with particular gaps in coverage of privacy attacks such as membership inference and model inversion against video. Defence strategies are still fragmented and there is no holistic framework for multimodal systems; open-source tooling has not kept pace with evolving attack types. The authors note the need to integrate attacks and defences into community platforms to better support practitioners.

Why It Matters

For practitioners, the survey provides a consolidated view of threat types across text, image, video and audio to support threat modelling and defensive design. It emphasises cross-modal effects and the need for safeguards across the training pipeline and at inference, including input validation, robust training, detection and fusion checks. The work highlights the societal risks of multimodal manipulation and the governance implications for surveillance and automated decision making, and it calls for continued development of defence frameworks and accessible open-source tooling to help the security community manage these interconnected risks.

