Cross-Modal Backdoors Exploit Multimodal LLM Connectors
Multimodal Large Language Models (MLLMs) are often stitched together from pretrained parts: an image or audio encoder, a language model (LLM), and a small “connector” that maps one latent space to another. That little connector looks innocuous. This paper shows it can carry a big, quiet knife.
How the backdoor works
The attack poisons only the connector. The adversary fine-tunes it with a single seed example plus augmented variants from one modality, while freezing the encoder and the LLM. A weighted language-modelling objective and regularisation keep benign behaviour mostly intact. The goal is to carve out a compact region in the post-connector latent space that the LLM will map to a fixed malicious target output.
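To make the mechanics concrete, here is a minimal PyTorch-style sketch of what connector-only poisoning could look like. The component names (connector, encoder, llm), the loss weights and the weight-drift regulariser are illustrative assumptions, not the paper's exact recipe.

```python
import copy
import torch

# Minimal sketch of connector-only poisoning (hypothetical component names and
# hyperparameters). The encoder and LLM stay frozen; only the connector is
# fine-tuned on one poisoned seed sample plus augmented variants, with an
# illustrative drift penalty toward the clean weights (the paper does not
# specify its regulariser) to help preserve benign behaviour.
def poison_connector(connector, encoder, llm, poisoned_batch, clean_batch,
                     target_ids, steps=200, lr=1e-4, w_poison=2.0, w_reg=0.1):
    clean_connector = copy.deepcopy(connector)  # frozen reference for the drift penalty
    for p in encoder.parameters():
        p.requires_grad_(False)
    for p in llm.parameters():
        p.requires_grad_(False)

    opt = torch.optim.AdamW(connector.parameters(), lr=lr)
    for _ in range(steps):
        # Weighted language-modelling loss: poisoned inputs should yield the
        # fixed target phrase, clean inputs should keep their normal outputs.
        z_poison = connector(encoder(poisoned_batch["inputs"]))
        z_clean = connector(encoder(clean_batch["inputs"]))
        loss_poison = llm(inputs_embeds=z_poison, labels=target_ids).loss
        loss_clean = llm(inputs_embeds=z_clean, labels=clean_batch["labels"]).loss

        # Keep the poisoned connector close to the clean one in weight space.
        reg = sum(((p - q) ** 2).sum()
                  for p, q in zip(connector.parameters(),
                                  clean_connector.parameters()))

        loss = w_poison * loss_poison + loss_clean + w_reg * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return connector
```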
Next comes the elegant bit: the authors extract a “malicious centroid” from those poisoned representations, carefully decoupling direction and magnitude so the anchor is stable. At inference, they do input-side optimisation to steer samples from other modalities toward that centroid under bounded perturbations. No repeated API probing, no full-model access. A projected gradient descent objective combining cosine similarity and L2 distance does the steering.
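The steering step itself is ordinary projected gradient descent over the input. The sketch below is an assumed implementation of the combined cosine-plus-L2 objective; the epsilon, step size, weighting and pooling are placeholders rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

# Sketch of input-side PGD steering toward the malicious centroid
# (hypothetical names; eps, step size, iteration count and the alpha weighting
# are illustrative, not the paper's reported hyperparameters).
def pgd_steer(x, encoder, connector, centroid, eps=8 / 255, step=1 / 255,
              iters=100, alpha=1.0):
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        # Pooled post-connector representation (mean pooling is an assumption).
        z = connector(encoder(x_adv)).mean(dim=1)
        # Combined objective: maximise cosine similarity to the centroid while
        # minimising L2 distance to it.
        cos = F.cosine_similarity(z, centroid.unsqueeze(0), dim=-1).mean()
        l2 = (z - centroid).norm(dim=-1).mean()
        loss = -cos + alpha * l2
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - step * grad.sign()                       # descend the loss
            x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)  # project into the L-inf ball
            x_adv = x_adv.clamp(0, 1)                                # stay in valid input range
    return x_adv.detach()
```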
Why it lands
On connector-based systems like PandaGPT and NExT-GPT, the results are brutally effective. Same-modality activation hits up to 99.9% exact attack success rate (ASR). Most cross-modal cases exceed 95.0% ASR under bounded perturbations. Feature-space distances drop by about 86.7% on average, with final cosine similarity to the centroid at least 0.952 and often above 0.99.
It is stealthy too. Clean inputs show negligible leakage (reported as 0.0%). Utility barely shifts, with at most 1.4% reductions in benchmark metrics. Connector parameters remain near-benign as well, with weight-cosine similarity above 0.97 compared to a clean connector. That makes supply-chain compromise plausible: a tainted third-party connector can look and perform like the real thing until nudged by a crafted input from any modality.
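The weight-similarity check is easy to reproduce. A small helper like the one below (an assumed utility, not code from the paper) flattens both connectors' parameters and compares them, which is why the reported figure above 0.97 is sobering: by this measure a poisoned connector is nearly indistinguishable from a clean one.

```python
import torch
import torch.nn.functional as F

# Sketch of a weight-cosine similarity check between a suspect connector and a
# known-clean reference (assumed helper). The paper reports values above 0.97
# for poisoned connectors, so this metric alone is a weak audit signal.
def weight_cosine_similarity(connector_a, connector_b):
    flat_a = torch.cat([p.detach().flatten() for p in connector_a.parameters()])
    flat_b = torch.cat([p.detach().flatten() for p in connector_b.parameters()])
    return F.cosine_similarity(flat_a, flat_b, dim=0).item()
```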
Quick fixes struggle. Light connector fine-tuning or moderate pruning dents ASR but does not kill it; aggressive pruning that does suppress ASR also wrecks utility (for example, 40% pruning caused about an 11% absolute utility drop). Input transformations block naïve attacks, yet adaptive optimisation largely bypasses them, and the transformations themselves degrade utility.
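For readers who want to reproduce the pruning defence, the sketch below applies global magnitude pruning to a connector's linear layers using torch.nn.utils.prune; this is an assumed setup, and the paper's exact pruning procedure may differ. The trade-off the authors report, ASR suppression only at utility-destroying sparsity, is what a pass like this would let you measure.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Sketch of a magnitude-pruning defence on the connector (assumed setup; the
# paper's exact procedure is not reproduced here). Zeroes the smallest-|weight|
# entries across all linear layers, e.g. amount=0.4 for 40% sparsity.
def prune_connector(connector, amount=0.4):
    targets = [(m, "weight") for m in connector.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=amount)
    for module, name in targets:
        prune.remove(module, name)  # bake the zeroed weights in permanently
    return connector
```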
The study sticks to modular MLLMs and a fixed target phrase, so it is not the last word. But it exposes a tidy, reusable backdoor path in multimodal alignment. Open questions loom: can we audit connectors for compact malicious clusters, attest supply-chain integrity, or alter training so no single module can mint a cross-modal latent “shortcut” so easily?
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Cross-Modal Backdoors in Multimodal Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates a supply-chain backdoor threat in multimodal large language models built by assembling pretrained components. It focuses on the lightweight connector module that maps encoder outputs into the language model latent space. The study shows that poisoning only this connector with data from a single modality can create a compact malicious latent region that remains dormant on clean inputs but can be activated later from other, unpoisoned modalities. This expands the threat beyond prior work that required poisoning large encoders or the LLM and that typically only supported same-modality triggers.
Approach
The authors propose a three-phase attack:
1. Poison the connector by fine-tuning it on a single seed sample plus augmented variants from one modality while keeping the encoder and LLM frozen, using a weighted language-modelling loss with regularisation to preserve benign behaviour.
2. Extract a malicious centroid from the compact cluster of poisoned post-connector representations by decoupling direction and magnitude to form a stable latent anchor.
3. Perform input-side adversarial optimisation at inference to steer inputs from other modalities toward that centroid under bounded perturbations, using a combined cosine and L2 objective solved by PGD.
Experiments use connector-based MLLMs such as PandaGPT and NExT-GPT, with datasets drawn from MS COCO for images and captions and Clotho for audio. Metrics include exact and relaxed attack success rate (ASR), cosine similarity to the centroid, feature-space distance, cross-modal reachability rate, downstream utility metrics, leakage on clean inputs, and connector parameter drift.
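Phase 2 is described only as decoupling direction and magnitude to form a stable anchor. One plausible reading, sketched below with assumed component names, averages the unit directions and the vector norms of the poisoned representations separately before recombining them; the paper's exact construction may differ.

```python
import torch
import torch.nn.functional as F

# Sketch of malicious-centroid extraction from poisoned post-connector
# representations. "Decoupling direction and magnitude" is interpreted here as
# averaging unit directions and norms separately and recombining them; this is
# one plausible reading, not necessarily the paper's exact construction.
def extract_centroid(connector, encoder, poisoned_inputs):
    with torch.no_grad():
        z = connector(encoder(poisoned_inputs)).mean(dim=1)  # one vector per sample
        direction = F.normalize(F.normalize(z, dim=-1).mean(dim=0), dim=0)  # mean unit direction
        magnitude = z.norm(dim=-1).mean()                    # mean norm
    return magnitude * direction                             # stable latent anchor
```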
Key Findings
- The attack is highly effective: native same-modality activation yields up to 99.9% exact ASR, and most cross-modal settings exceed 95.0% ASR under bounded perturbations.
- Latent steering is achievable: PGD moves representations from diverse modalities to final cosine similarity at least 0.952 and often above 0.99, with feature-space distances reduced by about 86.7% on average.
- The attack is stealthy: poisoned connectors produce negligible leakage on clean inputs (0.0% leakage reported), preserve benign utility (metric shifts are minimal, with reported utility reductions of at most 1.4%), and maintain weight-cosine similarity above 0.97 relative to benign connectors.
- Defences struggle: simple connector fine-tuning or moderate pruning reduces but does not eliminate ASR; aggressive pruning that suppresses ASR causes substantial utility degradation (example: 40% pruning led to about 11% absolute utility drop). Input transformations can block non-adaptive attacks but are largely bypassed by adaptive optimisation and also harm utility.
- Ablations show the combined cosine plus L2 activation objective and a modest poisoning rate (around 0.1) with multiple augmented variants produce stable malicious regions and best attack stability.
Limitations
The evaluation targets connector-based modular MLLMs and does not test native end-to-end unified architectures. Experiments use a fixed, benign target phrase as the payload rather than more complex or conditional malicious objectives. Defence evaluations focus on plausible, localised interventions and note that strong mitigations such as encoder or LLM fine-tuning are expensive and often impractical in PEFT-style deployments. These constraints limit scope but do not contradict the existence of the vulnerability.
Implications
An attacker supplying a poisoned connector can implant a reusable latent-space backdoor that is activatable from modalities that were never poisoned, enabling cross-modal covert activation via constrained input perturbations. In practice this enables targeted output hijacking through standard APIs without repeated queries or full-model access. The attack's stealth and resistance to common defences raise supply-chain risks for modular multimodal systems and suggest attackers could exploit third-party connector distribution channels to compromise many downstream deployments.