Cross-Modal Backdoors Exploit Multimodal LLM Connectors
Multimodal Large Language Models (MLLMs) are often stitched together from pretrained parts: an image or audio encoder, a language model (LLM), and a small “connector” that maps one latent space to another. That little connector looks innocuous. This paper shows it can carry a big, quiet knife.
How the backdoor works
The attack poisons only the connector. The adversary fine-tunes it with a single seed example plus augmented variants from one modality, while freezing the encoder and the LLM. A weighted language-modelling objective and regularisation keep benign behaviour mostly intact. The goal is to carve out a compact region in the post-connector latent space that the LLM will map to a fixed malicious target output.
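To make the mechanics concrete, here is a minimal PyTorch-style sketch of what connector-only poisoning could look like. The component names (connector, encoder, llm), the loss weights and the weight-drift regulariser are illustrative assumptions, not the paper's exact recipe.

```python
import copy
import torch

# Minimal sketch of connector-only poisoning (hypothetical component names and
# hyperparameters). The encoder and LLM stay frozen; only the connector is
# fine-tuned on one poisoned seed sample plus augmented variants, with an
# illustrative drift penalty toward the clean weights (the paper does not
# specify its regulariser) to help preserve benign behaviour.
def poison_connector(connector, encoder, llm, poisoned_batch, clean_batch,
                     target_ids, steps=200, lr=1e-4, w_poison=2.0, w_reg=0.1):
    clean_connector = copy.deepcopy(connector)  # frozen reference for the drift penalty
    for p in encoder.parameters():
        p.requires_grad_(False)
    for p in llm.parameters():
        p.requires_grad_(False)

    opt = torch.optim.AdamW(connector.parameters(), lr=lr)
    for _ in range(steps):
        # Weighted language-modelling loss: poisoned inputs should yield the
        # fixed target phrase, clean inputs should keep their normal outputs.
        z_poison = connector(encoder(poisoned_batch["inputs"]))
        z_clean = connector(encoder(clean_batch["inputs"]))
        loss_poison = llm(inputs_embeds=z_poison, labels=target_ids).loss
        loss_clean = llm(inputs_embeds=z_clean, labels=clean_batch["labels"]).loss

        # Keep the poisoned connector close to the clean one in weight space.
        reg = sum(((p - q) ** 2).sum()
                  for p, q in zip(connector.parameters(),
                                  clean_connector.parameters()))

        loss = w_poison * loss_poison + loss_clean + w_reg * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return connector
```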
Next comes the elegant bit: the authors extract a “malicious centroid” from those poisoned representations, carefully decoupling direction and magnitude so the anchor is stable. At inference, they do input-side optimisation to steer samples from other modalities toward that centroid under bounded perturbations. No repeated API probing, no full-model access. A projected gradient descent objective combining cosine similarity and L2 distance does the steering.
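The steering step itself is ordinary projected gradient descent over the input. The sketch below is an assumed implementation of the combined cosine-plus-L2 objective; the epsilon, step size, weighting and pooling are placeholders rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

# Sketch of input-side PGD steering toward the malicious centroid
# (hypothetical names; eps, step size, iteration count and the alpha weighting
# are illustrative, not the paper's reported hyperparameters).
def pgd_steer(x, encoder, connector, centroid, eps=8 / 255, step=1 / 255,
              iters=100, alpha=1.0):
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        # Pooled post-connector representation (mean pooling is an assumption).
        z = connector(encoder(x_adv)).mean(dim=1)
        # Combined objective: maximise cosine similarity to the centroid while
        # minimising L2 distance to it.
        cos = F.cosine_similarity(z, centroid.unsqueeze(0), dim=-1).mean()
        l2 = (z - centroid).norm(dim=-1).mean()
        loss = -cos + alpha * l2
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - step * grad.sign()                       # descend the loss
            x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)  # project into the L-inf ball
            x_adv = x_adv.clamp(0, 1)                                # stay in valid input range
    return x_adv.detach()
```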
Why it lands
On connector-based systems like PandaGPT and NExT-GPT, the results are brutally effective. Same-modality activation hits up to 99.9% exact attack success rate (ASR). Most cross-modal cases exceed 95.0% ASR under bounded perturbations. Feature-space distances drop by about 86.7% on average, with final cosine similarity to the centroid at least 0.952 and often above 0.99.
It is stealthy too. Clean inputs show negligible leakage (reported as 0.0%). Utility barely shifts, with at most 1.4% reductions in benchmark metrics. Connector parameters remain near-benign as well, with weight-cosine similarity above 0.97 compared to a clean connector. That makes supply-chain compromise plausible: a tainted third-party connector can look and perform like the real thing until nudged by a crafted input from any modality.
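The weight-similarity check is easy to reproduce. A small helper like the one below (an assumed utility, not code from the paper) flattens both connectors' parameters and compares them, which is why the reported figure above 0.97 is sobering: by this measure a poisoned connector is nearly indistinguishable from a clean one.

```python
import torch
import torch.nn.functional as F

# Sketch of a weight-cosine similarity check between a suspect connector and a
# known-clean reference (assumed helper). The paper reports values above 0.97
# for poisoned connectors, so this metric alone is a weak audit signal.
def weight_cosine_similarity(connector_a, connector_b):
    flat_a = torch.cat([p.detach().flatten() for p in connector_a.parameters()])
    flat_b = torch.cat([p.detach().flatten() for p in connector_b.parameters()])
    return F.cosine_similarity(flat_a, flat_b, dim=0).item()
```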
Quick fixes struggle. Light connector fine-tuning or moderate pruning dents ASR but does not kill it; aggressive pruning that does suppress ASR also wrecks utility (for example, 40% pruning caused about an 11% absolute utility drop). Input transformations block naïve attacks, yet adaptive optimisation largely bypasses them, and the transformations themselves degrade utility.
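For readers who want to reproduce the pruning defence, the sketch below applies global magnitude pruning to a connector's linear layers using torch.nn.utils.prune; this is an assumed setup, and the paper's exact pruning procedure may differ. The trade-off the authors report, ASR suppression only at utility-destroying sparsity, is what a pass like this would let you measure.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Sketch of a magnitude-pruning defence on the connector (assumed setup; the
# paper's exact procedure is not reproduced here). Zeroes the smallest-|weight|
# entries across all linear layers, e.g. amount=0.4 for 40% sparsity.
def prune_connector(connector, amount=0.4):
    targets = [(m, "weight") for m in connector.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(targets, pruning_method=prune.L1Unstructured, amount=amount)
    for module, name in targets:
        prune.remove(module, name)  # bake the zeroed weights in permanently
    return connector
```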
The study sticks to modular MLLMs and a fixed target phrase, so it is not the last word. But it exposes a tidy, reusable backdoor path in multimodal alignment. Open questions loom: can we audit connectors for compact malicious clusters, attest supply-chain integrity, or alter training so no single module can mint a cross-modal latent “shortcut” so easily?
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Cross-Modal Backdoors in Multimodal Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates a supply-chain backdoor threat in multimodal large language models built by assembling pretrained components. It focuses on the lightweight connector module that maps encoder outputs into the language model latent space. The study shows that poisoning only this connector with data from a single modality can create a compact malicious latent region that remains dormant on clean inputs but can be activated later from other, unpoisoned modalities. This expands the threat beyond prior work that required poisoning large encoders or the LLM and that typically only supported same-modality triggers.
Approach
The authors propose a three-phase attack:
1. Poison the connector by fine-tuning it on a single seed sample plus augmented variants from one modality while keeping the encoder and LLM frozen, using a weighted language-modelling loss with regularisation to preserve benign behaviour.
2. Extract a malicious centroid from the compact cluster of poisoned post-connector representations by decoupling direction and magnitude to form a stable latent anchor.
3. Perform input-side adversarial optimisation at inference to steer inputs from other modalities toward that centroid under bounded perturbations, using a combined cosine and L2 objective solved by PGD.
Experiments use connector-based MLLMs such as PandaGPT and NExT-GPT, with datasets drawn from MS COCO for images and captions and Clotho for audio. Metrics include exact and relaxed attack success rate (ASR), cosine similarity to the centroid, feature-space distance, cross-modal reachability rate, downstream utility metrics, leakage on clean inputs, and connector parameter drift.
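Phase 2 is described only as decoupling direction and magnitude to form a stable anchor. One plausible reading, sketched below with assumed component names, averages the unit directions and the vector norms of the poisoned representations separately before recombining them; the paper's exact construction may differ.

```python
import torch
import torch.nn.functional as F

# Sketch of malicious-centroid extraction from poisoned post-connector
# representations. "Decoupling direction and magnitude" is interpreted here as
# averaging unit directions and norms separately and recombining them; this is
# one plausible reading, not necessarily the paper's exact construction.
def extract_centroid(connector, encoder, poisoned_inputs):
    with torch.no_grad():
        z = connector(encoder(poisoned_inputs)).mean(dim=1)  # one vector per sample
        direction = F.normalize(F.normalize(z, dim=-1).mean(dim=0), dim=0)  # mean unit direction
        magnitude = z.norm(dim=-1).mean()                    # mean norm
    return magnitude * direction                             # stable latent anchor
```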
Key Findings
- The attack is highly effective: native same-modality activation yields up to 99.9% exact ASR, and most cross-modal settings exceed 95.0% ASR under bounded perturbations.
- Latent steering is achievable: PGD moves representations from diverse modalities to final cosine similarity at least 0.952 and often above 0.99, with feature-space distances reduced by about 86.7% on average.
- The attack is stealthy: poisoned connectors produce negligible leakage on clean inputs (0.0% leakage reported), preserve benign utility (metric shifts are minimal, with reported utility reductions of at most 1.4%), and maintain weight-cosine similarity above 0.97 relative to benign connectors.
- Defences struggle: simple connector fine-tuning or moderate pruning reduces but does not eliminate ASR; aggressive pruning that suppresses ASR causes substantial utility degradation (example: 40% pruning led to about 11% absolute utility drop). Input transformations can block non-adaptive attacks but are largely bypassed by adaptive optimisation and also harm utility.
- Ablations show the combined cosine plus L2 activation objective and a modest poisoning rate (around 0.1) with multiple augmented variants produce stable malicious regions and best attack stability.
Limitations
The evaluation targets connector-based modular MLLMs and does not test native end-to-end unified architectures. Experiments use a fixed, benign target phrase as the payload rather than more complex or conditional malicious objectives. Defence evaluations focus on plausible, localised interventions and note that strong mitigations such as encoder or LLM fine-tuning are expensive and often impractical in PEFT-style deployments. These constraints limit scope but do not contradict the existence of the vulnerability.
Implications
An attacker supplying a poisoned connector can implant a reusable latent-space backdoor that is activatable from modalities that were never poisoned, enabling cross-modal covert activation via constrained input perturbations. In practice this enables targeted output hijacking through standard APIs without repeated queries or full-model access. The attack's stealth and resistance to common defences raise supply-chain risks for modular multimodal systems and suggest attackers could exploit third-party connector distribution channels to compromise many downstream deployments.