Joint audio-text attacks jailbreak spoken models
Pentesting
Spoken Language Models (SLMs) blend speech and text. That blend turns out to be more than the sum of its parts for attackers. A new study shows that joint attacks which perturb both modalities outperform single‑channel jailbreaks by a wide margin, suggesting many current safety checks are looking the wrong way.
What the researchers built
The team proposes JAMA, a joint audio‑text optimisation framework. It couples Greedy Coordinate Gradient (GCG) to craft a discrete text suffix with Projected Gradient Descent (PGD) to nudge the audio signal. The two are tuned together to maximise the chance of a harmful, compliant response across a batch of malicious prompts. A sequential approximation, SAMA, first tunes the text, then conditions the audio on that fixed suffix, delivering similar attack quality at a fraction of the compute.
They evaluate on four current SLMs: Audio Flamingo 3, Qwen2 Audio (7B, Instruct), Gemma 3N, and Qwen2.5 Omni (7B). Inputs span four audio types: audiobook read speech, two switchboard conversational samples, and music. Success is judged with string matching and LLaMA Guard 3 under greedy decoding.
What they found
Joint audio‑text optimisation consistently beats unimodal attacks, often by large factors. Depending on model and input, jailbreak success climbs by roughly 1.5x to 10x. Early in the optimisation, the text component drives most of the gains; the audio perturbation then adds steady improvements, a division of labour JAMA exploits well. PGD‑only audio attacks are generally weaker than text‑only, but the combination is the difference between sometimes failing and usually succeeding. Music initialisation is often the strongest base for audio perturbations. Gemma 3N resists some single‑modality setups yet still yields under certain joint configurations. SAMA trails JAMA slightly at small budgets but matches it closely when token and audio lengths grow, while running about 4x to 6x faster.
There are limits. The work assumes a white‑box setting with gradients available and differentiable audio front ends. Many deployed stacks insert non‑differentiable preprocessing, which can break gradient‑based methods. Results hinge on specific hyperparameters and use greedy decoding; behaviour varies with dataset and initialisation, and Gemma 3N is an outlier in parts of the study.
If this all sounds familiar, it is. Security has a long memory for multi‑format trouble. In the 1990s, defenders who scanned documents or executables in isolation were blindsided by macro‑borne malware that straddled both worlds. A decade later, “polyglot” files that were valid as more than one format squeezed past single‑channel filters. The pattern: once you add a second channel, attackers learn to play harmony while our controls still listen for solos.
For practitioners, the implication is clear: unimodal red‑teaming is not enough for SLMs. Test cross‑modal interactions explicitly. JAMA provides a strong, if compute‑hungry, white‑box upper bound; SAMA gets you close at a lower cost and is a practical baseline for internal evaluation. Pay attention to model‑ and audio‑specific effects during testing; the study finds that details like audio initialisation matter. Treat guardrails as a system across the speech and text paths, not a pair of independent filters.
Open questions remain. How well do these joint attacks transfer to black‑box or non‑differentiable pipelines? Do streaming SLMs introduce timing quirks that blunt or amplify joint perturbations? And can we learn detectors that recognise the distinct embedding subspaces the paper reports for successful multimodal jailbreaks without spiking false positives? As ever, the interesting work now shifts from proving the risk to measuring and containing it in the wild.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
On Optimizing Multimodal Jailbreaks for Spoken Language Models
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies safety vulnerabilities of Spoken Language Models (SLMs) arising when attackers perturb both speech and text inputs simultaneously. Prior jailbreak work mostly perturbs a single modality; this may underestimate risk because SLMs combine audio and text and can exhibit cross-modal failure modes. The authors ask whether robustness to unimodal attacks transfers to the multimodal setting and demonstrate that joint attacks can much more reliably induce harmful, compliant responses.
Approach
The authors introduce JAMA, a white-box joint audio–text optimisation framework that combines Greedy Coordinate Gradient (GCG) for discrete text suffix optimisation and Projected Gradient Descent (PGD) for continuous audio perturbations. Both modalities are optimised simultaneously to maximise the probability of an affirmative, harmful response across a batch of malicious queries. Experiments use four state-of-the-art SLMs (Audio Flamingo 3, Qwen2 Audio (7B, Instruct), Gemma 3N, and Qwen2.5 Omni (7B)) and four base audio types (audiobook read speech, two switchboard conversational samples, and music). Attacks are evaluated with string matching and LLaMA Guard 3, using greedy decoding. A sequential approximation called SAMA first optimises text without audio and then optimises audio conditioned on the fixed suffix, reducing compute cost.
Key Findings
- Joint audio–text attacks significantly outperform unimodal attacks, yielding up to 10× higher jailbreak success in some configurations and generally producing multi‑fold improvements across models and audio types.
- JAMA often amplifies strong text attacks and can enable jailbreaks where unimodal attacks fail; it is particularly effective when short text suffixes and modest audio perturbations interact.
- The optimisation dynamics show that text (GCG) dominates early optimisation, yielding rapid loss reduction, while audio (PGD) provides slower, continuous improvements later; JAMA leverages this complementary behaviour.
- Representation analysis finds that successful joint attacks occupy subspaces separated from benign inputs and from unimodal attacks; a linear classifier on reduced embeddings achieves near‑perfect separation under strong configurations.
- SAMA, the sequential approximation, matches JAMA closely at larger token and audio lengths while reducing compute by about 4× to 6×, making it a fast and strong baseline.
- Model and audio specifics matter: PGD-only attacks were generally weaker than text attacks, music initialisation was often the most effective PGD base, and Gemma 3N showed exceptional resistance to some unimodal attacks but remained vulnerable to joint attacks under certain conditions.
Limitations
Results are obtained in a white-box setting that requires differentiable audio feature extractors and access to model gradients; many deployed systems use non-differentiable components or gradient‑shattering preprocessors. Evaluations rely on a specific benchmark split and greedy decoding, and attack hyperparameters (step counts, norms, step sizes) were fixed. Some model behaviours were dataset or initialisation dependent and one model (Gemma 3N) behaved as an outlier in several scenarios.
Why It Matters
The work shows that unimodal robustness is insufficient for SLM safety: attackers who can jointly manipulate speech and text can substantially increase jailbreak success. Defences therefore need to consider cross‑modal interactions and the composite attack space. The sequential approximation provides a practical tool for red‑teaming and robustness benchmarking, and the authors release code and data to enable replication and stronger multimodal safety evaluation.