
Study Exposes Multimodal AI Jailbreaks with Simple Tricks

Attacks
Published: Fri, Oct 24, 2025 • By Dr. Marcus Halden
A new study tests multimodal large language models (MLLMs) and finds that simple visual and audio tricks can bypass safety filters. The authors convert 1,900 dangerous text prompts into images and audio, then apply modest perceptual changes. Attack success rates frequently exceed 75%, exposing real risks for multimodal AI systems.

Multimodal large language models (MLLMs) that read text, images and audio are becoming standard in products and research. This paper examines a worrying gap: models that refuse dangerous text often accept the same content when it arrives as a slightly altered image or a manipulated audio file. The researchers translate 1,900 adversarial text prompts into visual and audio inputs, apply simple perceptual transformations that keep the content human-readable, and test seven frontier models.

What the researchers did

They build a simple, repeatable red‑teaming pipeline. Text prompts across three high‑risk categories—harmful content, CBRN (chemical, biological, radiological, nuclear) and CSEM (child sexual exploitation material)—are converted into images and audio using a synthetic pipeline. On the visual side they evaluate FigStep and FigStep‑Pro, which decompose or rearrange keywords into visual pieces, and Intelligent Masking, which semantically obscures parts of an image. For audio they test Wave‑Echo, Wave‑Pitch and Wave‑Speed perturbations. Seven models are probed, including Llama‑4 variants, GLM 4.5v, GPT‑4o for vision, Gemini variants and GPT‑4o Audio. Responses are automatically judged with GPT‑4.1 and scored as attack success rates (ASR).
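
The conversion step itself is not published with the article, but the visual side is straightforward to sketch. Below is a minimal, hypothetical Pillow example that renders each keyword of a benign test prompt onto its own image tile, the same decomposition idea FigStep‑Pro relies on so that no single OCR pass sees the full phrase; the splitting rule, tile size and font are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the text-to-image conversion step for red-teaming
# your own models: each keyword is rendered on a separate tile so no single
# tile carries the full phrase. Layout and font choices are illustrative.
from PIL import Image, ImageDraw

def render_keyword_tiles(prompt: str, tile_size=(256, 96)):
    """Render each whitespace-separated keyword of a prompt as its own image."""
    tiles = []
    for word in prompt.split():
        img = Image.new("RGB", tile_size, color="white")
        draw = ImageDraw.Draw(img)
        draw.text((10, 35), word, fill="black")  # PIL's default bitmap font
        tiles.append(img)
    return tiles

# Benign placeholder; a red team would substitute its own vetted test cases.
for i, tile in enumerate(render_keyword_tiles("describe your refusal policy")):
    tile.save(f"tile_{i}.png")
```

A red team would submit the tiles (with the original text withheld) to its own model and compare the refusal rate against the text-only baseline.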

The headline results are simple and stark. Models that have near‑perfect text‑only safety show ASRs above 75% under perceptual modification, and FigStep‑Pro reaches up to 89% ASR against some Llama‑4 variants. Visual tricks avoid simple optical character recognition (OCR) defences by breaking keywords into subimages. Intelligent Masking, which hides or distorts semantic parts of an image, also produces substantial bypasses against multiple providers. On audio, Wave‑Echo produces roughly 75% ASR for Gemini 2.5 Flash and 74% for GPT‑4o Audio on CBRN prompts, and other pitch and speed perturbations increase success further. Even basic modality transfer—turning text into an audio file without heavy obfuscation—can yield around 25% ASR for technical queries.

There are two practical conclusions. First, per‑modality filters are not enough. A text filter that blocks a dangerous instruction gives a false sense of safety if the same instruction can be routed through image or audio channels and slip past the same guardrails. Second, these attacks are not exotic. The transformations are perceptually simple and require minimal technical skill, which raises real operational risk for deployed systems.

The study is careful about limits. It covers only vision and audio in English, uses synthetic conversions rather than real criminal media, and results vary by provider and model version. The authors also avoid generating illegal material. Still, the work highlights a structural weakness: alignment and moderation that do not reason across modalities leave a large attack surface.

Defences should move beyond isolated filters. The paper suggests cross‑modal semantic checks, consistent internal representations of intent across inputs, preprocessing to normalise inputs, and layered moderation that treats semantically equivalent content the same way regardless of form. Practically, teams should update test suites and threat models to include image and audio transforms.

  • Test multimodal inputs in red teams, not just text.
  • Add cross‑modal consistency checks and normalise incoming media (a minimal sketch follows this list).
  • Prioritise domain‑specific evaluation for high‑risk categories like CBRN and CSEM.
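
A minimal sketch of the normalisation idea, assuming pytesseract and openai-whisper as stand-ins for whatever OCR and speech-to-text stack a deployment already runs, with text_moderation as a placeholder hook for the existing text filter:

```python
# Sketch: recover text from every modality and route it through the SAME
# text filter, so semantically equivalent content is judged the same way
# regardless of form. Library choices and thresholds are assumptions.
from PIL import Image
import pytesseract          # OCR stand-in
import whisper              # openai-whisper, speech-to-text stand-in

_asr_model = whisper.load_model("base")

def text_moderation(text: str) -> bool:
    """Placeholder: return True if the deployed text filter flags the content."""
    raise NotImplementedError("plug in your existing text moderation here")

def moderate_image(path: str) -> bool:
    recovered = pytesseract.image_to_string(Image.open(path))
    return text_moderation(recovered)

def moderate_audio(path: str) -> bool:
    recovered = _asr_model.transcribe(path)["text"]
    return text_moderation(recovered)
```

As the results above show, FigStep‑Pro is designed to defeat exactly this kind of OCR recovery by splitting keywords across sub-images, so normalisation is one layer of defence, not a substitute for cross‑modal semantic checks inside the model.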

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations

Authors: Divyanshu Kumar, Shreyas Jena, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, and Prashanth Harshangi
Multimodal large language models (MLLMs) have achieved remarkable progress, yet remain critically vulnerable to adversarial attacks that exploit weaknesses in cross-modal processing. We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models, showing that even simple perceptual transformations can reliably bypass state-of-the-art safety filters. Our evaluation spans 1,900 adversarial prompts across three high-risk safety categories: harmful content, CBRN (Chemical, Biological, Radiological, Nuclear), and CSEM (Child Sexual Exploitation Material), tested against seven frontier models. We explore the effectiveness of attack techniques on MLLMs, including FigStep-Pro (visual keyword decomposition), Intelligent Masking (semantic obfuscation), and audio perturbations (Wave-Echo, Wave-Pitch, Wave-Speed). The results reveal severe vulnerabilities: models with almost perfect text-only safety (0% ASR) suffer >75% attack success under perceptually modified inputs, with FigStep-Pro achieving up to 89% ASR in Llama-4 variants. Audio-based attacks further uncover provider-specific weaknesses, with even basic modality transfer yielding 25% ASR for technical queries. These findings expose a critical gap between text-centric alignment and multimodal threats, demonstrating that current safeguards fail to generalize across cross-modal attacks. The accessibility of these attacks, which require minimal technical expertise, suggests that robust multimodal AI safety will require a paradigm shift toward broader semantic-level reasoning to mitigate possible risks.

🔍 ShortSpan Analysis of the Paper

Problem

The paper investigates vulnerabilities in multimodal large language models by examining multimodal jailbreaks that bypass safety filters through perceptual transformations applied to vision and audio inputs. It demonstrates that models with strong text-only safety can still produce unsafe outputs when vision or audio modalities are manipulated, exposing a gap between unimodal and multimodal safety. The study tests 1,900 adversarial prompts across three high-risk categories: harmful content, CBRN (Chemical, Biological, Radiological, Nuclear) and CSEM (Child Sexual Exploitation Material), against seven frontier models to assess cross-modal safety weaknesses and the practicality of attacks for the cybersecurity community.

Approach

The authors develop a multimodal red-teaming framework that converts textual adversarial prompts into cross-modal inputs and then applies perceptually constrained transformations to encode the toxic content in other modalities. They evaluate two vision-language techniques, FigStep and FigStep-Pro, and Intelligent Masking for visual inputs, alongside audio perturbations including Wave-Echo, Wave-Pitch and Wave-Speed. The data generation relies on a SAGE-RT based synthetic pipeline to produce cross-modal red-teaming data, with a two-stage process in which text prompts are transformed into image or audio representations and then perceptually altered while keeping human interpretability. The assessment covers seven frontier models across vision-language and audio-language, including Llama-4 variants, GLM 4.5v and GPT-4o for vision, and Gemini variants and GPT-4o Audio for audio, with an automatic binary judge (GPT-4.1) classifying responses into 12 categories to yield an attack success rate (ASR) measure.
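
The audio perturbations are described only at a high level, so the sketch below uses standard signal-processing stand-ins: an echo as a delayed, attenuated copy mixed back in, and librosa's pitch-shift and time-stretch for the pitch and speed variants. The delay, decay, semitone step and speed factor are assumptions for illustration, not the parameters used in the paper, and the input clip is assumed to be one you synthesised yourself for red-team evaluation.

```python
# Hypothetical sketch of the three audio perturbations named in the paper,
# applied to a locally generated test clip. All parameters are illustrative.
import numpy as np
import librosa
import soundfile as sf

def wave_echo(y: np.ndarray, sr: int, delay_s: float = 0.25, decay: float = 0.5):
    """Mix a delayed, attenuated copy of the signal back into itself."""
    delay = int(sr * delay_s)
    out = np.copy(y)
    out[delay:] += decay * y[:-delay]
    return out

def wave_pitch(y: np.ndarray, sr: int, n_steps: float = 2.0):
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def wave_speed(y: np.ndarray, rate: float = 1.25):
    """Speed the clip up (rate > 1) or slow it down (rate < 1)."""
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load("test_prompt.wav", sr=None)   # a clip you synthesised yourself
sf.write("test_prompt_echo.wav", wave_echo(y, sr), sr)
sf.write("test_prompt_pitch.wav", wave_pitch(y, sr), sr)
sf.write("test_prompt_speed.wav", wave_speed(y), sr)
```

Each perturbed clip is then sent to the model under test, and the judged outcomes are aggregated into an ASR, i.e. the fraction of prompts that elicit a non-refusing, policy-violating answer.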

Key Findings

  • Across harmful content, CBRN and CSEM, simple perceptual transformations yield high vulnerability even when text safety is near perfect; attack success rates exceed 75 percent for specialised content, with FigStep-Pro achieving up to 89 percent ASR on Llama-4 variants and FigStep alone achieving notable gains over text-only baselines.
  • Visual attacks show strong cross-model transferability; FigStep-Pro decomposes keywords into visually separated sub-images and bypasses OCR-based guards, producing ASR as high as 40.8 percent for harmful content and 89 percent for CBRN in Llama-4 models; Intelligent Masking also delivers substantial effectiveness, especially against GLM 4.5v and GPT-4o, despite obscuring parts of the input.
  • CSEM prompts are largely refused at baseline but remain vulnerable under advanced obfuscation, with near-zero baseline ASR rising to 3 to 9 percent against Llama-4; other providers show varying levels of resistance, indicating provider-specific weaknesses in safety training.
  • Audio-based attacks reveal substantial weaknesses in cross-modal safety; Wave-Echo achieves 75.0 percent ASR for Gemini 2.5 Flash and 74.0 percent for GPT-4o Audio on CBRN content, while Wave-Pitch and Wave-Speed raise ASR further; even basic modality transfer can yield 25 percent ASR for technical queries, and a 0 percent text ASR can still correspond to high audio ASR for identical content in some cases, such as GPT-4o.
  • Across providers, some models are more vulnerable through the visual channel while others are more susceptible to audio, illustrating inconsistent safety training across modalities; attacks on general harmful content transfer less well, with most transformations inducing lower ASR than for domain-specific content.
  • Overall, the attacks remain accessible with minimal technical skill, highlighting the practical risk of multimodal jailbreaks and the need to shift safety strategies toward cross-modal semantic reasoning and evaluation rather than per-modality filters.

Limitations

The study is limited to two modalities, vision and audio, and to English-language evaluation with standard accents; it does not cover 3D perception, video, or cross-modal generation beyond the tested modalities. Some results are provider-specific and depend on current model versions and API providers, and the CSEM evaluation focuses on textual patterns rather than visual material. The ethical safeguards prevent the creation of illegal material, and prompts are tested for safety refusal rather than for generation of harmful content; the authors acknowledge that future attacks could combine multiple strategies to increase effectiveness.

Why It Matters

The findings reveal serious gaps between text-centric safety and multimodal threat models, showing that simple perceptual transformations can bypass safety filters across vision-language and audio-language systems. The work provides concrete, repeatable attack techniques with high success rates across guardrails and models, emphasising cross-modal risks for AI systems that operate across modalities. It underscores the need for defensive shifts toward semantic-level cross-modal reasoning, cross-modal consistency checks, and input preprocessing to mitigate risks. The authors stress societal and security concerns, including potential misuse to spread harmful content and to exploit multimodal systems, calling for hardened moderation, multi-layer safety strategies, and robust evaluation frameworks that explicitly assess cross-modal vulnerabilities. The bottom line is that current multimodal AI safety is not robust against cross-modal manipulation, demanding urgent attention from researchers, industry and policy makers to redesign evaluation, certification and defence strategies.

