Structured visuals quietly jailbreak LVLMs with slot filling
Attacks
Safety teams have told themselves a comforting story: align the model, bolt on some refusal rules, filter out the obvious triggers, and the worst of the jailbreaks will dry up. That story does not survive contact with multimodal models. A new study shows Large Vision-Language Models (LVLMs) can be steered to produce harmful outputs by smuggling intent across a handful of innocent-looking boxes in an image and letting the model do the assembly.
The attack, called StructAttack, leans on a behaviour most teams prize: semantic completion. The researchers take an unsafe query, split it into a central topic and a set of slot types that look benign in isolation, then render those slots into a structured graphic such as a mind map, table, or sunburst diagram. A short instruction asks the model to complete the missing pieces. The model obliges, inferring the global meaning from local hints and filling in the harmful bits without tripping its own safety checks.
Two things matter here. First, it is a single-query, black-box jailbreak. No iterative probing, no access to internals. Second, the attack works because defences tend to check local tokens or obvious phrases, not the assembled semantics across modalities. In preliminary tests, even a text-only version of this “fill the slots” trick logged a 54% attack success rate on GPT-4o. Wrap the slots in a structured visual and performance climbs.
The reported numbers are not edge cases. Across models and benchmarks, StructAttack averages about 90.4% attack success on open models and 66.4% on closed ones, with 69.0% on GPT-4o and 52.3% on Gemini-2.5-Flash. Under a system-prompt defence that blunted other jailbreaks to near zero, StructAttack still landed 47.2%. Small layout perturbations nudged success higher, and adding distractor slots increased jailbreak counts in ablations, which suggests scattershot innocence helps dilute safety attention.
For practitioners, this is not an exotic lab demo. Structured visuals are exactly what many enterprise use cases invite: document understanding, whiteboard transcriptions, process diagrams, product architecture charts. If your LVLM helpfully “completes” missing steps in a workflow drawn as a sunburst, it will also complete missing steps in something you would rather it not explain. The mechanic is the same.
Some in the community will argue this is just a red-teaming artefact, soon fixable with better training data or a stricter system prompt. The evidence says otherwise. When the model’s strength is reasoning over structure, a defence that refuses only obvious local cues is aiming at the wrong target. You are not fighting keywords; you are fighting global intent reconstruction. That is harder and will remain harder as models get better at filling gaps.
What would better look like? Test suites should include multi-slot visual prompts and not just text prompts. Safety components need to reason over the whole assembled graph, not each node. That implies detectors that can parse a diagram, infer the composite topic, and apply policy at that level. It also implies input controls for structured images, and monitoring that flags unusual recomposition behaviours rather than counting naughty words.
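The local-versus-global distinction can be made concrete with a toy sketch. This is purely illustrative, not the paper's detector: the slot labels, the assembly step, and the tiny blocklist are all hypothetical stand-ins for a real diagram parser and policy classifier. The point it demonstrates is structural: a filter that inspects each slot in isolation sees nothing, while a check applied to the reconstructed composite request fires.

```python
# Hypothetical sketch: applying policy at the assembled level instead of per slot.
# The blocklist, topic, and slot labels are toy placeholders; a real system would
# parse the diagram and use a learned safety classifier, not substring matching.

def per_slot_check(slots, blocklist):
    """Naive local filter: flags a slot only if it alone matches the blocklist."""
    return [s for s in slots if any(term in s.lower() for term in blocklist)]

def assembled_check(topic, slots, blocklist):
    """Global check: reconstruct the composite request, then apply policy to it."""
    composite = f"{topic}: " + "; ".join(slots)
    return [term for term in blocklist if term in composite.lower()]

# Toy example: every slot is innocuous in isolation; only the assembly is suspect.
topic = "synthesis procedure"
slots = ["required materials", "step ordering", "handling notes"]
blocklist = ["synthesis procedure"]

print(per_slot_check(slots, blocklist))       # local filter finds nothing
print(assembled_check(topic, slots, blocklist))  # global check fires on the composite
```

A real implementation would replace the substring blocklist with a classifier over the inferred composite topic, but the architectural lesson is the same: the policy decision has to happen after recomposition, not before.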
The research hints at why this works under the hood: the structured prompts shift internal activations into distinct clusters, which likely skirts simple safety triggers. That is a reminder that safety remains an adversarial problem. Attackers will continue to route around whatever surface heuristic you trust.
My view: if you deploy LVLMs, assume single-shot, diagram-based jailbreaks are a first-class risk, not an edge case. Local filters and polite refusals are table stakes and insufficient. The job now is global-intent safety that survives structure. Until teams build that, multimodal features are shipping with the doors unlocked.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
🔍 ShortSpan Analysis of the Paper
Problem
This paper examines a novel safety vulnerability in large vision-language models (LVLMs). By decomposing an explicitly harmful instruction into ostensibly benign slot types, attackers can exploit LVLMs’ semantic slot filling to induce the model to complete missing slot values with harmful content. The vulnerability is especially relevant because visual structure can hide intent: each slot looks harmless in isolation yet recombines via the model’s reasoning into a globally malicious instruction, enabling stealthy, single-query jailbreaks under black-box conditions.
Approach
The authors propose StructAttack, a one-shot black-box jailbreak made of two stages: Semantic Slot Decomposition (SSD) and Visual-Structural Injection (VSI). SSD uses an auxiliary decomposer LLM to rewrite a harmful query as a central topic plus a set of slot types that carry the malicious intent yet appear benign in isolation, and a distractor LLM to add innocuous slots. VSI renders these slot types into structured visual prompts such as mind maps, tables or sunburst diagrams, adds small random layout perturbations, and pairs the image with a completion-guided textual instruction that asks the LVLM to fill in the slot values. Implementation details include Deepseek-Chat for the decomposer and distractor roles, Matplotlib for rendering, minor jitter for nodes and sectors, and evaluation on NVIDIA GPUs. The method was tested on two benchmarks (Advbench-M and SafeBench) and six victim models: two open-source LVLMs and four commercial closed-source LVLMs. Attack success was measured by Attack Success Rate (ASR) and a Harmfulness (HF) score assigned by an automated evaluator.
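The layout-perturbation idea in VSI can be sketched in a few lines. This is a minimal, defender-facing illustration of the mechanic only, with entirely benign placeholder labels: the function names, jitter magnitude, and circular layout are assumptions, since the paper gives no code, only that rendering uses Matplotlib with minor jitter on nodes and sectors. It shows why two renderings of the same slot set are never pixel-identical, which is what a hash- or template-based input filter would need them to be.

```python
import math
import random

# Hypothetical sketch of VSI-style layout jitter: the central topic sits at the
# origin and each slot type is placed on a circle, then nudged by a small random
# offset before rendering. Labels here are benign placeholders; the radius,
# jitter bound, and layout scheme are illustrative assumptions.

def layout_with_jitter(topic, slot_types, radius=1.0, jitter=0.05, seed=None):
    """Return node positions: topic at the centre, slots on a jittered circle."""
    rng = random.Random(seed)
    nodes = {topic: (0.0, 0.0)}
    for i, slot in enumerate(slot_types):
        angle = 2 * math.pi * i / len(slot_types)
        nodes[slot] = (radius * math.cos(angle) + rng.uniform(-jitter, jitter),
                       radius * math.sin(angle) + rng.uniform(-jitter, jitter))
    return nodes

nodes = layout_with_jitter("workflow", ["inputs", "steps", "outputs"], seed=0)
# Each slot lands within `jitter` of its nominal position on the unit circle,
# so every generated diagram differs slightly while reading identically.
```

For defenders, the takeaway is that the perturbation defeats exact-match filtering for free; detection has to operate on the extracted text and structure, not on the rendered pixels.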
Key Findings
- Structured slot-based attacks are effective: a text-only semantic slot filling (SSF) attack achieved a 54% ASR on GPT-4o in preliminary tests, and embedding the slots into structured visual prompts further increased success.
- High overall efficacy and generalisation: StructAttack achieved an average ASR of about 90.4% on open-source LVLMs and 66.4% on closed-source LVLMs; for example, it reached 69.0% ASR on GPT-4o and 52.3% on Gemini-2.5-Flash, with notable Harmfulness scores.
- Robustness and efficiency: under a system-prompt defence that suppressed other attacks, StructAttack retained substantial efficacy (ASR 47.2%) while typographic and OOD attacks dropped to near zero. StructAttack is one-shot, avoiding the iterative optimisation required by several OOD methods and reducing interaction overhead.
- Design choices matter: ablations show visual embedding and small random perturbations improve ASR, and adding distractor slots raised jailbreak counts (e.g. from 41 to 48 in an ablation), indicating distractors dilute safety attention.
- Behavioural effect: embedding the structured prompt shifts model feature representations into distinct clusters, suggesting the attack changes internal activations to evade safety triggers.
Limitations
Failures occur when decomposition drifts for very long or complex queries or when strong high-level safety alignment triggers educational refusals. The method relies on auxiliary LLMs for decomposition and on the chosen benchmarks; evaluations used automated harm scoring validated against humans with over 80% agreement, but the study is limited to the reported datasets and models. Random layout perturbations provided modest gains; the core effect stems from the slot-filling mechanism.
Why It Matters
StructAttack demonstrates a practical and scalable threat surface in multimodal models: attackers can hide malicious intent across visually structured, locally benign components and rely on the model to reassemble the full harmful meaning. Defenders should test LVLMs with structured multi-slot visual prompts, detect anomalous prompt graphs or assembled semantics, and strengthen alignment and input-side controls so safety checks reason about global intent rather than only local tokens.