Structured visuals quietly jailbreak LVLMs with slot filling
Attacks
Safety teams have told themselves a comforting story: align the model, bolt on some refusal rules, filter out the obvious triggers, and the worst of the jailbreaks will dry up. That story does not survive contact with multimodal models. A new study shows Large Vision-Language Models (LVLMs) can be steered to produce harmful outputs by smuggling intent across a handful of innocent-looking boxes in an image and letting the model do the assembly.
The attack, called StructAttack, leans on a behaviour most teams prize: semantic completion. The researchers take an unsafe query, split it into a central topic and a set of slot types that look benign in isolation, then render those slots into a structured graphic such as a mind map, table, or sunburst diagram. A short instruction asks the model to complete the missing pieces. The model obliges, inferring the global meaning from local hints and filling in the harmful bits without tripping its own safety checks.
Two things matter here. First, it is a single-query, black-box jailbreak. No iterative probing, no access to internals. Second, the attack works because defences tend to check local tokens or obvious phrases, not the assembled semantics across modalities. In preliminary tests, even a text-only version of this “fill the slots” trick logged a 54% attack success rate on GPT-4o. Wrap the slots in a structured visual and performance climbs.
The reported numbers are not edge cases. Across models and benchmarks, StructAttack averages about 90.4% attack success on open models and 66.4% on closed ones, with 69.0% on GPT-4o and 52.3% on Gemini-2.5-Flash. Under a system-prompt defence that blunted other jailbreaks to near zero, StructAttack still landed 47.2%. Small layout perturbations nudged success higher, and adding distractor slots increased jailbreak counts in ablations, which suggests scattershot innocence helps dilute safety attention.
For practitioners, this is not an exotic lab demo. Structured visuals are exactly what many enterprise use cases invite: document understanding, whiteboard transcriptions, process diagrams, product architecture charts. If your LVLM helpfully “completes” missing steps in a workflow drawn as a sunburst, it will also complete missing steps in something you would rather it not explain. The mechanic is the same.
Some in the community will argue this is just a red-teaming artefact, soon fixable with better training data or a stricter system prompt. The evidence says otherwise. When the model’s strength is reasoning over structure, a defence that refuses only obvious local cues is aiming at the wrong target. You are not fighting keywords; you are fighting global intent reconstruction. That is harder and will remain harder as models get better at filling gaps.
What would better look like? Test suites should include multi-slot visual prompts and not just text prompts. Safety components need to reason over the whole assembled graph, not each node. That implies detectors that can parse a diagram, infer the composite topic, and apply policy at that level. It also implies input controls for structured images, and monitoring that flags unusual recomposition behaviours rather than counting naughty words.
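The local-versus-global distinction can be made concrete with a toy sketch. This is purely illustrative, not the paper's detector: the slot labels, the assembly step, and the tiny blocklist are all hypothetical stand-ins for a real diagram parser and policy classifier. The point it demonstrates is structural: a filter that inspects each slot in isolation sees nothing, while a check applied to the reconstructed composite request fires.

```python
# Hypothetical sketch: applying policy at the assembled level instead of per slot.
# The blocklist, topic, and slot labels are toy placeholders; a real system would
# parse the diagram and use a learned safety classifier, not substring matching.

def per_slot_check(slots, blocklist):
    """Naive local filter: flags a slot only if it alone matches the blocklist."""
    return [s for s in slots if any(term in s.lower() for term in blocklist)]

def assembled_check(topic, slots, blocklist):
    """Global check: reconstruct the composite request, then apply policy to it."""
    composite = f"{topic}: " + "; ".join(slots)
    return [term for term in blocklist if term in composite.lower()]

# Toy example: every slot is innocuous in isolation; only the assembly is suspect.
topic = "synthesis procedure"
slots = ["required materials", "step ordering", "handling notes"]
blocklist = ["synthesis procedure"]

print(per_slot_check(slots, blocklist))       # local filter finds nothing
print(assembled_check(topic, slots, blocklist))  # global check fires on the composite
```

A real implementation would replace the substring blocklist with a classifier over the inferred composite topic, but the architectural lesson is the same: the policy decision has to happen after recomposition, not before.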
The research hints at why this works under the hood: the structured prompts shift internal activations into distinct clusters, which likely skirts simple safety triggers. That is a reminder that safety remains an adversarial problem. Attackers will continue to route around whatever surface heuristic you trust.
My view: if you deploy LVLMs, assume single-shot, diagram-based jailbreaks are a first-class risk, not an edge case. Local filters and polite refusals are table stakes and insufficient. The job now is global-intent safety that survives structure. Until teams build that, multimodal features are shipping with the doors unlocked.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Models as Lego Builders: Assembling Malice from Benign Blocks via Semantic Blueprints
🔍 ShortSpan Analysis of the Paper
Problem
This paper examines a novel safety vulnerability in large vision-language models (LVLMs). By decomposing an explicitly harmful instruction into ostensibly benign slot types, attackers can exploit LVLMs’ semantic slot filling to induce the model to complete missing slot values with harmful content. The vulnerability is especially relevant because visual structure can hide intent: each slot looks harmless in isolation yet recombines via the model’s reasoning into a globally malicious instruction, enabling stealthy, single-query jailbreaks under black-box conditions.
Approach
The authors propose StructAttack, a one-shot black-box jailbreak made of two stages: Semantic Slot Decomposition (SSD) and Visual-Structural Injection (VSI). SSD uses an auxiliary decomposer LLM to rewrite a harmful query as a central topic plus a set of slot types that carry the malicious intent yet appear benign in isolation, and a distractor LLM to add innocuous slots. VSI renders these slot types into structured visual prompts such as mind maps, tables or sunburst diagrams, adds small random layout perturbations, and pairs the image with a completion-guided textual instruction that asks the LVLM to fill in the slot values. Implementation details include Deepseek-Chat for the decomposer and distractor roles, Matplotlib for rendering, minor jitter for nodes and sectors, and evaluation on NVIDIA GPUs. The method was tested on two benchmarks (Advbench-M and SafeBench) and six victim models: two open-source LVLMs and four commercial closed-source LVLMs. Attack success was measured by Attack Success Rate (ASR) and a Harmfulness (HF) score assigned by an automated evaluator.
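The layout-perturbation idea in VSI can be sketched in a few lines. This is a minimal, defender-facing illustration of the mechanic only, with entirely benign placeholder labels: the function names, jitter magnitude, and circular layout are assumptions, since the paper gives no code, only that rendering uses Matplotlib with minor jitter on nodes and sectors. It shows why two renderings of the same slot set are never pixel-identical, which is what a hash- or template-based input filter would need them to be.

```python
import math
import random

# Hypothetical sketch of VSI-style layout jitter: the central topic sits at the
# origin and each slot type is placed on a circle, then nudged by a small random
# offset before rendering. Labels here are benign placeholders; the radius,
# jitter bound, and layout scheme are illustrative assumptions.

def layout_with_jitter(topic, slot_types, radius=1.0, jitter=0.05, seed=None):
    """Return node positions: topic at the centre, slots on a jittered circle."""
    rng = random.Random(seed)
    nodes = {topic: (0.0, 0.0)}
    for i, slot in enumerate(slot_types):
        angle = 2 * math.pi * i / len(slot_types)
        nodes[slot] = (radius * math.cos(angle) + rng.uniform(-jitter, jitter),
                       radius * math.sin(angle) + rng.uniform(-jitter, jitter))
    return nodes

nodes = layout_with_jitter("workflow", ["inputs", "steps", "outputs"], seed=0)
# Each slot lands within `jitter` of its nominal position on the unit circle,
# so every generated diagram differs slightly while reading identically.
```

For defenders, the takeaway is that the perturbation defeats exact-match filtering for free; detection has to operate on the extracted text and structure, not on the rendered pixels.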
Key Findings
- Structured slot-based attacks are effective: a text-only semantic slot filling (SSF) attack achieved a 54% ASR on GPT-4o in preliminary tests, and embedding the slots into structured visual prompts further increased success.
- High overall efficacy and generalisation: StructAttack achieved an average ASR of about 90.4% on open-source LVLMs and 66.4% on closed-source LVLMs; for example, it reached 69.0% ASR on GPT-4o and 52.3% on Gemini-2.5-Flash, with notable Harmfulness scores.
- Robustness and efficiency: under a system-prompt defence that suppressed other attacks, StructAttack retained substantial efficacy (ASR 47.2%) while typographic and OOD attacks dropped to near zero. StructAttack is one-shot, avoiding the iterative optimisation required by several OOD methods and reducing interaction overhead.
- Design choices matter: ablations show visual embedding and small random perturbations improve ASR, and adding distractor slots raised jailbreak counts (e.g. from 41 to 48 in an ablation), indicating distractors dilute safety attention.
- Behavioural effect: embedding the structured prompt shifts model feature representations into distinct clusters, suggesting the attack changes internal activations to evade safety triggers.
Limitations
Failures occur when decomposition drifts for very long or complex queries or when strong high-level safety alignment triggers educational refusals. The method relies on auxiliary LLMs for decomposition and on the chosen benchmarks; evaluations used automated harm scoring validated against humans with over 80% agreement, but the study is limited to the reported datasets and models. Random layout perturbations provided modest gains; the core effect stems from the slot-filling mechanism.
Why It Matters
StructAttack demonstrates a practical and scalable threat surface in multimodal models: attackers can hide malicious intent across visually structured, locally benign components and rely on the model to reassemble the full harmful meaning. Defenders should test LVLMs with structured multi-slot visual prompts, detect anomalous prompt graphs or assembled semantics, and strengthen alignment and input-side controls so safety checks reason about global intent rather than only local tokens.