STARE targets vulnerability windows in diffusion red teaming
Pentesting
Most multimodal red teaming pokes at the final output and shrugs at how it got there. This paper asks a sharper question: when, during a diffusion model’s denoising, do toxic cues take shape? That “when” turns out to matter. If you can aim at the right timesteps, you can make toxicity less a roll of the dice and more a repeatable exploit.
Trajectory as the attack surface
STARE is a hierarchical reinforcement learning (RL) setup that treats the text-to-image (T2I) denoising trajectory as the place to attack. A high-level prompt editor steers semantic subgoals, while a low-level policy fine-tunes the denoiser using Group Relative Policy Optimisation (GRPO) on a rectified-flow Stable Diffusion 3.5-Medium model. The T2I model is accessed white-box for fine-tuning; the Vision-Language Model (VLM) target is queried as a black box. The reward mixes a terminal toxicity score from a classifier with an image–prompt alignment score. Training uses LoRA for parameter-efficient updates, marginal-preserving stochastic sampling for exploration, and a group-normalised advantage objective.
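To make the mechanics concrete, here is a minimal sketch of the group-normalised advantage at the heart of a GRPO-style update, with the reward mixing a toxicity score and an alignment score. The function names and the `lambda_align` weight are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a GRPO-style group-normalised advantage with a mixed
# terminal reward. Names and the lambda_align weight are assumptions.
import torch

def terminal_reward(toxicity: torch.Tensor, alignment: torch.Tensor,
                    lambda_align: float = 0.5) -> torch.Tensor:
    """Mix the classifier's toxicity score with the image-prompt alignment score."""
    return toxicity + lambda_align * alignment

def group_normalised_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Score each rollout relative to its siblings from the same prompt,
    removing the need for a learned value critic."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one group of 8 denoising rollouts for a single edited prompt.
tox = torch.rand(8)     # toxicity classifier scores in [0, 1]
align = torch.rand(8)   # image-prompt alignment scores in [0, 1]
adv = group_normalised_advantage(terminal_reward(tox, align))
# `adv` would weight the log-probabilities of each rollout's denoising actions
# in the policy-gradient loss, with gradients flowing only into LoRA adapters.
```

The design point of the group normalisation is that each rollout is judged against its siblings generated from the same prompt, which keeps the low-level update stable without training a separate value model.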
To see what matters when, the authors build a temporal alignment diagnostic: they perturb blocks of denoising timesteps and apply a multi-level Monte Carlo estimator to score per-timestep influence across toxicity dimensions. Think of it as a heatmap over time that shows where Identity slurs, threats, or obscenity really lock in.
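A rough sketch of that diagnostic is below, assuming a generator hook that lets you perturb chosen timesteps and a scorer that returns per-dimension toxicity. It uses plain Monte Carlo averaging rather than the paper's multi-level estimator, and `generate` / `score_toxicity` are hypothetical callables, not the authors' API.

```python
# Hedged sketch: estimate how much each block of denoising timesteps
# contributes to each toxicity dimension by perturbing that block and
# measuring the drop relative to an unperturbed baseline.
import numpy as np

def block_influence(generate, score_toxicity, prompt, num_steps=28,
                    block_size=4, num_samples=16, seed=0):
    """Return an array of shape [num_blocks, num_toxicity_dims]: the drop in
    each toxicity dimension when one block of timesteps is perturbed."""
    rng = np.random.default_rng(seed)
    blocks = [list(range(s, min(s + block_size, num_steps)))
              for s in range(0, num_steps, block_size)]
    # Unperturbed baseline, averaged over Monte Carlo seeds.
    baseline = np.mean([score_toxicity(generate(prompt, perturb_steps=None,
                                                seed=int(rng.integers(1 << 31))))
                        for _ in range(num_samples)], axis=0)
    heatmap = []
    for block in blocks:
        scores = [score_toxicity(generate(prompt, perturb_steps=block,
                                          seed=int(rng.integers(1 << 31))))
                  for _ in range(num_samples)]
        heatmap.append(baseline - np.mean(scores, axis=0))  # influence of this block
    return np.array(heatmap)  # rows ~ time blocks, columns ~ toxicity dimensions
```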
Vulnerability windows, not diffuse risk
On attack strength, STARE lands a peak Any-ASR of 31.36% on the primary dataset, a 68% relative lift over the strongest black-box baseline (ART at 18.62%), and it beats a flat white-box RL variant (DDPO at 27.84%). Ablations show the hierarchy matters: removing the prompt editor or the LoRA path hurts performance, with LoRA-only at 25.56%.
The punchline is what they call Optimisation-Induced Phase Alignment. Vanilla models show fuzzy, spread-out harm. After optimisation, conceptual harms concentrate early in the semantic phase; detail-oriented harms concentrate late in refinement. Causal tests back it up: zeroing early steps collapses Identity (−62%) and Threat (−58%) while leaving Insult largely intact; zeroing late steps collapses Obscene (−33%) and Insult (−35%) while Identity and Threat are mostly untouched. Those are real handles, not just pretty plots.
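For intuition on the zeroing intervention, here is a hedged sketch of what disabling a window could look like during rectified-flow sampling: fall back to the base denoiser's velocity inside the chosen steps. The `base_velocity` and `tuned_velocity` callables and the window boundaries are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of window zeroing: switch off the fine-tuned (LoRA)
# contribution inside a chosen timestep window, so the trajectory falls
# back to the base denoiser there.
def sample_with_window_zeroed(x, timesteps, base_velocity, tuned_velocity,
                              zero_window=range(0, 10)):
    """Euler sampling along a rectified flow where the adversarial update is
    disabled on steps inside `zero_window`, e.g. the early semantic phase."""
    for i, (t, t_next) in enumerate(zip(timesteps[:-1], timesteps[1:])):
        v = base_velocity(x, t) if i in zero_window else tuned_velocity(x, t)
        x = x + (t_next - t) * v  # standard Euler step along the flow
    return x
```

Comparing per-dimension toxicity with the early window zeroed versus the late window zeroed is what produces the double dissociation described above.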
It gets worse for defenders: the crafted images transfer. Trained against one VLM, they still trigger on others, with peak ASRs of 21.47% on Qwen, 15.96% on Gemini, and 13.64% on GPT-5.4. They also hold on an out-of-distribution prompt set (30.83%) and even when rendered by a different generator (FLUX.1-dev at 17.82%). Because these are realistic images, not pixel-noise hacks, typical perturbation filters are unlikely to catch them.
There are caveats: the approach assumes white-box T2I access for fine-tuning, focuses on a rectified-flow model and English datasets, and leans on specific detectors and reward weights. Still, the core insight is compelling. If toxicity formation clusters into early and late windows, phase-aware defences that monitor or intervene at those windows may be possible. The open question is how universal these windows are across architectures, languages, and harm types.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies how adversarial image-text inputs induce toxic continuations in vision-language models and, crucially, when toxic semantics emerge during the multi-step image generation process. Prior red-teaming approaches treat text-to-image generation as a black box and report only final toxicity, leaving the temporal formation of harmful content unexplained. Understanding step-wise emergence matters because it reveals whether toxicity is diffuse or concentrated at particular synthesis stages, which affects both the potency of attacks and the design of mitigations.
Approach
The authors introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory of diffusion-based image generators as the attack surface. STARE couples a high-level prompt editor that steers semantic subgoals with a low-level module that fine-tunes the T2I denoiser via Group Relative Policy Optimisation (GRPO) applied to a rectified-flow text-to-image model (Stable Diffusion 3.5-Medium). The VLM target is query-only black-box; the reward is a terminal toxicity score from a toxicity classifier combined with an image-prompt alignment score. Training uses LoRA for parameter-efficient fine-tuning, marginal-preserving stochastic sampling for exploration, and a group-normalised advantage objective. A diagnostic temporal alignment pipeline perturbs blocks of denoising timesteps and uses a multi-level Monte Carlo estimator to produce per-timestep influence heatmaps across toxicity dimensions. Experiments use RealToxicityPrompts for training and PolygloToxicityPrompts for out-of-distribution evaluation, with transfer tests on other VLMs and a different generator.
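A compact skeleton of that hierarchical loop, under the stated assumptions, might look as follows. Every component name (the prompt editor's methods, `grpo_update`, the scorers) is a placeholder standing in for the paper's modules, and the equal weighting of the two reward terms is an assumption.

```python
# Hedged skeleton of the hierarchical attack loop: the high-level editor
# proposes a semantic subgoal, the low-level step updates the LoRA-adapted
# denoiser with a GRPO-style policy gradient against the combined reward.
def stare_style_loop(prompt_editor, lora_denoiser, query_vlm,
                     toxicity_clf, align_scorer, seed_prompts,
                     iters=1000, group_size=8):
    for _ in range(iters):
        prompt = prompt_editor.propose(seed_prompts)              # semantic subgoal
        images = [lora_denoiser.sample(prompt, stochastic=True)   # exploration via
                  for _ in range(group_size)]                     # stochastic sampling
        rewards = [toxicity_clf(query_vlm(img, prompt))           # toxicity of the VLM's
                   + align_scorer(img, prompt)                    # continuation + alignment
                   for img in images]
        lora_denoiser.grpo_update(prompt, images, rewards)        # group-normalised PG step
        prompt_editor.update(prompt, max(rewards))                # feedback to the editor
```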
Key Findings
- Substantially improved attack success: STARE attains a peak Any-ASR of 31.36% on the primary dataset, a 68% relative improvement over the strongest black-box baseline (ART at 18.62%), and outperforms a flat white-box RL alternative (STARE w/ DDPO at 27.84%).
- Optimisation-Induced Phase Alignment: adversarial optimisation concentrates conceptual harms into an early semantic phase and detail-oriented harms into a late refinement phase; vanilla models show only diffuse, weakly structured toxicity.
- Causal double dissociation via window zeroing: disabling early denoising updates collapses Identity (−62%) and Threat (−58%) toxicity while leaving Insult largely unchanged; disabling late updates collapses Obscene (−33%) and Insult (−35%) while leaving Identity and Threat nearly intact, indicating phase-specific causal control.
- Transferability: adversarial images trained against one VLM transfer to other models and environments, achieving peak ASRs of 21.47% on Qwen, 15.96% on Gemini, 13.64% on GPT-5.4, 30.83% on the out-of-distribution dataset PTP, and 17.82% when deployed through a different T2I generator (FLUX.1-dev) without re-training.
- Component synergy required: ablations show removing the prompt editor or the LoRA fine-tuning reduces ASR (LoRA-only 25.56%), demonstrating the hierarchical design is essential to create temporal alignment and maximise effect.
Limitations
The method assumes direct white-box access to the T2I model for fine-tuning, which limits applicability where only a closed service is available. Experiments focus on a rectified-flow generator (SD 3.5-Medium) and English toxicity datasets, so temporal patterns may differ on other architectures, languages or harm types. Results rely on specific toxicity detectors and reward weights, and the work does not claim formal causal-graph identifiability beyond the demonstrated interventions.
Implications
Offensive implications are strong: an adversary with the ability to fine-tune a T2I model and to craft prompt edits can reliably induce specific categories of toxic continuations by aligning optimisation to precise denoising windows. Phase-specific control enables selective amplification or suppression of distinct harms and yields adversarial examples that transfer to multiple VLMs and generators. Because the attacks operate via realistic image synthesis rather than low-level pixel noise, they are less likely to be removed by filters tuned to detect perturbation artefacts, making phase-aware image generation a potent attack vector against multimodal systems.