
REFORGE breaks concept unlearning with image-based red teaming

Published: Wed, Mar 18, 2026 • By Rowan Vale
REFORGE shows concept unlearning in image generators can be bypassed with black-box, image-based prompts. Using cross-attention masks from a proxy model and stroke-based initialisation, it recovers erased concepts across nudity, object and style tasks. It runs faster than prior baselines and exposes a gap in current deployment threat models.

Image generation models (IGMs) produce convincing pictures, but they can also reproduce copyrighted or offensive content. A common mitigation is unlearning, which removes specific concepts from a trained model without a full retrain. The open question has been whether these defences hold up when an attacker uses images as part of the prompt, not just text.

REFORGE answers that with a practical red-teaming framework. It treats the target as a black box, meaning the attacker only sees inputs and outputs and has no access to weights or gradients. The attack crafts adversarial image prompts that coax supposedly erased concepts back into the generated output.

What REFORGE does

Cross-attention is the part of the model that links words in the prompt to regions in the image. REFORGE uses a public proxy model to read those links and focus the attack where it matters. It starts by turning a reference image that contains the banned concept into a coarse, stroke-like rendering, which keeps layout and colour while stripping detail. This makes a natural-looking starting point for an image prompt.
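The coarsening idea can be sketched in a few lines: average small pixel blocks and snap the result to a handful of colour levels, so layout and palette survive while fine detail disappears. The function below is a minimal illustration of that idea, not code from the REFORGE repository; the block size and palette depth are assumptions.

```python
# Minimal sketch of stroke-style coarsening: block-average an RGB grid,
# then quantise each block's colour to a small set of levels.

def coarsen(image, block=2, levels=4):
    """Downsample `image` (rows of (r, g, b) tuples) by averaging each
    `block` x `block` patch, then snap channels to `levels` colour levels."""
    step = 255 // (levels - 1)
    h, w = len(image), len(image[0])
    out = []
    for y in range(0, h, block):
        row = []
        for x in range(0, w, block):
            patch = [image[yy][xx]
                     for yy in range(y, min(y + block, h))
                     for xx in range(x, min(x + block, w))]
            avg = [sum(c[i] for c in patch) // len(patch) for i in range(3)]
            row.append(tuple(min(255, step * round(v / step)) for v in avg))
        out.append(row)
    return out

# A uniform 4x4 red image collapses to a 2x2 grid of quantised colour.
img = [[(200, 10, 10)] * 4 for _ in range(4)]
print(coarsen(img))  # → [[(170, 0, 0), (170, 0, 0)], [(170, 0, 0), (170, 0, 0)]]
```

Real stroke rendering is richer than block averaging, but the effect it aims for is the same: a plausible-looking image prompt that keeps composition and colour while shedding the detail a filter might key on.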

From the proxy model, it aggregates cross-attention activations to build a spatial mask that highlights concept-relevant areas. It then optimises the image in the proxy model’s Variational Autoencoder (VAE) latent space, which is a compressed representation of the picture. The optimisation uses a simple mean squared error objective to align the adversarial latent towards the reference latent, but only in the masked regions. The result is an image prompt that looks innocuous yet contains the right cues to resurrect the erased concept when combined with a text prompt. The authors keep a practical budget: about 100 sampling steps and a limit of ten generation queries per adversarial example when probing the unlearned target.
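The masked objective described above is just an MSE weighted by the attention mask, L = Σᵢ mᵢ (zᵢ − z_refᵢ)², so gradient steps only move entries the mask flags as concept-relevant. The toy below uses flat Python lists in place of VAE latents and a hand-written gradient step; it illustrates the shape of the optimisation, not the authors' implementation.

```python
# Toy sketch of masked latent alignment: gradient descent on a masked MSE
# objective, L = sum_i m_i * (z_i - z_ref_i)^2. Only entries with a nonzero
# mask weight get pulled towards the reference latent.

def align(z, z_ref, mask, lr=0.25, steps=20):
    z = list(z)
    for _ in range(steps):
        for i in range(len(z)):
            grad = 2.0 * mask[i] * (z[i] - z_ref[i])  # dL/dz_i
            z[i] -= lr * grad
    return z

z_adv = align(z=[0.0, 0.0, 0.0], z_ref=[1.0, 1.0, 1.0], mask=[1.0, 0.0, 1.0])
print(z_adv)  # masked entries converge towards 1.0; the unmasked entry stays 0.0
```

Concentrating updates this way is what keeps the perturbation both effective and visually subtle: regions the mask zeroes out are never touched.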

Across three unlearning targets—nudity, an object category (parachute), and Van Gogh’s style—REFORGE raises attack success rates over representative black-box baselines. It also tightens semantic alignment between the generated image and the prompt, measured by CLIP similarity, which the paper attributes to the stroke-based start that preserves global composition. The cross-attention mask matters too: concentrating perturbations in concept-relevant regions makes the attacks both more effective and less visually obvious.

Speed is notable. Under the same constraints, average adversarial example generation takes roughly 35 seconds for REFORGE, compared with about 290 seconds for SneakyPrompt, 320 seconds for Ring-A-Bell, and 1000 seconds for MMA. For anyone running a red team under real-world limits, that difference is the gap between testing a handful of cases and testing hundreds.

Limits and implications

The setup relies on a public proxy model to obtain cross-attention maps and run the optimisation. Results are shown on Stable Diffusion variants, specific unlearning implementations, and a single-GPU budget. The evaluation covers 150, 45, and 48 prompt–reference pairs for the nudity, parachute, and Van Gogh-style tasks, respectively. These are solid but bounded conditions.

Even so, the message is clear: unlearning as currently practiced is vulnerable to multi-modal attacks. If you operate unlearned IGMs, include image-side adversarial testing in pre-deployment checks and ongoing monitoring. Treat unlearning as one layer in a defence-in-depth setup rather than a primary gate. Focus testing per concept, not only in aggregate, because objects and styles behave differently. Finally, plan for attackers with small query budgets; the paper’s ten-query limit still proved enough.
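The ten-query budget translates into a very simple probing loop: generate, check, stop at the first recovery. The sketch below shows that loop with hypothetical stand-ins; `generate` and `concept_detector` are placeholders for the target model and an evaluation classifier, neither of which is specified by this article.

```python
# Hedged sketch of a fixed-budget probing loop matching the paper's threat
# model: at most `budget` generation calls per adversarial example, stopping
# at the first detected recovery of the erased concept.

def probe(generate, concept_detector, image_prompt, text_prompt, budget=10):
    for query in range(1, budget + 1):
        output = generate(image_prompt, text_prompt)
        if concept_detector(output):
            return query  # success: erased concept reappeared on this query
    return None  # attack failed within the budget

# Toy stand-ins: the "model" leaks the concept on the third sample.
outputs = iter([False, False, True])
hit = probe(lambda img, txt: next(outputs), lambda out: out, "adv.png", "a photo")
print(hit)  # → 3
```

Defensively, the same loop is your monitoring primitive: run it per concept with your own detector and treat any success inside a small budget as a failed unlearning gate.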

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Authors: Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong, Li Tang, Wei Zhou, and Renyang Liu
Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies whether image generation model unlearning, a set of techniques that remove specific harmful or copyrighted concepts from pretrained image generators without full retraining, remains robust when an attacker can supply adversarial image prompts in a black-box setting. This matters because modern image generation models can reproduce copyrighted or offensive content, and unlearning is proposed as a practical mitigation; if unlearning can be bypassed by realistic multi-modal inputs, deployed safety controls may be insufficient.

Approach

The authors introduce REFORGE, a black-box red-teaming framework that crafts adversarial image prompts to induce the re-emergence of erased concepts. REFORGE operates without access to the target model’s parameters or gradients; it uses a public proxy image generation model to compute cross-attention maps and perform optimisation. The pipeline has four stages: (1) stroke-based initialisation that converts a reference image containing the erased concept into a coarse, stroke-rendered image preserving layout and colour cues while removing fine details; (2) mask construction via aggregation and normalisation of cross-attention activations from the proxy model to produce a spatial mask that emphasises concept-relevant regions; (3) latent-alignment optimisation that iteratively updates the adversarial example in the proxy model’s VAE latent space, aligning it towards the reference latent under the guidance of the attention mask and an MSE alignment loss; (4) red-teaming evaluation by querying the unlearned model with the adversarial image plus text prompt and measuring whether the erased concept reappears. Experiments use a 100-step sampling budget and a practical query limit of 10 generation calls per adversarial example.
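Stage (2), aggregation and normalisation, can be sketched as averaging several cross-attention maps for the concept token and min-max normalising the result into a soft [0, 1] mask. The map shapes and the plain-average aggregation below are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of mask construction: average cross-attention maps,
# then min-max normalise so the mask spans [0, 1].

def build_mask(attn_maps):
    """attn_maps: list of equally sized 2-D attention grids (rows of floats)."""
    h, w = len(attn_maps[0]), len(attn_maps[0][0])
    agg = [[sum(m[y][x] for m in attn_maps) / len(attn_maps)
            for x in range(w)] for y in range(h)]
    lo = min(min(row) for row in agg)
    hi = max(max(row) for row in agg)
    return [[(v - lo) / (hi - lo) for v in row] for row in agg]

maps = [[[0.1, 0.9], [0.2, 0.8]],
        [[0.3, 0.7], [0.2, 0.6]]]
mask = build_mask(maps)
# After normalisation the least-attended cell is exactly 0.0 and the
# most-attended cell is exactly 1.0; intermediate cells are soft weights.
print(mask)
```

In REFORGE this soft mask then gates the latent-alignment loss, which is what concentrates perturbations in concept-relevant regions.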

Key Findings

  • REFORGE improves attack success rates relative to representative black-box baselines, showing higher rates of concept recovery across three unlearning categories: Nudity, Object (Parachute) and Van Gogh style.
  • REFORGE achieves better semantic alignment between generated images and prompts, measured by CLIP similarity, attributed to stroke-based initialisation that preserves global composition while enabling focused perturbations.
  • The cross-attention guided mask concentrates perturbations in concept-relevant regions, improving effectiveness and reducing visible perturbation compared with uniform updates.
  • REFORGE is substantially more efficient: average adversarial example generation time is about 35 seconds versus roughly 290s for SneakyPrompt, 320s for Ring-A-Bell and 1000s for MMA, under the same black-box constraints.
  • Ablations show robustness to choice of reference image, task-dependent optimal attention timesteps and layer depths, and that MSE loss outperforms Cosine and L2 objectives for alignment.

Limitations

Evaluation assumes access to a public proxy model to extract cross-attention and compute optimisation steps, a fixed query budget of ten generation calls per example, and specific unlearning methods and datasets (150, 45 and 48 prompt-reference pairs retained for Nudity, Parachute and Van Gogh-style respectively). Experiments use Stable Diffusion variants and a single GPU setup, so results are constrained to the tested architectures, unlearning implementations and practical budgets.

Why It Matters

The work demonstrates that current IGM unlearning mechanisms can be circumvented by realistically constructed, black-box multi-modal attacks that exploit image inputs, exposing a gap in threat models and deployed defences. This has direct security implications for providers of closed-source image generation services and for policy and engineering efforts: unlearning should be evaluated under multi-modal adversaries, defences should be robustness-aware, and operators should consider red-teaming that includes image-side attacks before deployment.

