REFORGE breaks concept unlearning with image-based red teaming
Pentesting
Image generation models (IGMs) produce convincing pictures, but they also reproduce copyrighted or offensive content. A common mitigation is unlearning, which removes specific concepts from a trained model without a full retrain. The open question has been whether these defences hold up when an attacker uses images as part of the prompt, not just text.
REFORGE answers that with a practical red-teaming framework. It treats the target as a black box, meaning the attacker only sees inputs and outputs and has no access to weights or gradients. The attack crafts adversarial image prompts that coax supposedly erased concepts back into the generated output.
What REFORGE does
Cross-attention is the part of the model that links words in the prompt to regions in the image. REFORGE uses a public proxy model to read those links and focus the attack where it matters. It starts by turning a reference image that contains the banned concept into a coarse, stroke-like rendering, which keeps layout and colour while stripping detail. This makes a natural-looking starting point for an image prompt.
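The stroke-based start can be illustrated with a minimal sketch. The paper uses a stroke renderer; here, as a stand-in assumption, block averaging plays the same role: it keeps layout and colour while stripping fine detail. The function name `stroke_init` and the block-averaging approach are illustrative, not the authors' implementation.

```python
import numpy as np

def stroke_init(image, block=16):
    """Coarsen an image by averaging over block x block patches.

    A stand-in for stroke-based initialisation: block averaging keeps
    the global layout and colour of the reference image while removing
    the fine detail that would make the concept overt.
    """
    h, w, c = image.shape
    h2, w2 = h - h % block, w - w % block  # crop to a multiple of block
    img = image[:h2, :w2].astype(np.float64)
    # Average each block, then tile the block means back to full resolution.
    coarse = img.reshape(h2 // block, block, w2 // block, block, c).mean(axis=(1, 3))
    return np.repeat(np.repeat(coarse, block, axis=0), block, axis=1)
```

The coarse output has far less per-pixel variation than the input, which is the property that makes it a natural-looking, low-detail starting point for an image prompt.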
From the proxy model, it aggregates cross-attention activations to build a spatial mask that highlights concept-relevant areas. It then optimises the image in the proxy model’s Variational Autoencoder (VAE) latent space, which is a compressed representation of the picture. The optimisation uses a simple mean squared error objective to align the adversarial latent towards the reference latent, but only in the masked regions. The result is an image prompt that looks innocuous yet contains the right cues to resurrect the erased concept when combined with a text prompt. The authors keep a practical budget: about 100 sampling steps and a limit of ten generation queries per adversarial example when probing the unlearned target.
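The masked alignment step reduces to a simple iteration. Below is a minimal numpy sketch of a masked-MSE latent update: the gradient of the masked squared error is computed in closed form (the mean's normalisation constant is absorbed into the learning rate). In the real attack the latent lives in the proxy VAE's latent space and is decoded before querying the target; this sketch only shows the optimisation core, and the names `align_latent`, `z_adv`, and `z_ref` are illustrative.

```python
import numpy as np

def align_latent(z_adv, z_ref, mask, lr=0.1, steps=100):
    """Move the adversarial latent towards the reference latent,
    but only where the cross-attention mask is high.

    Objective per element: mask * (z - z_ref)**2, so the gradient
    is 2 * mask * (z - z_ref). Where mask is 0 the latent is untouched,
    keeping the perturbation confined to concept-relevant regions.
    """
    z = z_adv.copy()
    for _ in range(steps):
        grad = 2.0 * mask * (z - z_ref)
        z -= lr * grad
    return z
```

With a binary mask, masked entries converge geometrically to the reference latent while unmasked entries stay exactly at their initial values, which is why the perturbation stays visually localised.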
Across three unlearning targets—nudity, an object category (parachute), and Van Gogh’s style—REFORGE raises attack success rates over representative black-box baselines. It also tightens semantic alignment between the generated image and the prompt, measured by CLIP similarity, which the paper attributes to the stroke-based start that preserves global composition. The cross-attention mask matters too: concentrating perturbations in concept-relevant regions makes the attacks both more effective and less visually obvious.
Speed is notable. Under the same constraints, average adversarial example generation takes roughly 35 seconds for REFORGE, compared with about 290 seconds for SneakyPrompt, 320 seconds for Ring-A-Bell, and 1000 seconds for MMA. For anyone running a red team under real-world limits, that difference is the gap between testing a handful of cases and testing hundreds.
Limits and implications
The setup relies on a public proxy model to obtain cross-attention maps and run the optimisation. Results are shown on Stable Diffusion variants, specific unlearning implementations, and a single-GPU budget. The evaluation covers 150, 45, and 48 prompt–reference pairs for the nudity, parachute, and Van Gogh-style tasks, respectively. These are solid but bounded conditions.
Even so, the message is clear: unlearning as currently practised is vulnerable to multi-modal attacks. If you operate unlearned IGMs, include image-side adversarial testing in pre-deployment checks and ongoing monitoring. Treat unlearning as one layer in a defence-in-depth setup rather than a primary gate. Focus testing per concept, not only in aggregate, because objects and styles behave differently. Finally, plan for attackers with small query budgets; the paper’s ten-query limit still proved enough.
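A budget-capped probe of this kind is easy to fold into a test harness. The sketch below assumes two caller-supplied hooks, `generate` (queries the unlearned model) and `detect_concept` (a concept classifier on the output); both names and the dict-shaped result are assumptions for illustration, not the paper's API.

```python
def probe_unlearned_model(generate, detect_concept, image_prompt, text_prompt, budget=10):
    """Query an unlearned model at most `budget` times with one
    adversarial example; report whether the erased concept reappears.

    generate(image_prompt, text_prompt) -> generated image
    detect_concept(image) -> bool, True if the erased concept is present
    """
    for query in range(1, budget + 1):
        image = generate(image_prompt, text_prompt)
        if detect_concept(image):
            return {"success": True, "queries": query}
    return {"success": False, "queries": budget}
```

Running this per concept, rather than only in aggregate, surfaces the per-category differences the paper observes between objects and styles.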
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies whether image generation model unlearning, a set of techniques that remove specific harmful or copyrighted concepts from pretrained image generators without full retraining, remains robust when an attacker can supply adversarial image prompts in a black-box setting. This matters because modern image generation models can reproduce copyrighted or offensive content, and unlearning is proposed as a practical mitigation; if unlearning can be bypassed by realistic multi-modal inputs, deployed safety controls may be insufficient.
Approach
The authors introduce REFORGE, a black-box red-teaming framework that crafts adversarial image prompts to induce the re-emergence of erased concepts. REFORGE operates without access to the target model’s parameters or gradients; it uses a public proxy image generation model to compute cross-attention maps and perform optimisation. The pipeline has four stages: (1) stroke-based initialization that converts a reference image containing the erased concept into a coarse, stroke-rendered image preserving layout and colour cues while removing fine details; (2) mask construction via aggregation and normalisation of cross-attention activations from the proxy model to produce a spatial mask that emphasises concept-relevant regions; (3) latent-alignment optimisation that iteratively updates the adversarial example in the proxy model’s VAE latent space, aligning it towards the reference latent under the guidance of the attention mask and an MSE alignment loss; (4) red-teaming evaluation by querying the unlearned model with the adversarial image plus text prompt and measuring whether the erased concept reappears. Experiments use a 100-step sampling budget and a practical query limit of 10 generation calls per adversarial example.
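Stage 2 of the pipeline can be sketched in a few lines: collect the concept token's cross-attention maps across timesteps and layers, average them, and min-max normalise into a spatial mask. The function below is a minimal numpy sketch of that aggregate-and-normalise step; the exact aggregation the authors use (which timesteps and layer depths, per the ablations) is task-dependent, so this is illustrative only.

```python
import numpy as np

def build_attention_mask(attn_maps):
    """Aggregate cross-attention maps into one spatial mask in [0, 1].

    attn_maps: list of 2-D arrays, one per (timestep, layer) pair,
    each giving the concept token's attention over image regions.
    """
    agg = np.mean(np.stack(attn_maps), axis=0)          # average over maps
    rng = agg.max() - agg.min()
    return (agg - agg.min()) / (rng + 1e-8)             # min-max normalise
```

The resulting mask then weights the latent-alignment loss in stage 3, so updates concentrate where the proxy model attends to the erased concept.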
Key Findings
- REFORGE improves attack success rates relative to representative black-box baselines, showing higher rates of concept recovery across three unlearning categories: Nudity, Object (Parachute) and Van Gogh style.
- REFORGE achieves better semantic alignment between generated images and prompts, measured by CLIP similarity, attributed to stroke-based initialisation that preserves global composition while enabling focused perturbations.
- The cross-attention guided mask concentrates perturbations in concept-relevant regions, improving effectiveness and reducing visible perturbation compared with uniform updates.
- REFORGE is substantially more efficient: average adversarial example generation time is about 35 seconds versus roughly 290s for SneakyPrompt, 320s for Ring-A-Bell and 1000s for MMA, under the same black-box constraints.
- Ablations show robustness to choice of reference image, task-dependent optimal attention timesteps and layer depths, and that MSE loss outperforms Cosine and L2 objectives for alignment.
Limitations
Evaluation assumes access to a public proxy model to extract cross-attention and compute optimisation steps, a fixed query budget of ten generation calls per example, and specific unlearning methods and datasets (150, 45 and 48 prompt-reference pairs retained for Nudity, Parachute and Van Gogh-style respectively). Experiments use Stable Diffusion variants and a single GPU setup, so results are constrained to the tested architectures, unlearning implementations and practical budgets.
Why It Matters
The work demonstrates that current IGM unlearning mechanisms can be circumvented by realistically constructed, black-box multi-modal attacks that exploit image inputs, exposing a gap in threat models and deployed defences. This has direct security implications for providers of closed-source image generation services and for policy and engineering efforts: unlearning should be evaluated under multi-modal adversaries, defences should be robustness-aware, and operators should consider red-teaming that includes image-side attacks before deployment.