
Anti-I2V disrupts deepfake image-to-video generation models

Defenses
Published: Thu, Mar 26, 2026 • By Natalie Kestrel
New research proposes Anti-I2V, an adversarial cloaking method to stop a single portrait photo being turned into a convincing video by diffusion-based image-to-video models. It perturbs colour and frequency domains and targets internal layers to break identity and temporal coherence. Results look strong, but some limits remain.

Deepfake video no longer needs much fuel. One photo plus a helpful text prompt is often enough for modern image-to-video systems to spin up a moving, lip-synced clip. The paper behind Anti-I2V takes aim at that pipeline by making the source photo a hostile surface. Rather than nudging RGB pixels, it perturbs colour and frequency information and messes with what the video model learns over time.

What it does

Anti-I2V applies small, crafted changes in two places that video diffusion models tend to rely on. First, it alters the a* and b* channels of the L*a*b* colour space, which shifts chroma while leaving luminance alone. Second, it tweaks low-frequency coefficients in the Discrete Cosine Transform, which influence broad structure rather than fine detail. The intent is to look normal to people while starving the model of stable, identity-rich signals.
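
The frequency half of that dual-space idea can be sketched in plain Python: a toy 4x4 channel (standing in for the a* chroma plane after an RGB-to-L*a*b* conversion, which is omitted here) goes through an orthonormal 2-D DCT, only the low-frequency corner of the coefficient grid is nudged, and the channel is reconstructed. The 4x4 size, the values and the corner size are illustrative, not the paper's settings.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    return [
        (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
        * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]

def idct_1d(c):
    """Inverse of dct_1d (DCT-III with the same normalisation)."""
    n = len(c)
    return [
        sum(
            (math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
            * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]

def transpose(m):
    return [list(r) for r in zip(*m)]

def dct_2d(block):   # rows, then columns (separable transform)
    return transpose([dct_1d(r) for r in transpose([dct_1d(r) for r in block])])

def idct_2d(block):
    return transpose([idct_1d(r) for r in transpose([idct_1d(r) for r in block])])

# Toy 4x4 chroma plane (values in [0, 1]); the luminance plane is left untouched.
channel = [
    [0.20, 0.25, 0.30, 0.28],
    [0.22, 0.27, 0.31, 0.29],
    [0.21, 0.26, 0.30, 0.27],
    [0.19, 0.24, 0.28, 0.26],
]

coeffs = dct_2d(channel)
eps = 16 / 255  # the paper's perturbation budget, reused here for the toy
for k in range(2):      # nudge only the low-frequency (broad-structure) corner
    for l in range(2):
        coeffs[k][l] += eps

perturbed = idct_2d(coeffs)  # channel carrying a low-frequency perturbation
```

Because the change sits in low-frequency coefficients, it spreads as a smooth shift across the whole channel rather than as visible high-frequency speckle.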

On top of that, the method targets how the denoiser represents the subject as it removes noise over time. The authors locate intermediate layers with high semantic content in both UNet and Diffusion Transformer backbones. They then optimise two losses: Internal Representation Collapse, which drags deep features toward low-level early features, and Internal Representation Anchor, which pushes features toward or away from those of an unrelated latent. Auxiliary perceptual losses are used to steady the optimisation.
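
The two objectives can be caricatured with cosine similarity on plain vectors. The vectors below are toy stand-ins for denoiser activations, and the exact loss forms in the paper differ; this only shows the pull-together / push-apart structure.

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def irc_loss(deep_feat, early_feat):
    """Internal Representation Collapse (sketch): low when deep-layer
    features have collapsed toward low-semantic early-layer features."""
    return 1.0 - cos_sim(deep_feat, early_feat)

def ira_loss(feat, anchor_feat, targeted=True):
    """Internal Representation Anchor (sketch): pull toward (targeted) or
    push away from (untargeted) the features of an unrelated latent."""
    s = cos_sim(feat, anchor_feat)
    return (1.0 - s) if targeted else s

# Toy feature vectors standing in for layer activations
deep = [0.9, 0.1, 0.0]
early = [0.1, 0.9, 0.2]
anchor = [0.0, 0.0, 1.0]
total = irc_loss(deep, early) + ira_loss(deep, anchor, targeted=True)
```

Minimising `total` over the perturbation simultaneously flattens the semantic content of deep features and drags them toward an unrelated identity, which is the combined effect the paper reports.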

In tests on CelebV-Text and UCF101 with models including CogVideoX-5B, OpenSora v1.2 and DynamiCrafter, the dual-space perturbation outperforms RGB-only updates. Reported identity and quality scores drop sharply. For DynamiCrafter, identity similarity falls to 0.151 with a Q-Align(V) of 0.047, compared to clean outputs at 0.528 and 0.794. On UCF101 the method reports 0.068 for identity and 0.084 for Q-Align(V) versus 0.384 and 0.562. The perturbations also remain more stable under JPEG compression, blur, Gaussian noise and several purification procedures. Transfer between diffusion transformer models holds up, and even cross-architecture transfer to UNet variants is described as competitive.

Where it falls short

There are caveats that matter for deployment. The optimisation uses 200 iterations and a perturbation budget of 16/255 in the L*a*b* space. That is workable for a defender preparing a profile photo, but not cheap at scale. More importantly, the method computes gradients through a surrogate video diffusion model to place and shape the noise. The paper shows decent transfer between some models, but real attackers often use proprietary services. If your cloak is tuned on one open model and the adversary uses another with different preprocessing or a stronger face encoder, transfer could degrade.

The evaluation only feeds the first four frames into the optimisation to save memory. That is understandable, but it sidesteps the question of how protection holds as long videos accumulate evidence across dozens or hundreds of frames. The authors also note that some perceptual metrics understate how subtle the changes look, since colour-space edits inflate pixel-wise differences. On the flip side, platforms routinely resize, crop, convert colour spaces and strip metadata. The method survives several common distortions, but social platforms stack them, and face tracking can re-centre crops. The tests do not cover that full gauntlet.
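
Why low-frequency perturbations tend to survive compression better can be illustrated with JPEG-style coefficient quantisation, which uses fine steps at low frequencies and coarse steps at high frequencies. The step table and coefficient values below are made up for illustration; real JPEG uses 8x8 2-D tables.

```python
def quantise(coeffs, steps):
    """JPEG-style quantisation: round each coefficient to its step size."""
    return [round(c / s) * s for c, s in zip(coeffs, steps)]

# Hypothetical quantisation steps: fine for low frequencies, coarse for high
steps = [1, 2, 4, 8, 16, 32, 64, 128]
clean = [50.0, 20.0, 5.0, 2.0, 1.0, 0.5, 0.2, 0.1]

eps = 6.0
low = clean[:]
low[1] += eps     # perturb a low-frequency coefficient
high = clean[:]
high[7] += eps    # same-size perturbation at a high frequency

# How much of each perturbation survives quantisation
surv_low = abs(quantise(low, steps)[1] - quantise(clean, steps)[1])
surv_high = abs(quantise(high, steps)[7] - quantise(clean, steps)[7])
```

The low-frequency perturbation passes through intact while the equal-sized high-frequency one is rounded away entirely, which matches the intuition behind placing noise where compression is gentle.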

Baselines are limited by available implementations, and adding RGB updates on top of the dual-space approach hurts performance because the small budget has to be split. That suggests the technique is sensitive to how the budget is allocated. A determined adversary will also try video-specific countermeasures, such as purification tuned for diffusion transformers, identity restoration, or fine-tuning with adversarial training against known cloaks.

Still, the core idea lands. Hitting colour and frequency where the model leans on stable identity cues, then corrupting internal representations that drive temporal coherence, is a defensible way to raise the cost of person-specific video generation. If you run media authentication or need to publish high-risk images, this is a promising layer. The open questions are about longevity under platform pipelines and future model hardening, not about whether the current results are real.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Authors: Duc Vu, Anh Nguyen, Chi Tran, and Anh Tran
Advances in diffusion-based video generation models, while significantly improving human animation, poses threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention, and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L$*$a$*$b$* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies the risk that a single portrait photo can be used to generate realistic, temporally coherent fake videos via modern image-to-video diffusion models. Existing image cloaking and adversarial-defence techniques mainly target image generation or UNet-based architectures and often operate in RGB space, making them less effective against large diffusion transformer models that retain richer features and stronger temporal consistency. The work addresses the need for a defence that degrades identity preservation and temporal coherence across diverse video diffusion backbones.

Approach

Anti-I2V combines a dual-space perturbation strategy with layer-wise feature disruption during the denoising process. The perturbation optimisation is performed in two non-RGB domains: the L*a*b* colour space, with noise applied to the a* and b* channels, and the low-frequency coefficients of the Discrete Cosine Transform. The authors identify semantically rich intermediate layers in UNet and Diffusion Transformer denoisers using feature visualisation and PCA, and design two losses: Internal Representation Collapse (IRC), which aligns deep-layer features with low-semantic early-layer features, and Internal Representation Anchor (IRA), which pushes perturbed-layer features toward those of an unrelated target latent (or away in untargeted settings). Auxiliary perceptual losses (LPIPS and CLIP feature loss) are included. Optimisation uses 200 iterations, a perturbation budget of 16/255, and experiments use only the first four video frames to reduce memory usage. Evaluations use two benchmarks, CelebV-Text (face-centric) and UCF101 (human actions), and multiple models including CogVideoX-5B, OpenSora v1.2, DynamiCrafter and a transfer test on Wan2.2-TI2V-5B. Baselines include several open-source cloaking methods.
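
The overall optimisation can be sketched as projected sign-gradient descent under an L-infinity budget. Everything here is a toy stand-in: a fixed linear map replaces the denoiser layer, finite differences replace backprop through the surrogate model, and only a targeted anchor-style loss is used; names, dimensions and step sizes are illustrative.

```python
# Fixed linear map standing in for a semantically rich denoiser layer.
W = [[0.5, -0.2], [0.1, 0.3]]

def features(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss(x, anchor):
    # Targeted anchor loss (IRA-style, simplified): squared distance between
    # the perturbed image's "features" and those of an unrelated latent.
    return sum((f - a) ** 2 for f, a in zip(features(x), anchor))

def protect(image, anchor, eps=16 / 255, steps=200, alpha=0.01):
    """Projected sign-gradient descent keeping the perturbation within eps."""
    delta = [0.0] * len(image)
    h = 1e-5
    for _ in range(steps):
        base = loss([i + d for i, d in zip(image, delta)], anchor)
        grad = []
        for j in range(len(delta)):
            bumped = delta[:]
            bumped[j] += h
            grad.append((loss([i + d for i, d in zip(image, bumped)], anchor) - base) / h)
        # Step against the gradient, then project back into the budget box.
        delta = [
            max(-eps, min(eps, d - alpha * (1.0 if g > 0 else -1.0)))
            for d, g in zip(delta, grad)
        ]
    return [i + d for i, d in zip(image, delta)]

image = [0.4, 0.6]             # toy two-pixel "photo"
anchor = features([0.9, 0.1])  # features of an unrelated image
protected = protect(image, anchor)
```

The real method runs the same shape of loop, but the gradient flows through the surrogate video diffusion model, the update lives in the L*a*b* and DCT domains, and the loss combines IRC, IRA and the perceptual terms.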

Key Findings

  • Dual-space perturbations outperform RGB-only updates: combining L*a*b* and low-frequency DCT perturbations (DSP) yields stronger and more robust degradation of generated videos than perturbing in RGB alone or in either domain separately.
  • Strong reduction in identity and video quality metrics: Anti-I2V achieves substantially lower identity-similarity scores and video-quality measures across models. Example results include an ISM of 0.151 and Q-Align(V) of 0.047 on DynamiCrafter compared with clean outputs of 0.528 and 0.794, and an ISM of 0.068 with Q-Align(V) 0.084 on UCF101 compared with clean 0.384 and 0.562.
  • Layer-wise objectives are complementary and effective: IRC and IRA individually reduce identity preservation and, when combined, produce further drops in identity and feature-similarity metrics, indicating disrupted feature propagation and temporal coherence.
  • Robustness and transferability: DSP is more stable under JPEG compression, blur, Gaussian noise and purification methods, and Anti-I2V transfers well between diffusion-transformer models. Cross-architecture transfer to UNet-based models shows weaker but still competitive protection.
  • Practical settings: a perturbation budget of 16/255 in L*a*b* balances imperceptibility and protection; experiments ran on a single NVIDIA A100 40GB GPU.

Limitations

Evaluation is limited to the chosen benchmarks, models and the first four frames as input, which reduces computational cost but may not reflect full-length generation scenarios. Transferability degrades in some cross-architecture cases. Perceptual-similarity metrics such as PSNR and SSIM can underestimate human imperceptibility because L*a*b* perturbations alter colour in perceptually subtle ways while producing larger pixel-wise differences. Some prior methods were excluded due to resource or implementation constraints, so comparisons are limited to available baselines. Adding RGB-domain updates can dilute DSP effectiveness because the fixed budget must be split.

Why It Matters

Anti-I2V presents a model-agnostic, practical defence against malicious image-to-video generation by targeting deeper representations and temporal feature propagation rather than only pixel values. Its dual-space perturbations and layer-wise objectives meaningfully reduce identity preservation and temporal coherence, making automated and malicious creation of realistic person-specific videos harder. The method is applicable to contemporary diffusion-transformer based systems and can be integrated into privacy and media-authentication workflows, though ongoing work is needed to assess robustness against future purification and model-hardening techniques.


