VEIL Exploits Text-to-Video Models' Hidden Cues
Attacks
Text-to-video systems promised richer creation tools and, for defenders, a fresh set of headaches. A recent paper describes VEIL, an attack that quietly weaponises models' cross-modal habits to produce semantically unsafe videos without obviously malicious prompts. The technique matters because it hides intent in plain sight, making text-only filters a blunt, and often ineffective, instrument.
How VEIL works
VEIL composes prompts from three modular parts. Neutral scene anchors keep the surface text plausible. Latent auditory triggers are innocuous-sounding audio descriptions, for example creaking or muffled noises, that exploit learned audio–visual co-occurrence to nudge the model toward particular visual concepts. Stylistic modulators are cinematic directions, such as camera framing or atmosphere, that amplify and stabilise the latent cue. The researchers formalise prompt search as a constrained optimisation and use a Large Language Model (LLM) to guide the search over combinations while two oracles score outcomes: a visual oracle captions generated videos and provides a visual safety score, and a textual oracle evaluates prompt stealth.
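To make the modular structure and guided search concrete, here is a minimal red-team-style sketch in Python. The component strings, oracle functions and thresholds are illustrative assumptions, not the paper's implementation; the paper uses an LLM to propose candidates rather than enumerating them.

```python
import itertools

# Illustrative component pools based on the article's description; the paper draws
# its anchors, triggers and modulators from an adversarial grammar instead.
ANCHORS = ["An old wooden cabin at dusk", "A quiet corridor at night"]
TRIGGERS = ["with faint creaking and muffled noises nearby"]   # latent auditory cue
MODULATORS = ["shot in a tense, low-light handheld style"]     # stylistic amplifier


def textual_oracle(prompt: str) -> float:
    """Score how malicious the prompt text looks to a filter (0 = stealthy, 1 = obvious)."""
    raise NotImplementedError  # plug in your own text safety classifier


def visual_oracle(video: bytes) -> float:
    """Caption the generated video and score how unsafe the visuals are (0..1)."""
    raise NotImplementedError  # plug in a captioner plus a safety classifier


def generate_video(prompt: str) -> bytes:
    """Call the text-to-video system under test (black-box API)."""
    raise NotImplementedError


def red_team_search(tau: float = 0.2, max_candidates: int = 3):
    """Enumerate anchor/trigger/modulator combinations, skip any that breach the
    text-stealth threshold tau, and rank the rest by the visual oracle's score.
    The paper instead lets an LLM propose candidates over up to three refinement steps."""
    best_prompt, best_score = None, 0.0
    candidates = itertools.islice(itertools.product(ANCHORS, TRIGGERS, MODULATORS),
                                  max_candidates)
    for anchor, trigger, modulator in candidates:
        prompt = f"{anchor}, {trigger}, {modulator}."
        if textual_oracle(prompt) > tau:    # stealth constraint: text must look benign
            continue
        score = visual_oracle(generate_video(prompt))
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```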
The attack assumes a black-box API and treats the target like any other external service, which is exactly what a practical attacker would do. The authors evaluate seven text-to-video models using a curated set of 700 prompts across fourteen safety categories, generating five-second videos and scoring them with the help of GPT-4o and a video captioning model. VEIL achieves a 23 percentage point improvement in average attack success rate (ASR) over competing methods. One striking case is the Hailuo model, where VEIL reaches 60 percent ASR compared with 28 percent for Ring-A-Bell, 31 percent for DACA and 33 percent for the baseline. VEIL also proves more robust under an LLM-based defence, maintaining about 30 percent ASR on the Seedance model where the next best method is near 6 percent. Ablation studies show each modular component matters: removing the anchor harms stealth, removing the trigger removes the harmful outcome, and removing the modulator reduces consistency.
The work is clear about limits. The search routine is computationally heavy, results depend on the fidelity of captioning and safety scores, and the evaluation covers seven models so it may not generalise to every text-to-video system. Still, the paper exposes a practical, repeatable route to bypassing common guardrails.
Defence implications and practical steps
The key lesson is simple and uncomfortable: text-only safety checks are not enough. Models learn cross-modal priors and those priors can be weaponised through benign language. Defences should therefore include multimodal checks, adversarial testing against modular prompt templates, and stronger alignment work that links text filters to likely visual outcomes rather than to surface text alone.
Two concrete actions for security teams
- Red-team your T2V pipelines using modular prompts that mimic anchor, trigger and modulator patterns to see what slips through.
- Integrate a visual oracle or automated captioning check into your content pipeline so prompts that look safe but yield unsafe visuals get flagged for human review; a minimal sketch follows below.
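A minimal sketch of that captioning check, assuming you already have a video captioner and a text safety classifier to call; the function names and threshold are placeholders, not a specific product's API.

```python
def caption_video(video: bytes) -> str:
    """Placeholder: call your video captioning model or service."""
    raise NotImplementedError


def text_safety_score(text: str) -> float:
    """Placeholder: call your existing text safety classifier (returns risk in 0..1)."""
    raise NotImplementedError


def review_generation(prompt: str, video: bytes, threshold: float = 0.5) -> str:
    """Apply the visual-oracle idea defensively: run the safety check on a caption of
    the output video, not only on the input prompt."""
    prompt_risk = text_safety_score(prompt)
    visual_risk = text_safety_score(caption_video(video))
    if visual_risk >= threshold and prompt_risk < threshold:
        return "escalate_to_human_review"   # stealthy prompt, unsafe visuals: the VEIL pattern
    if visual_risk >= threshold:
        return "block"
    return "allow"
```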
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates jailbreaking text-to-video (T2V) models by exploiting cross-modal latent knowledge rather than relying on obviously unsafe prompts. It shows that benign-looking prompts containing rich implicit cues can drive models to generate semantically unsafe videos that violate policy while still realising the blocked intent. This reveals a new attack surface in multimodal generators and motivates defences that guard against implicit cross-modal prompt injections.
Approach
VEIL (Visual Exploitation via Implicit Language) uses a modular prompt design with three components: neutral scene anchors, which provide plausible surface-level scene descriptions; latent auditory triggers, which are innocuous-sounding textual descriptions of audio events that leverage learned audio-visual co-occurrence priors to bias visual concepts; and stylistic modulators, which are cinematic directives that amplify and stabilise the latent trigger. The attack is formalised as a constrained optimisation over the modular prompt space and solved with a guided search procedure that balances stealth and effectiveness. The search uses two oracles: a visual oracle that captions generated videos and yields a visual safety score, and a textual oracle that evaluates the prompt itself for malice. An adversarial grammar restricts prompts to combinations of anchor, trigger and modulator components, and a stealth constraint keeps prompts inconspicuous to text-based filters. A zeroth-order, LLM-guided search navigates the discrete prompt space, iterating up to three steps or until a sufficiently harmful and aligned prompt is found. The evaluation uses seven T2V models spanning open-source and commercial systems, a curated 700-prompt subset covering fourteen safety categories, and a generation setup of five-second videos, with GPT-4o and a video captioning model aiding assessment. Hyper-parameters and default settings are disclosed and used to study robustness and performance.
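A plausible way to write this constrained search, consistent with the description and with the stealth threshold tau and alignment weight lambda discussed in the hyper-parameter findings below; this is a reconstruction, not the paper's exact notation.

```latex
% Reconstruction of the constrained prompt search; notation is ours, not the paper's.
%   p       candidate prompt from the adversarial grammar \mathcal{P}_{\mathrm{grammar}}
%   G(p)    video generated by the target model
%   S_vis   visual oracle's unsafety score of the captioned video
%   S_txt   textual oracle's maliciousness score of the prompt text
%   Align   agreement between the prompt and the intended unsafe concept c
\max_{p \,\in\, \mathcal{P}_{\mathrm{grammar}}}
    \; S_{\mathrm{vis}}\bigl(G(p)\bigr) \;+\; \lambda\,\mathrm{Align}(p, c)
\qquad \text{subject to} \qquad S_{\mathrm{txt}}(p) \,\le\, \tau
```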
Key Findings
- VEIL achieves a substantial improvement in attack success rate (ASR) on commercial models, with an average increase of 23 percentage points over competing methods.
- Across seven models, VEIL demonstrates strong effectiveness, notably on the Hailuo model, where it attains 60.0% ASR compared with 28.0% for Ring-A-Bell, 31.0% for DACA and 33.0% for the baseline, showing a significant uplift against a strongly guarded system.
- The attack relies on a three-component prompt: an anchor for plausibility, a trigger carrying the core instruction, and a modulator that shapes mood and context. Ablation shows that removing the anchor collapses stealth and lowers ASR, removing the trigger drastically reduces harmful outcomes, and removing the modulator weakens the effect and its consistency across categories.
- Hyper-parameter studies show that a balanced objective and stealth constraint are crucial. A middle-ground stealth threshold (tau around 0.2) yields the best results, and a moderate alignment weight (lambda around 0.5), which ties the prompt to the intended unsafe content, produces the highest ASR across models; settings that are too strict or too lax reduce effectiveness.
- VEIL remains more robust than baselines against an LLM-based defence that filters prompts, maintaining a notable 30% ASR on the challenging Seedance model compared with 6% for the next best method. This reflects the cross-modal nature of VEIL, which hides harmful intent in safe-looking components that are difficult to anticipate from text analysis alone.
- Defence implications are clear: mitigating such attacks requires stronger multimodal alignment and safety controls, including defences against implicit or cross-modal prompt injections and more robust prompt filtering that accounts for latent cues; a toy sketch of the latter follows below.
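As a toy illustration of prompt filtering that accounts for latent cues, the sketch below routes prompts that pair audio-event language with cinematic styling to the heavier caption-based check; the keyword patterns are illustrative assumptions, not a recommended detector.

```python
import re

# Purely illustrative cue patterns; a production filter would use learned detectors
# trained on audio-visual co-occurrence, not keyword matching.
AUDIO_CUES = re.compile(r"\b(creak\w*|muffled|whisper\w*|thud\w*)\b", re.IGNORECASE)
STYLE_CUES = re.compile(r"\b(handheld|low[- ]light|close[- ]up|slow[- ]motion)\b", re.IGNORECASE)


def needs_visual_check(prompt: str) -> bool:
    """Flag prompts that combine innocuous audio descriptions with cinematic modulators,
    so they go through the caption-based visual check even if the text looks benign."""
    return bool(AUDIO_CUES.search(prompt)) and bool(STYLE_CUES.search(prompt))
```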
Limitations
The approach is computationally intensive due to the LLM-guided search and the repeated calls to the target model and captioning oracles; results rely on the quality of the visual and textual safety scores, which may introduce noise; the evaluation covers seven models and may not generalise to all T2V systems; and the threat model assumes a black-box API with no internal access to safety mechanisms.
Why It Matters
The work highlights an important safety risk for multimodal generators: latent cross-modal associations learned during training can be weaponised to bypass guardrails. It underscores the need for stronger multimodal alignment and safety controls, including defence strategies against implicit cross-modal prompts and adversarial testing with modular prompt scenarios. The societal risks include stealthy deepfake-style videos that could be used for misinformation, manipulation or abuse without triggering obvious safety alarms.