Untargeted Jailbreak Attacks Expose LLM Safety Gaps
Attacks
A recent paper describes an untargeted jailbreak attack against Large Language Models (LLMs). Unlike prior gradient-based methods that optimise towards a fixed malicious target, this technique aims only to maximise the probability of an unsafe response as judged by a separate model. The change sounds small, but in practice it expands the search space, speeds up optimisation and produces a broader range of harmful outputs.
The method runs in two stages. First, a judge model is used to craft an adversarial but linguistically plausible unsafe response. Second, the researchers project gradients through retokenisation to transfer that adversarial signal into the target LLM by updating a malicious prompt. Because the core objective is non-differentiable, they decompose it into two differentiable sub-objectives and provide theoretical justification for the decomposition. In experiments across established benchmarks and a mix of open and advanced models, the untargeted approach achieved over 80 per cent attack success with only 100 optimisation iterations and outperformed several state-of-the-art targeted attacks by more than 20 percentage points.
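To make the shift in objective concrete, here is a minimal, deliberately toy sketch of untargeted optimisation: the search keeps any prompt edit that raises a judge's unsafety score for the target's response, rather than pushing the response towards a fixed string. The functions `target_generate` and `judge_score` and the random-mutation step are illustrative stand-ins, not the paper's gradient-based method.

```python
# Toy sketch of an untargeted jailbreak objective: accept any prompt whose
# response a judge scores as more unsafe, rather than matching a fixed target
# string. target_generate() and judge_score() are stand-ins for a real target
# LLM and judge model; random mutation replaces the paper's gradient search.
import random

VOCAB = ["please", "ignore", "rules", "explain", "how", "to", "bypass", "safely"]

def target_generate(prompt: str) -> str:
    """Stand-in for the target LLM: simply echoes the prompt."""
    return f"response to: {prompt}"

def judge_score(response: str) -> float:
    """Stand-in for the judge model: pseudo 'unsafety' score in [0, 1]."""
    return sum(word in response for word in ("bypass", "ignore")) / 2.0

def untargeted_attack(seed_prompt: str, iterations: int = 100) -> tuple[str, float]:
    best_prompt = seed_prompt
    best_score = judge_score(target_generate(best_prompt))
    for _ in range(iterations):
        tokens = best_prompt.split()
        # Mutate one token; the paper instead selects substitutions using
        # gradients projected from the judge model through retokenisation.
        i = random.randrange(len(tokens))
        tokens[i] = random.choice(VOCAB)
        candidate = " ".join(tokens)
        score = judge_score(target_generate(candidate))
        if score > best_score:  # keep any edit that raises judged unsafety
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

print(untargeted_attack("explain how to do this please"))
```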
Two practical details matter for defenders. First, the attack uses a judge model to measure unsafety. That dependency is both a strength and a weakness: it enables efficient search but introduces reliance on the judge's alignment and coverage. Second, gradient projection needs retokenisation to map between judge and target vocabularies; that step is approximate but sufficient to transfer adversarial signals across models. The paper reports measurable transferability: when attacks trained on white-box models target larger models, success rates do not vanish, indicating real risk to deployed services.
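Measuring that transfer risk is simple in principle: replay prompts optimised against a white-box model on the deployed model and recompute the judge-scored attack success rate. The sketch below assumes placeholder `generate` and `judge_is_unsafe` callables; it is not the paper's evaluation harness.

```python
# Sketch of a cross-model transfer check: prompts found against a white-box
# model are replayed against another model and the judge-scored attack
# success rate (ASR) is recomputed. Both callables are placeholders.
from typing import Callable

def transfer_asr(prompts: list[str],
                 generate: Callable[[str], str],          # target model API
                 judge_is_unsafe: Callable[[str], bool],  # judge model verdict
                 ) -> float:
    hits = sum(judge_is_unsafe(generate(p)) for p in prompts)
    return hits / len(prompts) if prompts else 0.0

# Trivial stand-ins: any response containing "bypass" counts as unsafe.
print(transfer_asr(
    ["prompt a", "prompt with bypass"],
    generate=lambda p: f"echo: {p}",
    judge_is_unsafe=lambda r: "bypass" in r,
))  # 0.5
```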
This pattern is familiar from previous shifts in attacker behaviour. Early exploits tended to pursue specific payloads; once tooling matured, attackers moved to broader, automated strategies that aimed for any vulnerability they could exploit. Untargeted optimisation is the same strategic move in the LLM era: generalise the objective, automate the search, and let models reveal their weakest behaviours. The result is faster, more varied exploitation and harder-to-detect evasion of fixed filters.
What defenders should do
The practical response is straightforward and overdue. Treat alignment as a system property, not a model checkbox. Run adversarial testing that includes untargeted objectives and judge-model ensembles so you do not overfit defences to a small set of known bad outputs. Adopt defence in depth: runtime monitoring, content gating and multiple detectors with different modelling assumptions. Improve evaluation pipelines so safety checks consider tokenisation and retokenisation artefacts. Finally, include untargeted scenarios in red-team exercises and incident playbooks; if attackers can cheaply find any unsafe output, detection that looks only for known patterns will fail.
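As one illustration of the defence-in-depth point, the sketch below gates an output on several detectors with different modelling assumptions and fails closed if any of them objects. The three detectors are simple placeholders, assumed stand-ins for, say, a keyword filter, an ML safety classifier and a perplexity or repetition check.

```python
# Illustrative defence-in-depth output gate, not a production filter: release
# a response only if every detector passes. Each detector is a placeholder
# with a different modelling assumption, so a single evasion trick is less
# likely to slip past all of them.
from typing import Callable

def keyword_detector(text: str) -> bool:
    blocked = ("build a weapon", "credit card dump")  # placeholder phrase list
    return not any(phrase in text.lower() for phrase in blocked)

def classifier_detector(text: str) -> bool:
    # Placeholder for an ML safety classifier score checked against a threshold.
    pseudo_score = min(len(text) / 10_000, 1.0)
    return pseudo_score < 0.9

def repetition_detector(text: str) -> bool:
    # Placeholder for a perplexity/anomaly check: flags highly repetitive text.
    words = text.split()
    return not words or len(set(words)) > len(words) // 3

DETECTORS: list[Callable[[str], bool]] = [
    keyword_detector, classifier_detector, repetition_detector,
]

def release(text: str) -> bool:
    """Fail closed: every detector must pass before the output is released."""
    return all(detector(text) for detector in DETECTORS)

print(release("Here is a harmless summary of the request."))  # True
```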
The paper does not reveal an unstoppable technique. It does, however, remind practitioners that the next step in attacker evolution is often a tactical pivot to more flexible objectives. That pivot favours automation and speed. Preparing for it means hardening the whole inference stack, broadening testing objectives and assuming attackers will search for any unsafe behaviour rather than a single word sequence.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Untargeted Jailbreak Attack
🔍 ShortSpan Analysis of the Paper
Problem
Existing gradient-based jailbreak attacks optimise prompts towards fixed, predefined unsafe responses. That fixed target constrains the search space and limits effectiveness against safety-aligned LLMs, which motivates an untargeted formulation that accepts any response a judge model deems unsafe.
Approach
UJA operates in two stages. Stage one optimises an unsafe response using a judge model to maximise unsafety while preserving linguistic plausibility via a semantic consistency constraint. Stage two transfers the adversarial signal into the target LLM by gradient projection across token spaces, constructing a mapping from the judge's vocabulary to the target model's vocabulary through retokenisation; the projected gradient guides updates to the jailbreak prompt so that the target LLM's output aligns with the unsafe response. A theoretical proposition supports that the two differentiable sub-objectives yield an optimal solution to the original non-differentiable objective under continuous approximations.
The method is evaluated on two jailbreak benchmarks, AdvBench and HarmBench, which cover diverse harmful queries and tasks. Target LLMs comprise six white-box models (Llama-3-8B-Instruct, Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B-Instruct, Vicuna-7B-v1.5 and Mistral-7B-Instruct) and two advanced models (Llama-3-70B-Instruct and DeepSeek-R1-671B). Baselines comprise GCG, COLD-Attack, AdvPrefix, DRL, I-GCG, llm-adaptive and PAP, all run with official defaults and budgets restricted to 100 iterations for fairness. Two judge models assess harmfulness: GPTFuzzer for response-level unsafety and HarmBench-Llama-2-13b-cls for joint prompt-response evaluation. Defence analyses use Perplexity, SmoothLLM and Paraphrase.
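The retokenisation step deserves a concrete picture. The toy below projects per-token scores from a word-level "judge" vocabulary onto a character-level "target" vocabulary; the paper projects gradients between real subword vocabularies, but the mapping idea, and why it is only approximate, is the same. All names here are illustrative.

```python
# Toy illustration of projecting per-token signal from a judge vocabulary onto
# a target vocabulary via retokenisation. The real attack projects gradients
# between large subword vocabularies; a word-level judge and character-level
# target tokenizer simply make the (lossy) mapping easy to see.
from collections import defaultdict

def judge_tokenize(text: str) -> list[str]:
    return text.split()          # word-level "judge" tokens

def target_tokenize(token: str) -> list[str]:
    return list(token)           # character-level "target" tokens

def project_scores(text: str, judge_scores: dict[str, float]) -> dict[str, float]:
    """Spread each judge token's score over the target tokens it retokenises to."""
    projected: dict[str, float] = defaultdict(float)
    for judge_token in judge_tokenize(text):
        score = judge_scores.get(judge_token, 0.0)
        pieces = target_tokenize(judge_token)
        for piece in pieces:
            projected[piece] += score / len(pieces)  # approximate, lossy split
    return dict(projected)

print(project_scores("bypass the filter", {"bypass": 0.9, "filter": 0.7}))
```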
Key Findings
- Untargeted optimisation yields high attack success rates with a modest budget, achieving over 80 per cent success with 100 optimisation iterations and outperforming state-of-the-art targeted attacks by more than 20 percentage points.
- On AdvBench, against Llama 3 the attack achieves ASR-G 89.0 per cent and ASR-H 57.0 per cent, while the best baselines trail well behind.
- On HarmBench, UJA achieves average ASR-G 65.5 per cent and ASR-H 71.0 per cent across five target LLMs, outperforming COLD-Attack, PAP and I-GCG on average.
- Across six white-box LLMs the average jailbreak success rate on AdvBench is 71.9 per cent with 100 iterations, and on HarmBench the average ASR-G is 65.5 per cent and ASR-H 71.0 per cent, with UJA consistently higher than baselines in most scenarios.
- UJA shows strong transferability, with ASR-G 35 per cent and ASR-H 44 per cent when transferring from white-box models to Llama-3-70B-Instruct on AdvBench and HarmBench, outperforming several baselines in transfer scenarios.
- Response diversity analyses reveal that UJA exposes a broader range of harmful behaviours, with embeddings spanning more clusters than fixed-target approaches, indicating a wider exploration of unsafe capabilities.
- Post-defence results demonstrate that UJA remains effective under defence mechanisms: under SmoothLLM UJA attains 60 per cent ASR-G whereas the best baselines reach at most 54 per cent, and under Perplexity (97 per cent) and Paraphrase (46 per cent) UJA remains superior to competing methods.
- Iterative efficiency shows rapid convergence: 25 iterations yield 64.0 per cent ASR-G and 52.0 per cent ASR-H, rising to 84.8 per cent and 64.0 per cent by 100 iterations, highlighting efficient exploration under tight budgets.
- Ablation studies show Stage 1 adversarial response optimisation significantly boosts performance compared with removing it, and the choice of judge model is robust, with UJA performing well under either Llama Guard 3 or GPTFuzzer guidance.
Limitations
The approach relies on judge models to estimate unsafety, raising concerns about potential misalignment with actual harm and dependence on judge quality. Gradient projection relies on retokenisation to align token spaces, which introduces approximation errors across different vocabularies. Experiments are conducted with a 100-iteration budget, so performance with larger budgets or different computational resources may vary. Although defence methods reduce attack success, post-defence results indicate that untargeted strategies can still bypass some safety measures, suggesting the need for defence in depth and broader evaluation pipelines.
Why It Matters
The untargeted approach broadens the attack surface beyond fixed targets, revealing vulnerabilities even in safety-aligned LLMs. The use of a judge model to quantify unsafety, combined with a two-stage optimisation and gradient projection, enables efficient yet flexible exploitation of weaknesses, highlighting new risks for AI agents with content safety filters. Societal and security implications include the automated generation of harmful content, manipulation, and evasion of moderation in public and enterprise AI services. Defenders should consider multi-layer safety checks, runtime monitoring, red teaming that includes untargeted scenarios, and improved alignment and evaluation pipelines to anticipate broader jailbreak strategies.