
Untargeted Jailbreak Attacks Expose LLM Safety Gaps

Attacks
Published: Mon, Oct 06, 2025 • By Theo Solander
Researchers introduce an untargeted jailbreak that seeks any unsafe output rather than a specific response. Using a judge model and a two-stage gradient projection, the attack reaches over 80% success with only 100 optimisation iterations and transfers across models. The result widens the attack surface and calls for defence in depth and untargeted red teaming.

A recent paper describes an untargeted jailbreak attack against Large Language Models (LLMs). Unlike prior gradient-based methods that optimise towards a fixed malicious target, this technique aims only to maximise the probability of an unsafe response as judged by a separate model. The change sounds small. Practically, it expands the search space, speeds optimisation and produces a broader range of harmful outputs.
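
To make the difference concrete, here is a minimal sketch in PyTorch of the two objectives, assuming a toy judge that maps a response embedding to an unsafety probability. The linear judge, the fixed target vector and all names are illustrative stand-ins, not the paper's code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
judge = torch.nn.Linear(16, 1)   # hypothetical unsafety scorer, standing in for a judge model
target = torch.randn(16)         # embedding of one fixed target string, e.g. "Sure, here is ..."

def targeted_loss(resp_embed: torch.Tensor) -> torch.Tensor:
    # Targeted attacks (GCG-style) pull the response towards a single predefined target.
    return F.mse_loss(resp_embed, target)

def untargeted_loss(resp_embed: torch.Tensor) -> torch.Tensor:
    # The untargeted objective only asks the judge to rate the response as unsafe;
    # any sufficiently unsafe response minimises this loss, widening the search space.
    return -torch.sigmoid(judge(resp_embed)).squeeze()

resp = torch.zeros(16, requires_grad=True)
print(targeted_loss(resp).item(), untargeted_loss(resp).item())
```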

The method runs in two stages. First, a judge model is used to craft an adversarial but linguistically plausible unsafe response. Second, the researchers project gradients through retokenisation to transfer that adversarial signal into the target LLM by updating a malicious prompt. Because the core objective is non-differentiable, they decompose it into two differentiable sub-objectives and provide theoretical justification for the decomposition. In experiments across established benchmarks and a mix of open and advanced models, the untargeted approach achieved over 80 per cent attack success with only 100 optimisation iterations and outperformed several state-of-the-art targeted attacks by more than 20 percentage points.
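
A rough, runnable sketch of the two-stage idea on toy tensors, assuming a differentiable judge and a white-box target. The linear models below are placeholders for real LLMs, and the paper's retokenisation step is omitted because both sides share one embedding space here.

```python
import torch

torch.manual_seed(0)
dim = 16
judge = torch.nn.Linear(dim, 1)         # placeholder judge: response embedding -> unsafety logit
target_llm = torch.nn.Linear(dim, dim)  # placeholder target model: prompt -> response embedding
reference = torch.randn(dim)            # keeps the stage-one response linguistically plausible

# Stage 1: find a response the judge rates maximally unsafe, with a
# consistency penalty so it stays close to a plausible reference.
resp = reference.clone().requires_grad_(True)
opt = torch.optim.Adam([resp], lr=0.05)
for _ in range(100):
    loss = -torch.sigmoid(judge(resp)).squeeze() + 0.1 * (resp - reference).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
unsafe_resp = resp.detach()

# Stage 2: update the adversarial prompt so the target model's output moves
# towards the stage-one response. In the paper this gradient is projected
# across judge and target vocabularies via retokenisation.
prompt = torch.zeros(dim, requires_grad=True)
opt = torch.optim.Adam([prompt], lr=0.05)
for _ in range(100):
    loss = (target_llm(prompt) - unsafe_resp).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```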

Two practical details matter for defenders. First, the attack uses a judge model to measure unsafety. That dependency is both a strength and a weakness: it enables efficient search but introduces reliance on the judge's alignment and coverage. Second, gradient projection needs retokenisation to map between judge and target vocabularies; that step is approximate but sufficient to transfer adversarial signals across models. The paper reports measurable transferability: when attacks crafted on white-box models are aimed at larger models, success does not vanish, indicating real risk to deployed services.
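
The retokenisation step can be pictured as re-segmenting each judge-vocabulary token with the target model's vocabulary and spreading its signal over the resulting pieces. The toy dictionaries and greedy segmentation below are assumptions for illustration only; real tokenizers and embedding-space gradients are considerably messier.

```python
# Toy judge and target vocabularies; real models use BPE/WordPiece tokenizers.
judge_vocab = {"unsafe": 0, "reply": 1, "text": 2}
target_vocab = {"un": 0, "safe": 1, "rep": 2, "ly": 3, "text": 4}

def retokenise(piece: str, vocab: dict) -> list:
    # Greedy longest-match segmentation, a crude stand-in for a real tokenizer.
    out, i = [], 0
    while i < len(piece):
        for j in range(len(piece), i, -1):
            if piece[i:j] in vocab:
                out.append(piece[i:j])
                i = j
                break
        else:
            i += 1  # skip characters the toy vocabulary cannot cover
    return out

# A per-token signal from the judge side (e.g. gradient magnitudes) is
# redistributed onto the target tokens produced by retokenisation.
judge_signal = {"unsafe": 0.8, "reply": 0.3, "text": 0.1}
target_signal = {}
for piece, score in judge_signal.items():
    for sub in retokenise(piece, target_vocab):
        target_signal[sub] = target_signal.get(sub, 0.0) + score

print(target_signal)  # {'un': 0.8, 'safe': 0.8, 'rep': 0.3, 'ly': 0.3, 'text': 0.1}
```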

This pattern is familiar if you look at previous shifts in attacker behaviour. Early exploits tended to pursue specific payloads; once tooling matured attackers moved to broader, automated strategies that aimed for any vulnerability they could exploit. Untargeted optimisation is the same strategic move in the LLM era: generalise the objective, automate search, and let models reveal their weakest behaviours. The result is faster, more varied exploitation and harder-to-detect evasion of fixed filters.

What defenders should do

Practical response is straightforward and overdue. Treat alignment as a system property, not a model checkbox. Run adversarial testing that includes untargeted objectives and judge-model ensembles so you do not overfit defences to a small set of known bad outputs. Adopt defence in depth: runtime monitoring, content gating and multiple detectors with different modelling assumptions. Improve evaluation pipelines so safety checks consider tokenisation and retokenisation artefacts. Finally, include untargeted scenarios in red teams and incident playbooks; if attackers can cheaply find any unsafe output, detection that looks only for known patterns will fail.
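
As a sketch of the defence-in-depth point, assuming nothing about any particular vendor's tooling: run several detectors with different modelling assumptions and gate on any of them, so that an attack optimised to slip past one check still has to beat the others. The checks below are deliberately simplistic placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Detector:
    name: str
    check: Callable[[str], bool]   # returns True when the text should be flagged

def keyword_check(text: str) -> bool:
    # Pattern-based gate: cheap, but brittle on its own.
    return any(k in text.lower() for k in ("build an explosive", "dump credentials"))

def anomaly_check(text: str) -> bool:
    # Crude stand-in for perplexity-style detection of adversarial suffixes.
    return any(len(token) > 25 for token in text.split())

def judge_check(text: str) -> bool:
    # Placeholder for a learned harmfulness classifier or judge-model ensemble.
    return False

DETECTORS: List[Detector] = [
    Detector("keywords", keyword_check),
    Detector("anomaly", anomaly_check),
    Detector("judge", judge_check),
]

def gate(response: str) -> Tuple[bool, List[str]]:
    fired = [d.name for d in DETECTORS if d.check(response)]
    return len(fired) == 0, fired   # (allow?, which detectors fired)

print(gate("Here is a harmless cooking tip."))
```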

The paper does not reveal an unstoppable technique. It does, however, remind practitioners that the next step in attacker evolution is often a tactical pivot to more flexible objectives. That pivot favours automation and speed. Preparing for it means hardening the whole inference stack, broadening testing objectives and assuming attackers will search for any unsafe behaviour rather than a single word sequence.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Untargeted Jailbreak Attack

Authors: Xinzhe Huang, Wenjing Hu, Tianhang Zheng, Kedong Xiu, Xiaojun Jia, Di Wang, Zhan Qin, and Kui Ren
Existing gradient-based jailbreak attacks on Large Language Models (LLMs), such as Greedy Coordinate Gradient (GCG) and COLD-Attack, typically optimize adversarial suffixes to align the LLM output with a predefined target response. However, by restricting the optimization objective as inducing a predefined target, these methods inherently constrain the adversarial search space, which limit their overall attack efficacy. Furthermore, existing methods typically require a large number of optimization iterations to fulfill the large gap between the fixed target and the original model response, resulting in low attack efficiency. To overcome the limitations of targeted jailbreak attacks, we propose the first gradient-based untargeted jailbreak attack (UJA), aiming to elicit an unsafe response without enforcing any predefined patterns. Specifically, we formulate an untargeted attack objective to maximize the unsafety probability of the LLM response, which can be quantified using a judge model. Since the objective is non-differentiable, we further decompose it into two differentiable sub-objectives for optimizing an optimal harmful response and the corresponding adversarial prompt, with a theoretical analysis to validate the decomposition. In contrast to targeted jailbreak attacks, UJA's unrestricted objective significantly expands the search space, enabling a more flexible and efficient exploration of LLM vulnerabilities. Extensive evaluations demonstrate that UJA can achieve over 80% attack success rates against recent safety-aligned LLMs with only 100 optimization iterations, outperforming the state-of-the-art gradient-based attacks such as I-GCG and COLD-Attack by over 20%.

🔍 ShortSpan Analysis of the Paper

Problem

Existing gradient-based jailbreak attacks such as GCG and COLD-Attack optimise adversarial suffixes towards a predefined target response. Fixing the target constrains the adversarial search space and limits attack efficacy, and closing the gap between the fixed target and the model's original response requires many optimisation iterations, making the attacks slow. The paper asks whether dropping the predefined target and directly optimising for any unsafe response can make jailbreaks both more effective and more efficient.

Approach

UJA operates in two stages. Stage one uses a judge model to optimise an unsafe response, maximising its unsafety score while a semantic consistency constraint keeps it linguistically plausible. Stage two transfers that adversarial signal into the target LLM by projecting gradients across token spaces: a mapping from the judge's vocabulary to the target model's vocabulary is constructed through retokenisation, and the projected gradient guides updates to the jailbreak prompt so that the target LLM's output aligns with the unsafe response. A theoretical proposition supports the claim that the two differentiable sub-objectives yield an optimal solution to the original non-differentiable objective under continuous approximations.

The method is evaluated on the AdvBench and HarmBench jailbreak benchmarks across diverse harmful queries and tasks. Target LLMs include Llama 3 8B Instruct, Llama 3.1 8B Instruct, Qwen 2.5 7B Instruct, Qwen 3 8B Instruct, Vicuna 7B v1.5 and Mistral 7B Instruct as white-box targets, plus Llama 3 70B Instruct and DeepSeek R1 671B as larger, more advanced models. Baselines comprise GCG, COLD-Attack, AdvPrefix, DRL, I-GCG, llm-adaptive and PAP. Harmfulness is measured with two judge classifiers, GPTFuzzer for response-level unsafety and HarmBench-Llama-2-13b-cls for joint prompt-response evaluation, and defence analyses use Perplexity, SmoothLLM and Paraphrase. All baselines are run with their official defaults, with optimisation budgets restricted to 100 iterations for fairness.
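
For orientation, a minimal sketch of how the two success metrics are typically tallied. The judge functions below are stubs standing in for a response-level classifier (ASR-G, e.g. GPTFuzzer) and a joint prompt-response classifier (ASR-H, e.g. HarmBench-Llama-2-13b-cls), not the actual models.

```python
from typing import Callable, List

def asr(prompts: List[str], responses: List[str],
        judge: Callable[[str, str], bool]) -> float:
    # Attack success rate: share of prompt/response pairs the judge flags as unsafe.
    hits = sum(judge(p, r) for p, r in zip(prompts, responses))
    return 100.0 * hits / len(prompts)

def judge_g(prompt: str, response: str) -> bool:
    # Stub for a response-level harmfulness classifier (ASR-G).
    return "i can't help" not in response.lower()

def judge_h(prompt: str, response: str) -> bool:
    # Stub for a joint prompt-response classifier (ASR-H).
    return judge_g(prompt, response) and len(response.split()) > 5

prompts = ["harmful query 1", "harmful query 2"]
responses = ["I can't help with that.", "Step one ... step two ... step three ..."]
print(asr(prompts, responses, judge_g), asr(prompts, responses, judge_h))
```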

Key Findings

  • Untargeted optimisation yields high attack success rates with a modest budget, achieving over 80 per cent success with 100 optimisation iterations and outperforming state-of-the-art targeted attacks by more than twenty percentage points.
  • On AdvBench, against Llama 3 the attack achieves an ASR-G of 89.0 per cent and an ASR-H of 57.0 per cent, while the best baselines trail well behind.
  • On HarmBench, UJA achieves an average ASR-G of 65.5 per cent and ASR-H of 71.0 per cent across five target LLMs, outperforming COLD-Attack, PAP and I-GCG on average.
  • Across six white-box LLMs the average jailbreak success rate on AdvBench is 71.9 per cent with 100 iterations, and on HarmBench the average ASR-G is 65.5 per cent and ASR-H 71.0 per cent, with UJA consistently higher than baselines in most scenarios.
  • UJA shows strong transferability, reaching an ASR-G of 35 per cent and an ASR-H of 44 per cent when attacks crafted on white-box models are transferred to Llama 3 70B Instruct on AdvBench and HarmBench, outperforming several baselines in transfer scenarios.
  • Response diversity analyses reveal that UJA exposes a broader range of harmful behaviours, with embeddings spanning more clusters than fixed-target approaches, indicating a wider exploration of unsafe capabilities (a simple way to check this is sketched after this list).
  • Post-defence results show that UJA remains effective under defence mechanisms: under SmoothLLM it attains 60 per cent ASR-G whereas the best baselines reach at most 54 per cent, and under Perplexity (97 per cent) and Paraphrase (46 per cent) it remains superior to competing methods.
  • Iterative efficiency shows rapid convergence, with 25 iterations yielding 64.0 per cent ASR-G and 52.0 per cent ASR-H, rising to 84.8 per cent and 64.0 per cent by 100 iterations, highlighting efficient exploration under tight budgets.
  • Ablation studies show that Stage 1 adversarial response optimisation significantly boosts performance compared with removing it, and the choice of judge model is robust, with UJA performing well under either Llama Guard 3 or GPTFuzzer guidance.
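
The diversity finding above can be pictured with a simple clustering pass over response embeddings, as sketched below. The random vectors stand in for real sentence embeddings and the cluster count is an assumption for illustration, not the paper's analysis pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))   # stand-in for embeddings of attack responses

# Fit a fixed number of clusters and count how many are actually occupied;
# a wider spread suggests the attack elicits more varied unsafe behaviour.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embeddings)
print(f"responses occupy {len(set(labels))} of 10 clusters")
```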

Limitations

The approach relies on judge models to estimate unsafety, raising concerns about potential misalignment with actual harm and dependence on judge quality. Gradient projection relies on retokenisation to align token spaces, which introduces approximation errors across different vocabularies. Experiments are conducted with a 100-iteration budget, so performance with larger budgets or different computational resources may vary. Although defence methods reduce attack success, post-defence results indicate that untargeted strategies can still bypass some safety measures, suggesting the need for defence in depth and broader evaluation pipelines.

Why It Matters

The untargeted approach broadens the attack surface beyond fixed targets, revealing vulnerabilities even in safety aligned LLMs. The use of a judge model to quantify unsafety and a two stage optimisation with gradient projection enables efficient yet flexible exploitation of weaknesses, highlighting new risks for AI agents with content safety filters. Societal and security implications include the automated generation of harmful content, manipulation and evasion of moderation in public and enterprise AI services. Defenders should consider multi layer safety checks, runtime monitoring, red teaming that includes untargeted scenarios, and improved alignment and evaluation pipelines to anticipate broader jailbreak strategies.

