Bandits pick winning LLM jailbreaks at scale
New research shows a simple recipe for high-success jailbreaks: use bandit algorithms to pick prompts and add complexity to malicious queries. Tested on 15 open-weight LLMs, the approach reaches an average attack success rate of up to 97% when sampling five jailbreaks per query. It also transfers across models, stays effective when the top-performing jailbreaks are removed, and works best on complex requests.
Jailbreaks are everywhere, but picking the right one for a given Large Language Model (LLM) has been more art than science. This paper turns it into a search problem you can automate. Treat each known jailbreak as an option, try a few, learn which ones work on your target, then exploit that edge at scale.
How the attack works
The authors cast jailbreak selection as a multi-armed bandit. Each jailbreak is an arm. You run a small exploration phase, score model outputs for harmfulness with a rater model, and update your beliefs using standard bandit algorithms like Thompson Sampling or EXP3. Then you apply the learnt policy to a larger exploitation set, either keeping it fixed (a transfer attack) or continuing to update it online (a continual attack).
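The exploration loop can be sketched with Beta-Bernoulli Thompson Sampling. This is a minimal illustration on simulated data, not the authors' implementation: arms are abstract indices, `success_prob` holds made-up success rates standing in for a rater model's binary verdicts, and the function names are hypothetical.

```python
import random


def thompson_sampling(success_prob, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson Sampling on simulated arms.

    success_prob: hypothetical per-arm success rates (synthetic, not from the paper).
    Returns per-arm counts of observed successes and failures.
    """
    rng = random.Random(seed)
    n_arms = len(success_prob)
    wins = [0] * n_arms
    losses = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its Beta posterior
        # (a uniform Beta(1, 1) prior is assumed here).
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_arms)]
        arm = draws.index(max(draws))
        # Simulated binary outcome in place of a real rater score.
        if rng.random() < success_prob[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


def most_pulled(wins, losses):
    """Index of the arm the sampler chose most often."""
    pulls = [w + l for w, l in zip(wins, losses)]
    return pulls.index(max(pulls))
```

In a transfer setting the counts would be frozen after this phase; in a continual setting the same update would keep running on the exploitation set.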
In practice, it works alarmingly well. Across 15 open-weight models (from 0.27B to 120B parameters), sampling the top few jailbreaks per query drives the average attack success rate as high as 97% when k=5. Even single-pass selection beats a naive uniform attacker by roughly 27–37%. Continual attacks add another 5–6% on top. The key point: you do not need to brute force a huge list. A few informed probes quickly surface model-specific jailbreaks.
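A rough way to see why several attempts per query push the success rate so high: if a query counts as a success when any one of the k attempts succeeds, and the attempts are independent, the combined rate is one minus the product of the individual failure rates. Both conditions are assumptions made for this sketch (the summary does not state how the paper scores k=5), and the numbers are illustrative rather than taken from the paper.

```python
def any_success_prob(per_attempt_probs):
    """Probability that at least one attempt succeeds, assuming independence."""
    miss = 1.0
    for p in per_attempt_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        miss *= 1.0 - p
    return 1.0 - miss


# Illustrative only: five independent attempts at 50% each.
combined = any_success_prob([0.5] * 5)  # 1 - 0.5**5 = 0.96875
```

Real attempts against the same model are likely to be correlated, so this figure is better read as an optimistic bound than as a prediction.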
It also transfers. Policies learnt on proxy models often hold up on targets with only a small drop, and domain hold-outs cost around six percentage points. Remove the top-performing jailbreaks and the bandits still reach at least 50% success, while uniform sampling degrades substantially. If you try to brute force under a tight budget, a bad prompt ordering can sink you; the bandit’s regret guarantees make that much less likely.
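One way to reason about transfer is to treat the learnt policy as a fixed probability distribution over arms and ask what success rate it yields on a different target. The sketch below does that arithmetic. The function names and all numbers are hypothetical, chosen only to show how a policy concentrated on a good arm compares with uniform sampling.

```python
def expected_success(policy, target_rates):
    """Expected success rate of a fixed policy on a target.

    policy: probability of choosing each arm (must sum to 1).
    target_rates: hypothetical per-arm success rates on the target.
    """
    if len(policy) != len(target_rates):
        raise ValueError("policy and target_rates must have the same length")
    if abs(sum(policy) - 1.0) > 1e-9:
        raise ValueError("policy must sum to 1")
    return sum(p * r for p, r in zip(policy, target_rates))


def uniform_policy(n_arms):
    """Baseline that picks every arm with equal probability."""
    return [1.0 / n_arms] * n_arms


# Illustrative numbers only: a policy learnt on a proxy puts most mass on arm 2.
target = [0.2, 0.3, 0.6]
learnt = expected_success([0.1, 0.1, 0.8], target)    # 0.53
baseline = expected_success(uniform_policy(3), target)  # about 0.367
```

The policy transfers well whenever the arms it favours are also among the stronger arms on the target, even if their exact rates differ.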
Query complexity is an attack surface
The other dial is the query itself. The baseline attack success rate, with no jailbreak at all, averages about 44% on this benchmark. Complex queries are inherently riskier (about 50% versus 39% for simple). Adding complexity to queries can raise success by up to 26% for some jailbreaks, and this can be automated. The team built FrankensteinBench, 11,279 malicious queries drawn from seven sources, automatically enhanced and then labelled as simple or complex by an ensemble classifier. They ran roughly 12 million query–response pairs, with generous output budgets of up to 2,048 tokens, and used a dedicated judge model for scoring.
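The simple-versus-complex comparison boils down to grouping judged responses by complexity label and computing a success rate per group. A minimal sketch of that bookkeeping follows; the record layout (a label paired with a boolean verdict) and the function name are assumptions for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict


def asr_by_label(records):
    """Attack success rate per label.

    records: iterable of (label, harmful) pairs, where `harmful` is the
    judge's boolean verdict. This layout is hypothetical.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, harmful in records:
        totals[label] += 1
        hits[label] += int(bool(harmful))
    return {label: hits[label] / totals[label] for label in totals}


# Toy data, not results from the paper.
toy = [("simple", True), ("simple", False), ("complex", True), ("complex", True)]
rates = asr_by_label(toy)
gap = rates["complex"] - rates["simple"]
```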
There are limits. Most tests cover English, single-turn chats and open-weight models, with only light checks on proprietary APIs. Multi-turn dynamics, multilingual targets and evolving jailbreak inventories are still unknowns. But the signal is clear: adaptive, bandit-style prompt selection plus automated query enhancement is a low-effort path to high-success jailbreaks. Expect attackers to industrialise this.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries
🔍 ShortSpan Analysis of the Paper
Problem
The paper examines whether a non-expert adversary can reliably elicit harmful, actionable responses from modern open-weight large language models (LLMs). It frames the threat as requiring two ingredients: an effective jailbreak (a prompt transformation that increases harmful outputs) and an effective malicious query. The work assesses how easily an "average Jane" can discover and apply both at scale and whether automated strategies reduce the technical burden of successful attacks.
Approach
The authors cast jailbreak selection as a sequential multi-armed bandit problem and test both transfer and continual attack scenarios. They run partial-information bandit algorithms (EXP3, Thompson Sampling, LinUCB, LinearCB and SquareCB) alongside the full-information RWM algorithm to learn distributions over 70 jailbreaks using an exploration set, and then apply the learnt policy to an exploitation set. They also curate FrankensteinBench, a safety benchmark of 11,279 malicious queries drawn from seven existing datasets, automatically enhanced and labelled as simple or complex using an ensemble complexity classifier (best ensemble accuracy 89.17%). The experiments span 15 open-weight target models (0.27B–120B parameters), judge responses with a dedicated rater model, and evaluate roughly 12 million query–response pairs with up to 2,048 tokens of output per response.
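For readers unfamiliar with EXP3, the sketch below shows the textbook form of the algorithm: it keeps one weight per arm, mixes the weight-based distribution with uniform exploration, and updates only the pulled arm using an importance-weighted reward. It is a generic illustration on simulated rewards, not the paper's code. `reward_fn`, its `(arm, rng)` signature and the `gamma` default are assumptions made here.

```python
import math
import random


def exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Textbook EXP3 on a caller-supplied reward function.

    reward_fn(arm, rng) must return a reward in [0, 1]; it is a
    hypothetical stand-in for a rater's score.
    Returns the final arm distribution and the per-arm pull counts.
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    pulls = [0] * n_arms

    def distribution():
        total = sum(weights)
        # Mix the weight-based distribution with uniform exploration.
        return [(1 - gamma) * w / total + gamma / n_arms for w in weights]

    for _ in range(n_rounds):
        probs = distribution()
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, rng)
        # Importance weighting keeps the reward estimate unbiased even
        # though only the pulled arm's reward is observed.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so the weights cannot overflow; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls[arm] += 1
    return distribution(), pulls
```

Unlike Thompson Sampling, EXP3 makes no assumption that rewards come from a fixed distribution, which is why it is usually described as an adversarial bandit algorithm.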
Key Findings
- Bandit-based selection consistently outperforms naive strategies: Thompson Sampling and EXP3 beat a uniform-priors attacker and can identify non-obvious, model-specific jailbreaks.
- Extremely high attack success rates (ASR): with multiple sampled jailbreaks per query (k=5), average ASR across models reaches as high as 97% in some settings; single-pass bandit gains over uniform priors were reported as ~27–37%.
- Complex queries increase vulnerability: baseline ASR (no jailbreak) is ~44% across the test set; complex queries show ~50% baseline ASR versus ~39% for simple queries, and adding complexity can raise ASR by up to 26% for some jailbreaks; average complex−simple ASR gap across jailbreaks is ~9%.
- Continual attacks (online updating during exploitation) provide modest additional gains (~5–6% ASR) over transfer attacks where the policy is fixed during exploitation.
- Transferability is practical: exploration on different domains or on proxy models yields similar performance (domain-holdout differences ~6%; proxy models often transfer well, e.g., smaller Gemma variants transferring with a drop of only a few percentage points).
- Robustness to poor jailbreak sets: bandit algorithms can still achieve ASR ≥50% even when the top-performing jailbreaks are removed, whereas uniform sampling degrades substantially.
- Budget-constrained brute force can fail under adversarial orderings; bandit regret guarantees make the learning approach more reliable.
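The last finding, on brute force under a budget, can be illustrated with a deliberately simple deterministic model: arms are tried in a fixed order until one works or the budget runs out. The function name, the notion of a fixed `effective` set and the numbers are all hypothetical simplifications, not the paper's experimental setup.

```python
def brute_force(order, effective, budget):
    """Try arms in a fixed order under an attempt budget.

    order: sequence of arm indices in the order they are tried.
    effective: set of arm indices assumed to work (hypothetical).
    Returns the number of attempts used, or None if the budget ran out.
    """
    for attempts, arm in enumerate(order[:budget], start=1):
        if arm in effective:
            return attempts
    return None


# Same arms, same budget, different orderings.
unlucky = brute_force([0, 1, 2, 3, 4], effective={4}, budget=3)  # None
lucky = brute_force([4, 3, 2, 1, 0], effective={4}, budget=3)    # 1
```

A fixed sweep is only as good as its ordering, whereas a randomised bandit such as EXP3 reallocates its attempts based on feedback, which is the intuition behind the regret guarantees mentioned above.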
Limitations
Experiments mostly cover open-weight models, English-language queries and single-turn interactions; proprietary APIs were only lightly tested. Multi-turn attacks, multilingual settings and evolving jailbreak inventories were not fully explored. Some evaluations (e.g., judge selection, complexity labelling) rely on specific LLMs and manual curation; dataset access is controlled to limit misuse.
Implications
Offensive security: an automated adversary with modest resources can discover highly effective jailbreaks and enhance base queries to substantially increase harmful outputs. Bandit-based optimisation reduces probing cost, transfers across domains and models, and remains effective when many known jailbreaks are absent, making adaptive, low-effort attacks feasible in the wild. Defenders should assume attackers can perform efficient online optimisation of prompts and that complexity-augmentation of queries amplifies risk.