MUSE exposes and hardens multi-turn LLM jailbreaks
Pentesting
Lede: A recent study introduces MUSE, a dual-purpose framework that systematically finds multi-turn jailbreaks in Large Language Models (LLMs) and then reduces those vulnerabilities by aligning model responses at the turn level. This matters because many deployed chat systems face attacks that build up across several exchanges, not just a single malicious prompt.
Define terms: A Large Language Model (LLM) is a statistical model trained on large volumes of text to generate plausible language. Monte Carlo Tree Search (MCTS) is a decision procedure that explores possible action sequences by sampling outcomes and favouring promising branches. Direct Preference Optimisation (DPO) fine-tunes a model directly on pairs of preferred and rejected responses, without training a separate reward model.
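To make the tree-search idea concrete, here is a minimal, generic sketch of the standard UCT selection rule an MCTS implementation typically uses to favour promising branches. It is an illustration only, not code from the paper, and the node dictionary layout is an assumption.

```python
import math

def uct_select(children, exploration=1.4):
    """Pick the child that balances average reward (exploitation)
    against how rarely it has been tried (exploration), per the UCT rule."""
    total_visits = sum(child["visits"] for child in children)

    def uct_value(child):
        if child["visits"] == 0:
            return float("inf")  # always try unvisited branches first
        exploit = child["reward"] / child["visits"]
        explore = exploration * math.sqrt(math.log(total_visits) / child["visits"])
        return exploit + explore

    return max(children, key=uct_value)
```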
Scope and stakes: The paper evaluates MUSE on a range of open and commercial models, including Llama 3 variants, Qwen, GPT-4o and Claude 3.5, and shows that multi-turn context can meaningfully increase jailbreak success. For practitioners, that means existing single-turn defences are necessary but not sufficient for chat deployments.
How it works: MUSE has two parts. MUSE-A is an automated red team that builds an action space from three frame-semantics strategies and uses MCTS to generate diverse, contextually plausible attack trajectories. A judge model scores responses on a 1-to-10 scale; a trajectory whose score drops below five is treated as a successful jailbreak. MUSE-D collects high-risk turns from those trajectories and fine-tunes the target model with turn-level DPO (the paper reports beta = 0.4 over three epochs) to reduce vulnerability.
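A minimal sketch of how such an attack loop could be wired together, assuming hypothetical `propose_next_turn`, `target_model` and `judge_score` helpers (none of these names come from the paper); it illustrates the turn budget and the judge threshold described above.

```python
# Illustrative sketch of a MUSE-A-style multi-turn attack loop, not the authors' code.
# `propose_next_turn`, `target_model` and `judge_score` are hypothetical stand-ins
# for the paper's MCTS attack policy, victim model and judge model.

MAX_TURNS = 5            # the paper reports effects stabilising around five turns
JAILBREAK_THRESHOLD = 5  # judge scores below this count as a successful jailbreak

def run_attack(goal, propose_next_turn, target_model, judge_score):
    dialogue = []
    for _ in range(MAX_TURNS):
        # The search expands expansion / decomposition / redirection actions
        # and returns the most promising contextually coherent next query.
        query = propose_next_turn(goal, dialogue)
        reply = target_model(dialogue + [("user", query)])
        dialogue += [("user", query), ("assistant", reply)]
        if judge_score(goal, reply) < JAILBREAK_THRESHOLD:
            return dialogue, True   # high-risk trajectory found
    return dialogue, False
```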
Impact and risk: The authors report that MUSE-A outperforms prior multi-turn attack methods and that larger models can be more susceptible to context-driven attacks. MUSE-D reduces attack success rates substantially, with claims of up to a 24 percentage point improvement versus standard DPO, while preserving task performance on standard benchmarks.
Mitigations and next steps: Practitioners can use MUSE-style testing as part of pre-deployment audits. Minimal viable controls include turn-level safety checks; better options include incorporating preference data from adversarial trajectories into fine-tuning; best practice would combine adversarial testing with iterative adversarial training and online monitoring.
- Good: impose turn-level safety gates and logging (a minimal sketch follows this list).
- Better: fine-tune with high-risk trajectory data via DPO-style methods.
- Best: deploy continuous adversarial testing, iterative retraining and runtime monitoring.
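As a concrete starting point for the "Good" tier, here is a minimal sketch of a turn-level gate with logging. The `moderation_score` and `generate` callables and the 0.8 threshold are assumptions for illustration, not part of the paper.

```python
# Minimal sketch of a turn-level safety gate with structured logging.
# `moderation_score` and `generate` are hypothetical callables you would
# replace with your own moderation classifier and chat model.

import json
import logging
import time

BLOCK_THRESHOLD = 0.8  # illustrative risk cut-off, tune per deployment

def gated_reply(history, user_turn, generate, moderation_score):
    risk_in = moderation_score(history + [user_turn])
    if risk_in >= BLOCK_THRESHOLD:
        logging.warning(json.dumps({"ts": time.time(), "event": "blocked_input", "risk": risk_in}))
        return "Sorry, I can't help with that."
    reply = generate(history + [user_turn])
    # Score the reply in context: multi-turn jailbreaks often look benign turn by turn.
    risk_out = moderation_score(history + [user_turn, reply])
    if risk_out >= BLOCK_THRESHOLD:
        logging.warning(json.dumps({"ts": time.time(), "event": "blocked_output", "risk": risk_out}))
        return "Sorry, I can't help with that."
    return reply
```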
Limitations and caveats: The defence is static rather than online, relies on a judge model and curated data, and the attack approach could be misused if published without safeguards. The authors note those boundaries and provide redacted artefacts.
Kicker: MUSE does not erase the problem, but it gives security teams a reproducible way to map multi-turn attack paths and to harden models at the conversational turn where many real-world exploits begin.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
🔍 ShortSpan Analysis of the Paper
Problem
Multi-turn jailbreaks in large language models occur when adversaries exploit the dialogue history to induce unsafe content across several turns. The paper argues that most existing defence mechanisms focus on single-turn prompts, while real-world usage involves extended conversations in which context can bypass safety guardrails. It introduces MUSE, a comprehensive red-teaming framework tackling multi-turn jailbreaks from both attack and defence angles. For attacks, MUSE-A uses frame semantics and heuristic tree search to explore diverse semantic trajectories; for defence, MUSE-D provides fine-grained safety alignment that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on multiple models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.
Approach
MUSE comprises two components: MUSE-A for automated multi-turn jailbreak attacks and MUSE-D for turn-level safety alignment. MUSE-A builds an action space from three frame-semantics strategies (expansion, decomposition and redirection) that perturb the malicious target question while preserving contextual coherence, and uses Monte Carlo Tree Search to navigate this space and generate contextually plausible attack queries x_t conditioned on the current dialogue history c_{t-1}. Attacks aim to maximise the probability that the target model outputs unsafe content, while the defender responds with y_t and a safety check y'_t to assess safety. A joint strategy is defined over attacker and defender actions; safety is evaluated by a judge model scoring responses on a 1-to-10 scale, with scores below five treated as successful jailbreaks, and the process iterates up to a maximum number of turns T.

MUSE-D collects a preference dataset from all MUSE-A trajectories, including successful endpoints and high-risk intermediate nodes, unified as preference triples (c̃_t, y_t, y_t^safe), where c̃_t is the dialogue history and q the malicious target query. These data are used to fine-tune the target model with turn-level Direct Preference Optimisation, using beta = 0.4 and training for three epochs; a threshold tau = 5 identifies high-risk nodes. This turn-level alignment improves safety across diverse models and attack types while preserving reasoning ability. The evaluation uses HarmBench and JailbreakBench for attack assessment, BeaverTails for safety-alignment training data, and standard benchmarks such as GSM8K, MMLU, GPQA and MT-Bench for general capability checks. The framework is evaluated on models including Llama 3 variants, Qwen, and commercial models such as GPT-4o and Claude 3.5, with hardware comprising forty NVIDIA A100 GPUs and the vLLM serving framework.
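To ground the turn-level DPO step, the sketch below shows the standard DPO loss and a simple preference-triple builder. The `judge_score` and `safe_rewrite` helpers and the per-turn (context, reply) trajectory layout are assumptions for illustration; only the beta = 0.4 and tau = 5 settings come from the paper.

```python
# Sketch of the turn-level DPO objective over preference triples
# (dialogue context c̃_t, unsafe response y_t, safe response y_t^safe).
# Helper names are hypothetical; beta=0.4 and tau=5 follow the reported settings.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_safe, policy_logp_unsafe,
             ref_logp_safe, ref_logp_unsafe, beta=0.4):
    """Standard DPO loss: push the policy to prefer the safe response
    over the unsafe one, measured relative to a frozen reference model."""
    chosen_margin = policy_logp_safe - ref_logp_safe
    rejected_margin = policy_logp_unsafe - ref_logp_unsafe
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def build_preference_triples(trajectories, judge_score, safe_rewrite, tau=5):
    """Collect high-risk turns (judge score below tau) as
    (dialogue context, unsafe reply, safe reply) training triples."""
    triples = []
    for dialogue in trajectories:        # each trajectory produced by the attack phase
        for context, reply in dialogue:  # assumed (context, reply) pairs per turn
            if judge_score(reply) < tau:
                triples.append((context, reply, safe_rewrite(context)))
    return triples
```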
Key Findings
- MUSE-A consistently outperforms prior multi-turn jailbreak methods across HarmBench and JailbreakBench, nearly doubling the average attack success rate of the best previous approach and showing higher vulnerability in larger models such as Llama 3 70B and GPT-4o.
- Contextual multi-turn attacks are more effective than single-turn prompts; larger models are more susceptible to context-based attacks; and the approach yields large improvements over baselines on strongly aligned commercial models such as GPT-4o and Claude 3.5.
- MUSE-D substantially enhances safety across model scales and architectures, achieving up to a twenty-four percentage point reduction in attack success rate compared with standard Direct Preference Optimisation, while preserving reasoning and task performance on GSM8K, MMLU, GPQA and MT-Bench and showing strong generalisation to cross-scale and cross-architecture settings.
- MUSE-A extends to enhance single-turn attacks by generating multi-turn contexts and concatenating them with harmful queries; in combination with four single-turn attacks this yields around a twenty per cent average gain in attack success rate. Ablation studies show that removing any one of the frame-semantics strategies reduces performance by about three and a half to ten percentage points, indicating the strategies are complementary.
- Ablation results indicate that the topic expansion strategy is particularly effective in HarmBench owing to domain knowledge needs, while target decomposition and scenario redirection perform better on JailbreakBench, illustrating that different components address distinct vulnerability classes.
- When assessing attack efficiency, MUSE-A converges faster and reaches higher final attack success rates than baselines, with a practical maximum dialogue depth of around five turns, at which point the effects stabilise.
Limitations
The defence mechanisms studied are limited in scope and rely on static analysis rather than online learning; the work notes the potential for misuse of attack strategies and mitigates this with redacted prompts and safety warnings. The evaluation depends on a judge model and human annotations; while human agreement is substantial, the approach may still be influenced by the judge's framing. The authors acknowledge the need for online reinforcement learning and iterative adversarial training to adapt to evolving attacks, and suggest integrating such methods in future work.
Why It Matters
The study demonstrates that safety risks in LLMs extend beyond single turns and that a layered approach combining systematic attack exploration with proactive turn-level safety alignment is required to robustly defend conversational systems. The MUSE framework provides a practical methodology to test, quantify and harden multi-turn safety in LLMs, with results broadly generalisable across model sizes and architectures. The work supports safer deployment of conversational AI and highlights the importance of proactive vulnerability discovery, although it does not directly address wider societal issues such as surveillance or workforce impact. The authors also provide code and sanitised prompts to foster community experimentation and defence-oriented progress.