
THRD defends LLMs against multi-turn jailbreaks

Published: Tue, Jun 02, 2026 • By Rowan Vale
New work on THRD tackles the jailbreaks that creep in over several turns instead of a single prompt. It tracks risk across the whole dialogue, stays training-free, and cuts attack success to 0.2–4.0% in tests while barely denting utility. It’s slower and threshold-sensitive, but the temporal modelling pays off.

Single-turn safety checks look fine in a demo, then fold when an attacker plays the long game. Multi-turn jailbreaks nudge a Large Language Model (LLM) bit by bit: harmless setup, plausible context, then the real ask. THRD takes that seriously and treats the whole chat as the attack surface.

How THRD works

It is an inference-time, training-free wrapper with four jobs. A Turn-level Risk Assessor scores the current message. A Historical Context Analyzer watches for intent that escalates across turns. A Response Evaluator flags the model’s own outputs if they help the user get to harm. A Decision Module rolls these into a time-evolving score with two tricks: attenuation that dials how much history should count given current severity, and a trend-aware boost when risk keeps rising. Once a refusal triggers, it persists for the rest of the exchange.
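To make the wiring concrete, here is a minimal sketch of such a wrapper. Every name in it (GuardedChat, turn_risk, history_risk, response_risk, decide, REFUSAL) is hypothetical, the scorers are plain callables you would supply, and it assumes the draft reply is checked before release. It shows the control flow described above, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

REFUSAL = "I can't help with that."  # placeholder refusal text (assumption)


@dataclass
class GuardedChat:
    """Hypothetical inference-time wrapper; names are illustrative only."""
    model: Callable[[List[str]], str]               # target LLM: history -> reply
    turn_risk: Callable[[str], float]               # stand-in for the Turn-level Risk Assessor
    history_risk: Callable[[List[str]], float]      # stand-in for the Historical Context Analyzer
    response_risk: Callable[[str], float]           # stand-in for the Response Evaluator
    decide: Callable[[float, float, float], bool]   # stand-in for the Decision Module
    history: List[str] = field(default_factory=list)
    locked: bool = False                            # persistent-refusal flag

    def send(self, user_msg: str) -> str:
        # Once a refusal has fired, every later turn is refused.
        if self.locked:
            return REFUSAL
        self.history.append(user_msg)
        draft = self.model(self.history)
        block = self.decide(
            self.turn_risk(user_msg),
            self.history_risk(self.history),
            self.response_risk(draft),
        )
        if block:
            self.locked = True
            return REFUSAL
        # Only released replies become part of the conversation history.
        self.history.append(draft)
        return draft
```

The locked flag is the whole of persistent refusal in this sketch: it is set once and never cleared, which is why a harmless follow-up after a block still gets refused.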

Against two target models and modern multi-turn attacks, the numbers hold up. On a multi-agent setup (X-Teaming) and a tree-search method (Tempest), attack success drops to between 0.2% and 4.0%. Utility loss is small, within about 1.5% on MMLU and GSM8K.

The temporal bit is not hand-wavy either: over 70% of first detections happen at Turn 2 or later, and shuffling the history raises attack success, which means ordering matters and the history analyser is doing real work. Ablations back that up: remove the turn scorer or the history module and you pay in higher success rates; remove the response check and you leak facilitative content; disable persistent rejection and attackers come back with probing follow-ups. Several aggregation choices still keep absolute success low, with a linear decay plus L2-norm scoring looking the most stable.

You do pay latency, mostly in the history analysis, with per-interaction totals around 15–22 seconds. Over-refusal is sensitive to thresholds; settings like tau_low=2.0 and tau_high=3.5 looked like a practical balance in their tests.
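A rough sketch of what a linear decay plus L2-norm score with two thresholds could look like follows. This summary does not give the paper's exact formula or say how the two thresholds are used, so the decay step, the per-turn risk scale, and the three-band reading (allow, caution, refuse) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

# Threshold values quoted in the article; their exact role is assumed here.
TAU_LOW, TAU_HIGH = 2.0, 3.5


def linear_decay_weights(n: int, step: float = 0.25) -> List[float]:
    """Weight 1.0 for the newest turn, minus `step` per turn of age, floored at 0.

    `step` is an illustrative parameter, not a value from the paper.
    """
    return [max(0.0, 1.0 - step * (n - 1 - i)) for i in range(n)]


def l2_score(risks: Sequence[float], step: float = 0.25) -> float:
    """L2 norm of the decay-weighted per-turn risk scores (oldest first)."""
    weights = linear_decay_weights(len(risks), step)
    return math.sqrt(sum((w * r) ** 2 for w, r in zip(weights, risks)))


def band(score: float) -> str:
    """Assumed three-way decision: below tau_low allow, at or above tau_high refuse."""
    if score >= TAU_HIGH:
        return "refuse"
    if score >= TAU_LOW:
        return "caution"
    return "allow"
```

With these toy settings, per-turn risks of [1, 1, 2, 3] score about 3.40 and land in the middle band, while [1, 2, 3, 3] scores about 3.89 and crosses tau_high. Nudging either threshold by a few tenths flips the first case, which is the kind of sensitivity the over-refusal result points at.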

Attacker’s angle

If you jailbreak for a living, this paper confirms what works: hide intent early, spread it across turns, and use coordination or search to steer the chat’s trajectory. Outputs that “just” scaffold the next step are gold. THRD tries to kill that by aggregating weak signals and remembering. The weaknesses match the design: it relies on prompt-based semantic judgements, so you can optimise directly against its prompts and scoring rules; stretch the attack over longer horizons to dilute any turn’s contribution; keep probing after a refusal to test if persistence is enforced. The framework is lightweight by design, not a learned sequence model, so very long, ultra-gradual setups are the obvious next test. The results are strong on two models; the open question is how this holds under adaptive optimisation and at scale where latency budgets are tight.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

Authors: Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, and Pengyuan Liu
Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining (often degrading model utility) or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks (including tree-search-based and multi-agent collaborative methods) across two target models show that THRD reduces ASR to 0.2–4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies multi-turn jailbreak attacks that exploit conversational dynamics such as gradual escalation and cross-turn coordination to bypass safety measures in large language models. It argues that existing defences either require costly retraining that can degrade utility or analyse each turn in isolation, missing how risk accumulates over a dialogue. Because dialogue history reshapes the model's conditioning context, safety decisions based on single-turn signals are often insufficient, making trajectory-aware inference-time defences necessary.

Approach

THRD is a training-free, modular inference-time framework that explicitly models temporal risk accumulation. It combines four components: a Turn-level Risk Assessor (TRA) that assigns an ordinal instant risk score for the current turn, a Historical Context Analyzer (HCA) that detects cross-turn intent escalation and coordination, a Response Evaluator (RE) that flags generated outputs that may facilitate downstream harm, and a Decision Module that aggregates these signals into a time-evolving risk score. Aggregation uses an attenuation function to modulate historical influence based on current-turn severity and a trend-aware adjustment that increases scores when risk rises monotonically. Once a refusal is triggered the framework applies persistent rejection to subsequent turns. THRD is evaluated on two target models served locally and tested against multi-agent (X-Teaming) and tree-search (Tempest) multi-turn attacks plus a single-turn baseline (AutoDAN). Metrics include attack success rate, task utility (MMLU and GSM8K), and over-refusal rate.
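The two adjustments in the Decision Module can be illustrated with a small sketch. The paper's actual functions are not reproduced in this analysis, so everything here is an assumption: the 0 to 5 risk scale (max_risk), the direction of attenuation (history counts less when the current turn is already severe), the size of trend_bonus, and the helper names rising_streak and evolving_score.

```python
from typing import Sequence


def rising_streak(risks: Sequence[float]) -> int:
    """Count consecutive strict increases ending at the latest turn."""
    streak = 0
    # Walk backwards over adjacent (previous, current) pairs.
    for prev, cur in zip(reversed(risks[:-1]), reversed(risks[1:])):
        if cur > prev:
            streak += 1
        else:
            break
    return streak


def evolving_score(
    turn_risks: Sequence[float],   # per-turn risk scores, oldest first
    history_risk: float,           # cross-turn escalation signal
    max_risk: float = 5.0,         # assumed top of the ordinal scale
    trend_bonus: float = 0.5,      # assumed boost per rising step
) -> float:
    """Illustrative time-evolving score: current risk + attenuated history + trend boost."""
    current = turn_risks[-1]
    # Assumed attenuation: a severe current turn already dominates,
    # so the historical signal is weighted down as severity grows.
    history_weight = 1.0 - current / max_risk
    boost = trend_bonus * rising_streak(turn_risks)
    return current + history_weight * history_risk + boost
```

In this toy version, a conversation whose per-turn risk climbs 1, 2, 3 scores higher than one that sits flat at 3, 3, 3 with the same history signal, which is the behaviour a trend-aware adjustment is meant to produce.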

Key Findings

  • THRD reduces attack success rates (ASR) to between 0.2% and 4.0% across the two target models and both multi-turn attack types while preserving model utility within about 1.5% degradation on MMLU and GSM8K.
  • Ablation shows non-redundant contributions: removing TRA or HCA produces the largest ASR increases, removing RE also degrades performance substantially, and removing persistent rejection increases ASR notably, confirming attackers attempt evasive follow-ups.
  • Temporal signals matter: more than 70% of multi-turn attacks are first detected at Turn 2 or later, validating the need for cross-turn aggregation; shuffling history raises ASR, indicating HCA exploits turn ordering.
  • Decision formula and aggregation choices are robust: several attenuation and aggregation variants keep absolute ASR below 2%, with a linear decay plus L2-norm pairing yielding the most stable cross-architecture results.
  • Operational trade-offs: HCA dominates latency, and THRD's total of 15–22 s per interaction makes it slower than some single-turn defences; over-refusal is sensitive to decision thresholds, with tau_low=2.0 and tau_high=3.5 offering a practical balance.
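For readers who want to reproduce this style of measurement on their own logs, the two headline metrics above are simple to compute. The sketch below assumes a hypothetical AttackLog record per attack conversation; the field names are illustrative and do not come from the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AttackLog:
    """One attack conversation (hypothetical record layout)."""
    succeeded: bool                     # attacker obtained the harmful output
    first_refusal_turn: Optional[int]   # 1-based turn of first refusal, None if never refused


def attack_success_rate(logs: Sequence[AttackLog]) -> float:
    """Fraction of attack conversations that succeeded."""
    return sum(log.succeeded for log in logs) / len(logs)


def late_detection_share(logs: Sequence[AttackLog], min_turn: int = 2) -> float:
    """Among detected attacks, the fraction first refused at `min_turn` or later."""
    detected = [log.first_refusal_turn for log in logs
                if log.first_refusal_turn is not None]
    if not detected:
        return 0.0
    return sum(turn >= min_turn for turn in detected) / len(detected)
```

Running the same two functions on logs produced with shuffled history versus original ordering would give a quick check of whether turn order matters for a given defence.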

Limitations

THRD relies on prompt-based semantic judgements and the reasoning ability of the evaluator model, so attackers optimised against those prompts or scoring dynamics could reduce effectiveness. The framework models temporal dynamics via lightweight score aggregation rather than learned sequential models, which may limit detection of attacks that distribute intent across much longer trajectories. Persistent rejection is intentionally irreversible and may be too conservative for general-purpose deployments. Multi-module analysis over full conversation history creates runtime overhead, with HCA as the primary bottleneck.

Implications

For attackers, the paper points to hiding malicious intent in early turns and using multi-agent coordination or tree-search to craft cross-turn escalation that evades single-turn defences. Attackers can attempt evasive follow-ups after partial rejections and may adapt to THRD's prompting and scoring rules. More gradual, longer-horizon attack trajectories and prompt-targeted optimisation are plausible avenues for adversarial adaptation against THRD-style defences.

