DAPRO targets rare LLM jailbreaks with smart budgeting

Pentesting
Published: Fri, May 08, 2026 • By Natalie Kestrel
New research proposes DAPRO, a dynamic budget allocator for multi‑turn Large Language Model (LLM) testing. It prioritises conversations likely to yield jailbreaks or toxic outputs and gives calibrated lower bounds on time‑to‑unsafe. Experiments on Llama 3.1 and Qwen 2.5 show tighter, lower‑variance coverage than static baselines under tight compute budgets.

Multi-turn red teaming against Large Language Models (LLMs) is a slog. The nasty stuff often appears only after dozens of volleys, so fixed per-chat budgets either miss the event or waste tokens on dead ends. A new paper introduces DAPRO, a dynamic allocator that learns where to dig and when to walk away, with actual statistical guarantees rather than vibes.

DAPRO runs in two phases. First it fully plays out a subset of conversations, then learns a mapping from a cheap per-step score to a continuation probability. In phase two it applies that policy to the rest: promising chats get more turns, weak ones get cut. Crucially, it keeps estimates unbiased by inverse-probability weighting the censored trajectories, and it wraps the whole thing in conformal survival calibration to produce lower predictive bounds on time-to-unsafe, or upper bounds on time-to-success.
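The unbiasedness claim is easy to see in miniature. The sketch below is illustrative, not the paper's code: each conversation is kept with a known probability, and the observed outcomes are up-weighted by its inverse, so the estimator's expectation matches the uncensored rate.

```python
import random

def ipw_jailbreak_rate(samples):
    """Unbiased jailbreak-rate estimate under stochastic early stopping.

    samples: list of (observed, jailbroke, keep_prob) tuples, where
    keep_prob is the probability the allocator let this conversation run
    long enough to observe the outcome. Censored runs contribute 0;
    observed runs are up-weighted by 1/keep_prob, which makes the
    estimator unbiased: E[1{observed} * y / p] = y.
    """
    n = len(samples)
    return sum(y / p for observed, y, p in samples if observed) / n

# Monte-Carlo check with an invented true rate of 0.30 and a flat 50%
# chance of cutting each conversation short.
random.seed(0)
samples = []
for _ in range(200_000):
    y = 1 if random.random() < 0.3 else 0
    p = 0.5
    samples.append((random.random() < p, y, p))

est = ipw_jailbreak_rate(samples)
print(est)  # close to 0.30 despite half the runs being censored
```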

The theory matters. Unlike prior conformal survival work, DAPRO's coverage does not assume the censoring process is independent of event times. The new bound scales with the square root of the mean inverse-censoring weight rather than the worst case, which translates to tighter, lower-variance guarantees when you allocate sensibly. In tests across jailbreaks, toxicity, retrieval-augmented generation (RAG) hallucinations and agent tasks using models like Llama 3.1 and Qwen 2.5, DAPRO hit coverage closer to the target with less variance than static baselines, while respecting a global budget. Typical runs used small per-sample budgets relative to long horizons.
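To see why mean-weight scaling helps, consider a hypothetical allocation where most chats are kept with high probability and a few risky ones are followed rarely. The numbers below are invented for illustration and are not the paper's constants; they only show that the mean of the inverse weights can sit far below the maximum when large weights are rare.

```python
import math

# Hypothetical continuation probabilities: 90 chats kept with prob 0.9,
# 10 risky ones followed with prob 0.1.
keep_probs = [0.9] * 90 + [0.1] * 10

inv_weights = [1.0 / p for p in keep_probs]

# Bound term driven by the mean inverse weight (DAPRO-style scaling)
mean_term = math.sqrt(sum(inv_weights) / len(inv_weights))
# Bound term driven by the worst-case weight (prior-work-style scaling)
worst_term = math.sqrt(max(inv_weights))

print(mean_term)   # ~1.41
print(worst_term)  # ~3.16
```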

What attackers will try

This is a compute triage tool. A red team can score each step with a lightweight predictor, push depth only on high-risk threads, and stop early elsewhere. Because the weighting yields unbiased population metrics, you can still estimate the true jailbreak rate and restricted mean time-to-unsafe under tight budgets. The learned lower bounds also tell you how many turns to keep pressing before you expect payoff.
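In pseudocode terms, the triage loop looks roughly like this. `step_fn`, `score_fn` and `score_to_prob` are hypothetical placeholders for an attack step, a cheap risk predictor and a learnt score-to-probability mapping; none of these names come from the paper.

```python
import random

def run_budgeted_probe(chat, step_fn, score_fn, score_to_prob,
                       max_turns, rng=random):
    """Push depth only while the cheap per-step score keeps the
    continuation probability high; otherwise stop stochastically.

    Returns (turns_used, event_seen, keep_prob) where keep_prob is the
    probability of having survived censoring this far, so the caller
    can inverse-probability-weight the outcome.
    """
    keep_prob = 1.0
    for t in range(1, max_turns + 1):
        reply, event = step_fn(chat, t)
        if event:                        # auditor flagged an unsafe output
            return t, True, keep_prob
        p = score_to_prob(score_fn(chat, reply))
        keep_prob *= p
        if rng.random() >= p:            # stochastic early stop
            return t, False, keep_prob
    return max_turns, False, keep_prob

# Toy run with stub functions: the event fires on turn 3 and the policy
# never stops early (continuation probability fixed at 1.0).
turns, event, w = run_budgeted_probe(
    chat=[],
    step_fn=lambda chat, t: ("reply", t == 3),
    score_fn=lambda chat, reply: 1.0,
    score_to_prob=lambda s: 1.0,
    max_turns=10,
)
print(turns, event)
```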

Where the proof thins

Coverage and budget validity lean on exchangeability between calibration and deployment. You need the same attacker, target and judge distributions, and a reasonable score-to-probability mapping. The finite-sample budget bound is conservative; exact control holds in expectation and is sensitive to projection error and the calibration split. You also need an auditor to label events. None of that kills the result, but it narrows where the guarantees really bite. The overlooked angle: dynamic allocation quietly amplifies late-stage failures that static testers never reach. If your evaluations miss those, the numbers you trust are wrong.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Authors: Shai Feldman and Yaniv Romano
Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. These events might be rare, and under any feasible computational budget, remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations to trigger the event of interest, but rely on static budget allocation that is inefficient in multi-turn setups. To address this, we introduce Dynamic Allocation via PRojected Optimization (DAPRO), the first theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, yielding provably tighter guarantees than prior work. Furthermore, DAPRO can be employed to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments across agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations using LLMs such as Llama 3.1 and Qwen 2.5 demonstrate that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies how to evaluate and predict rare, safety‑critical events that occur only after repeated multi‑turn interactions with large language models (LLMs), such as jailbreaks, toxic outputs or task success. These events can be costly to detect because exhaustive multi‑turn sampling is computationally expensive and a fixed total budget means many events remain unobserved. Existing conformal survival approaches produce calibrated lower predictive bounds on time‑to‑event but use static, pre‑allocated per‑conversation budgets that waste resources and inflate variance in multi‑turn settings.

Approach

The authors introduce Dynamic Allocation via PRojected Optimization (DAPRO), a two‑phase adaptive budget allocation framework. Calibration prompts are split into two sets: Phase I fully observes a subset to learn an optimal score‑to‑continuation‑probability mapping that minimises variance under a global budget, then Phase II deploys the learnt policy to stochastically continue or stop conversations on the remaining calibration points. The method computes inverse‑probability weights for censored samples and uses conformalised survival calibration to produce lower predictive bounds (LPBs) for time‑to‑unsafe; symmetrically it can produce upper predictive bounds for time‑to‑success. DAPRO includes optimisation of per‑step continuation probabilities via projected coordinate descent, a score function derived from a trained predictive model, and projection models (e.g., Platt scaling) to map scores to probabilities. Theoretical analysis gives distribution‑free finite‑sample PAC coverage guarantees and budget validity in expectation, plus conservative finite‑sample budget bounds under additional assumptions. Empirical evaluation uses multiple datasets (toxicity, red‑team jailbreaks, RAG hallucinations, AutoIF) and LLMs including Qwen 2.5 and Llama 3.1 with t_max = 200 and typical per‑sample budgets of 10–30.
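A minimal stand-in for the Platt-scaling projection step might look like the sketch below: a sigmoid fit by gradient descent on log loss that maps a raw score to a continuation probability. The fitting procedure, data and learning rate here are invented for illustration; the paper's actual projection models may differ.

```python
import math

def platt_fit(scores, labels, lr=0.1, steps=2000):
    """Fit a Platt-style sigmoid p = sigmoid(a*s + b) by full-batch
    gradient descent on log loss, returning the fitted mapping."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n   # gradient of log loss w.r.t. a
            gb += (p - y) / n       # gradient of log loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

# Toy calibration data: conversations with higher risk scores were the
# ones worth continuing (label 1).
scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
to_prob = platt_fit(scores, labels)
print(to_prob(0.9) > to_prob(0.1))  # higher score, higher continuation prob
```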

Key Findings

  • DAPRO obtains finite‑sample, distribution‑free coverage guarantees without requiring conditional independence between censoring and event times, unlike prior conformal survival methods.
  • The authors derive a novel coverage bound that scales with the square root of the mean inverse‑censoring weight rather than the worst‑case weight, yielding provably tighter guarantees and motivating mean‑weight minimisation.
  • In experiments across safety and utility tasks, DAPRO achieves empirical coverage closer to the nominal target with lower variance than static allocation baselines while satisfying the global budget constraint.
  • DAPRO also yields unbiased, lower‑variance population‑level estimates (e.g., unsafe event rate, restricted mean time‑to‑unsafe) by inverse‑probability reweighting, recovering oracle metrics with reduced variance under limited compute.
  • Practical deployment: Phase I learning improves sample efficiency but consumes budget; budget validity guarantees rely on accuracy of the learnt score‑to‑probability mapping and on mild independence assumptions for theoretical bounds.
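The calibration behind the lower predictive bounds can be sketched as a weighted quantile over nonconformity scores. The score definition, inputs and function names below are simplified illustrations, not the paper's exact construction: predicted minus observed event times are calibrated against inverse-probability weights, and the quantile is subtracted from a new prediction.

```python
def weighted_quantile(values, weights, q):
    """Smallest value at which the normalised cumulative weight reaches q."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= q * total:
            return v
    return pairs[-1][0]

def conformal_lpb(pred_times, event_times, weights, new_pred, alpha=0.1):
    """Split-conformal-style lower predictive bound on time-to-event.

    Nonconformity score: predicted minus observed event time. Subtracting
    its weighted (1 - alpha)-quantile from a new prediction yields a bound
    intended to hold with probability about 1 - alpha. Hypothetical
    simplification of the paper's weighted calibration.
    """
    scores = [p - t for p, t in zip(pred_times, event_times)]
    q = weighted_quantile(scores, weights, 1 - alpha)
    return new_pred - q

# Invented calibration data: four conversations, one censored-and-reweighted
# (weight 2.0), and a new conversation predicted to go unsafe at turn 28.
lpb = conformal_lpb(
    pred_times=[20, 30, 40, 25],
    event_times=[18, 35, 33, 24],
    weights=[1.0, 2.0, 1.0, 1.0],
    new_pred=28,
    alpha=0.25,
)
print(lpb)  # 26
```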

Limitations

The theoretical guarantees require exchangeability between calibration and test distributions and either accurate score‑to‑probability mappings or additional assumptions to ensure exact budget control. The finite‑sample budget bound is conservative and depends on estimation error and calibration split size; expected budget validity assumes projection errors are small or unbiased. The method needs a predictive model and an auditor to label events and assumes the same attacker, target and judge during calibration and deployment.

Implications

Offensively, the framework could be repurposed to discover jailbreak strategies or other failure modes more efficiently under limited query budgets, enabling adversaries to focus interaction effort on prompts most likely to elicit unsafe outputs and to estimate population‑level vulnerability rates with fewer trials. The authors note this dual‑use risk and recommend deploying DAPRO as one component of a broader safety and governance pipeline.

