New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email

// Analysis

Jailbreak Foundry turns papers into runnable LLM attacks

Published: Mon, Mar 02, 2026 • By Rowan Vale

Pentesting

Jailbreak Foundry turns papers into runnable LLM attacks

Jailbreak Foundry (JBF) converts jailbreak papers into runnable modules and scores them in a fixed harness. Across 30 attacks, it reproduces reported results closely (+0.26 pp ASR on average), ships in under an hour for most, and standardises AdvBench tests across 10 models with a consistent GPT-4o judge. It also cuts bespoke code and surfaces model-specific weak spots.

Jailbreak research on Large Language Models (LLMs) moves faster than most evaluation setups. Teams end up comparing different datasets, prompts and judges, then call it progress. The result is noise. Jailbreak Foundry (JBF) aims to clean this up by turning papers into runnable attack modules and evaluating them in one consistent harness.

What it does

JBF is a three-part system. JBF-LIB defines a stable module contract and bundles the common runtime bits most jailbreaks need. JBF-FORGE is a multi-agent pipeline that reads a paper, plans an implementation, writes the code and statically audits it for fidelity. JBF-EVAL is the standard runner that fixes the dataset, execution protocol and judging so results are comparable. In the default setup, it uses AdvBench with a GPT-4o judge.

The key claim is fidelity and speed. Across 30 reproduced attacks, the mean difference between reproduced and reported Attack Success Rate (ASR) is +0.26 percentage points, with a range from −16.0 to +20.0. Synthesis is quick: mean 28.2 minutes, median 25.0, and 82% of runs finish within an hour.

The engineering hygiene is solid too. For 22 attacks where official code existed, integrating into JBF cut total lines of code from 22,714 to 9,549, a compression ratio of 0.42. On average, 82.5% of the resulting codebase is shared framework infrastructure and only 17.5% is attack-specific. That is the boring, necessary scaffolding most teams end up rebuilding anyway.

Reference repositories matter. For a sample of five methods, using the authors' runnable repos increased mean ASR from 66.5% to 86.3%. That suggests a lot of the real protocol sits in defaults and glue code, not just the headline technique.

Where it helps

Standardised evaluation exposes how attacks and models interact. Under a single AdvBench harness across 10 victim models with a consistent GPT-4o judge, many reproduced attacks remain potent: at moderate thresholds, 20 of 30 reach at least 50% ASR on six or more models, and several clear 90% on multiple targets. Results vary by mechanism and model. GPT-5.1 averages 29.5% ASR but ranges from 0 to 94%, while GPT-3.5-Turbo, GPT-4o and Qwen3-14B show higher mean ASR. Search-heavy methods with the victim in the loop and attacks wrapped in formal carriers can be especially effective on some models.

If you run an internal red team, this gives you a living benchmark rather than a one-off bake-off. The pattern is straightforward: fix your dataset and judging protocol, treat jailbreaks as modules behind a stable contract, and prefer authors' runnable repos when reproducing. You get reproducibility, less bespoke code to maintain and clearer deltas as models change.

There are limits. The auditor is static and does not execute code, so provider-specific defaults and protocol edge cases can slip through. The system hews tightly to the plan and contract, which can leave underspecified details unresolved. Some reproductions undershoot reported ASR, with a few notable outliers. Today, coverage is restricted to papers that report on selected benchmarks, and the pipeline leans on reference repos to patch gaps. The authors also flag dual-use risk, as easier attack integration lowers the barrier for misuse.

Still, this is the kind of infrastructure that makes security work scale. It is not a silver bullet for model safety, but it is a clean, auditable way to keep jailbreak testing current and comparable. If you want to know whether a new defence actually moves the needle, putting it through a harness like this is a sensible start.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Authors: Zhicheng Fang, Jingjie Zheng, Chenxu Fu, and Wei Xu

Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced-reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.

🔍 ShortSpan Analysis of the Paper

Problem

The paper addresses a practical gap in evaluation of jailbreak attacks on large language models: jailbreak techniques evolve far faster than benchmarks, yielding stale and incomparable robustness estimates across papers. Manual integration of new attacks into evaluation frameworks is slow, error prone and dependent on individual engineers, which prevents timely, apples-to-apples comparisons and longitudinal tracking of model security.

Approach

The authors present JAILBREAK FOUNDRY (JBF), a system that automatically converts jailbreak research papers into runnable attack modules and evaluates them under a unified harness. JBF has three components: JBF-LIB, a shared library that defines a stable module contract and reusable runtime utilities; JBF-FORGE, a multi-agent planner–coder–auditor pipeline that translates a paper into a JBF-LIB-compatible module; and JBF-EVAL, a standardised evaluation runner that fixes datasets, execution protocols and judging. The FORGE pipeline uses three agent roles: a planner that extracts a structured specification from the paper, a coder that implements the module according to the contract, and an auditor that performs line-referenced static fidelity checks. When reproductions undershoot reported results substantially, an enhanced refinement pass performs read-only gap analysis and scoped patches using a longer-running agent. JBF-EVAL runs reproduced attacks on a common dataset and judge (AdvBench with a GPT-4o judge by default) across multiple victim models and records standard artefacts and metrics.

Key Findings

High-fidelity reproduction: Across 30 reproduced attacks, JBF achieves a mean reproduced minus reported attack success rate deviation of +0.26 percentage points, with an overall range from −16.0 to +20.0 percentage points.
Efficiency: FORGE produces benchmark-ready modules quickly—mean synthesis time 28.2 minutes and median 25.0 minutes, with 82% of runs finishing within 60 minutes.
Engineering compression and reuse: Integrating 22 attacks with official code reduced total lines of code from 22,714 to 9,549, a compression ratio of 0.42, and on average 82.5% of the integrated codebase derives from shared framework infrastructure while 17.5% is attack-specific.
Value of reference repositories: For a sample of five methods, including scaffold-heavy ones, using the authors' runnable repositories raised mean ASR from 66.5% to 86.3%, indicating repositories clarify implicit defaults and scaffolding.
Robust attacks transfer broadly: Under a standardised AdvBench harness across 10 victim models, many reproduced attacks retain high effectiveness. At moderate thresholds, 20 of 30 attacks achieve at least 50% ASR on six or more victims; several attacks exceed 90% ASR on multiple victims.
Attack–model interaction effects: Robustness is highly mechanism- and model-dependent. For example, GPT-5.1 has mean ASR 29.5% but ranges 0 to 94%, while GPT-3.5-Turbo, GPT-4o and Qwen3-14B show much higher mean ASR. Search strategies and carrier formats also vary in effectiveness by victim: victim-in-the-loop search and formal wrappers can be especially effective on some models.

Limitations

Auditing is static and does not execute code; the auditor enforces strict plan-and-contract precedence which can leave protocol or provider-specific defaults unresolved. Some reproductions undershot reported ASR, concentrated in a few outliers. JBF currently restricts reproductions to papers reporting on selected benchmarks and relies on reference repositories for resolving underspecification. The system is dual use and the authors note risks from lowering the operational barrier to run jailbreaks.

Why It Matters

JBF demonstrates automated, auditable translation of jailbreak research into runnable tests, enabling living benchmarks that keep pace with attack innovation. Standardised modules and a common evaluation harness permit direct cross-paper and cross-model comparisons, reveal mechanism-specific blind spots in deployed models and reduce engineering overhead for security assessments. The authors recommend careful governance and responsible release to mitigate dual-use risks.

Links Original paper on arXiv

Jailbreak Foundry turns papers into runnable LLM attacks

What it does

Where it helps

📋 Original Paper Title and Abstract

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Why It Matters

Related Articles

PixJail turns papers into pipelines for T2I jailbreaks

EvoJail automates long-tail LLM jailbreaks at scale

Bandits pick winning LLM jailbreaks at scale

Related Research

Get the weekly digest