Jailbreak Foundry turns papers into runnable LLM attacks
Pentesting
Jailbreak research on Large Language Models (LLMs) moves faster than most evaluation setups. Teams end up comparing different datasets, prompts and judges, then call it progress. The result is noise. Jailbreak Foundry (JBF) aims to clean this up by turning papers into runnable attack modules and evaluating them in one consistent harness.
What it does
JBF is a three-part system. JBF-LIB defines a stable module contract and bundles the common runtime bits most jailbreaks need. JBF-FORGE is a multi-agent pipeline that reads a paper, plans an implementation, writes the code and statically audits it for fidelity. JBF-EVAL is the standard runner that fixes the dataset, execution protocol and judging so results are comparable. In the default setup, it uses AdvBench with a GPT-4o judge.
The key claim is fidelity and speed. Across 30 reproduced attacks, the mean difference between reproduced and reported Attack Success Rate (ASR) is +0.26 percentage points, with a range from −16.0 to +20.0. Synthesis is quick: mean 28.2 minutes, median 25.0, and 82% of runs finish within an hour.
The engineering hygiene is solid too. For 22 attacks where official code existed, integrating into JBF cut total lines of code from 22,714 to 9,549, a compression ratio of 0.42. On average, 82.5% of the resulting codebase is shared framework infrastructure and only 17.5% is attack-specific. That is the boring, necessary scaffolding most teams end up rebuilding anyway.
Reference repositories matter. For a sample of five methods, using the authors' runnable repos increased mean ASR from 66.5% to 86.3%. That suggests a lot of the real protocol sits in defaults and glue code, not just the headline technique.
Where it helps
Standardised evaluation exposes how attacks and models interact. Under a single AdvBench harness across 10 victim models with a consistent GPT-4o judge, many reproduced attacks remain potent: at moderate thresholds, 20 of 30 reach at least 50% ASR on six or more models, and several clear 90% on multiple targets. Results vary by mechanism and model. GPT-5.1 averages 29.5% ASR but ranges from 0 to 94%, while GPT-3.5-Turbo, GPT-4o and Qwen3-14B show higher mean ASR. Search-heavy methods with the victim in the loop and attacks wrapped in formal carriers can be especially effective on some models.
If you run an internal red team, this gives you a living benchmark rather than a one-off bake-off. The pattern is straightforward: fix your dataset and judging protocol, treat jailbreaks as modules behind a stable contract, and prefer authors' runnable repos when reproducing. You get reproducibility, less bespoke code to maintain and clearer deltas as models change.
There are limits. The auditor is static and does not execute code, so provider-specific defaults and protocol edge cases can slip through. The system hews tightly to the plan and contract, which can leave underspecified details unresolved. Some reproductions undershoot reported ASR, with a few notable outliers. Today, coverage is restricted to papers that report on selected benchmarks, and the pipeline leans on reference repos to patch gaps. The authors also flag dual-use risk, as easier attack integration lowers the barrier for misuse.
Still, this is the kind of infrastructure that makes security work scale. It is not a silver bullet for model safety, but it is a clean, auditable way to keep jailbreak testing current and comparable. If you want to know whether a new defence actually moves the needle, putting it through a harness like this is a sensible start.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
🔍 ShortSpan Analysis of the Paper
Problem
The paper addresses a practical gap in evaluation of jailbreak attacks on large language models: jailbreak techniques evolve far faster than benchmarks, yielding stale and incomparable robustness estimates across papers. Manual integration of new attacks into evaluation frameworks is slow, error prone and dependent on individual engineers, which prevents timely, apples-to-apples comparisons and longitudinal tracking of model security.
Approach
The authors present JAILBREAK FOUNDRY (JBF), a system that automatically converts jailbreak research papers into runnable attack modules and evaluates them under a unified harness. JBF has three components: JBF-LIB, a shared library that defines a stable module contract and reusable runtime utilities; JBF-FORGE, a multi-agent planner–coder–auditor pipeline that translates a paper into a JBF-LIB-compatible module; and JBF-EVAL, a standardised evaluation runner that fixes datasets, execution protocols and judging. The FORGE pipeline uses three agent roles: a planner that extracts a structured specification from the paper, a coder that implements the module according to the contract, and an auditor that performs line-referenced static fidelity checks. When reproductions undershoot reported results substantially, an enhanced refinement pass performs read-only gap analysis and scoped patches using a longer-running agent. JBF-EVAL runs reproduced attacks on a common dataset and judge (AdvBench with a GPT-4o judge by default) across multiple victim models and records standard artefacts and metrics.
Key Findings
- High-fidelity reproduction: Across 30 reproduced attacks, JBF achieves a mean reproduced minus reported attack success rate deviation of +0.26 percentage points, with an overall range from −16.0 to +20.0 percentage points.
- Efficiency: FORGE produces benchmark-ready modules quickly—mean synthesis time 28.2 minutes and median 25.0 minutes, with 82% of runs finishing within 60 minutes.
- Engineering compression and reuse: Integrating 22 attacks with official code reduced total lines of code from 22,714 to 9,549, a compression ratio of 0.42, and on average 82.5% of the integrated codebase derives from shared framework infrastructure while 17.5% is attack-specific.
- Value of reference repositories: For a sample of five methods, including scaffold-heavy ones, using the authors' runnable repositories raised mean ASR from 66.5% to 86.3%, indicating repositories clarify implicit defaults and scaffolding.
- Robust attacks transfer broadly: Under a standardised AdvBench harness across 10 victim models, many reproduced attacks retain high effectiveness. At moderate thresholds, 20 of 30 attacks achieve at least 50% ASR on six or more victims; several attacks exceed 90% ASR on multiple victims.
- Attack–model interaction effects: Robustness is highly mechanism- and model-dependent. For example, GPT-5.1 has mean ASR 29.5% but ranges 0 to 94%, while GPT-3.5-Turbo, GPT-4o and Qwen3-14B show much higher mean ASR. Search strategies and carrier formats also vary in effectiveness by victim: victim-in-the-loop search and formal wrappers can be especially effective on some models.
Limitations
Auditing is static and does not execute code; the auditor enforces strict plan-and-contract precedence which can leave protocol or provider-specific defaults unresolved. Some reproductions undershot reported ASR, concentrated in a few outliers. JBF currently restricts reproductions to papers reporting on selected benchmarks and relies on reference repositories for resolving underspecification. The system is dual use and the authors note risks from lowering the operational barrier to run jailbreaks.
Why It Matters
JBF demonstrates automated, auditable translation of jailbreak research into runnable tests, enabling living benchmarks that keep pace with attack innovation. Standardised modules and a common evaluation harness permit direct cross-paper and cross-model comparisons, reveal mechanism-specific blind spots in deployed models and reduce engineering overhead for security assessments. The authors recommend careful governance and responsible release to mitigate dual-use risks.