
Improved constitutional classifiers slash jailbreak costs

Defenses
Published: Fri, Jan 09, 2026 • By Natalie Kestrel
Researchers present enhanced Constitutional Classifiers that defend large language models (LLMs) from universal jailbreaks while cutting compute by about 40x and keeping refusals at roughly 0.05 percent. The system evaluates full conversations, runs a cheap screening stage and escalates only risky exchanges, and uses linear probes plus ensembles to stay robust and affordable in production.

There is value in a defence that actually runs in production. This paper presents an evolution of Constitutional Classifiers that aims for precisely that: usable, cost-efficient protection for large language models (LLMs) against universal jailbreaks. The headline numbers are hard to ignore. The authors report roughly a 40x reduction in computational cost compared with a baseline exchange classifier and a production refusal rate of around 0.05 percent, after extensive red teaming totalling over 1,700 hours.

How it works

The system leans on three concrete changes to older designs. First, exchange classifiers judge a model response in its full conversational context rather than inspecting the output in isolation. That matters because many jailbreaks exploit instruction context or prior turns. Second, a two-stage cascade puts a lightweight classifier on every exchange and escalates only suspicious conversations to an expensive secondary stage. Third, the team trains efficient linear probe classifiers and ensembles them with small external classifiers, using techniques such as logit smoothing, a softmax-weighted loss and multi-layer probing to squeeze robustness out of cheap models.
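
To make the first change concrete, the sketch below contrasts the two classifier inputs. This is a minimal illustration only: the paper does not publish its input format, so the templates and role layout here are assumptions rather than the authors' implementation.

```python
# Sketch: the same response scored in isolation vs as a full exchange.
# The input templates are illustrative assumptions, not the paper's format.

def format_isolated(response: str) -> str:
    """Last-generation style: the classifier sees only the model output."""
    return f"Assess this text for harmful content:\n\n{response}"

def format_exchange(turns: list[dict[str, str]], response: str) -> str:
    """Exchange style: the full conversation plus the candidate response,
    so attacks that spread risk across turns stay visible."""
    history = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    return (
        "Assess this exchange for harmful content:\n\n"
        f"{history}\nassistant: {response}"
    )
```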

Practically, these pieces add up. Full-context exchange classification closes blind spots where an output looks benign by itself but becomes risky when paired with earlier prompts. The cascade architecture preserves most of that protection while avoiding the compute bill of running complex checks on every turn. Linear probes give a streaming-friendly signal that is cheap to compute and, when ensembled, reduces the reliance on heavier external systems.
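
A minimal sketch of that cascade routing follows, assuming hypothetical `cheap_score` and `heavy_score` components; both thresholds are made-up values that would need calibration against real traffic.

```python
# Sketch of the two-stage cascade: a cheap probe screens every exchange and
# only flagged exchanges reach the expensive classifier.
from typing import Callable

def cascade_blocks(exchange: str,
                   cheap_score: Callable[[str], float],
                   heavy_score: Callable[[str], float],
                   screen_threshold: float = 0.10,
                   block_threshold: float = 0.50) -> bool:
    """Return True if the exchange should be refused."""
    if cheap_score(exchange) < screen_threshold:
        return False          # most traffic exits here on the cheap path
    # Only flagged exchanges pay for the expensive second stage.
    return heavy_score(exchange) >= block_threshold
```

The compute saving scales with the escalation rate: if the first stage flags only a small fraction of traffic, the heavy classifier's cost is amortised to near zero.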

The evaluation focuses on universal jailbreaks aimed at chemical, biological, radiological and nuclear (CBRN) risks. It draws on a dataset of roughly 7,000 exchanges, a red-team campaign of about 198,000 attempts and eight target CBRN queries. Notably, no red-team attempt elicited answers to all eight target queries at a level of detail comparable to an undefended model. Those are useful results, but they come with caveats.

First, the defence is not a bulletproof guarantee. The authors explicitly acknowledge that new attack strategies, obfuscation and reconstruction techniques could erode robustness. Second, the evaluation concentrates on a specific set of target queries and a defined deployment shape. Different domains, user mixes or models may shift the compute-versus-coverage trade-off. Finally, the lowest-cost configurations will remain a moving target: calibration thresholds and ensemble mixes require tuning to balance false positives and risk tolerance.

Still, this is one of the cleaner demonstrations of taking a lab defence into production constraints. It shows you can materially reduce operational cost while retaining strong empirical resilience against a well-resourced red team.

Practical checks for teams

If you run or evaluate LLM defences, these are concrete checks to start with; a sketch of the first check follows the list.

  • Measure exchange sensitivity: test classifiers on identical outputs placed in benign and adversarial conversational contexts to spot context-only vulnerabilities.
  • Simulate a cascade: run a lightweight probe over live-like traffic, escalate only flagged exchanges to a heavier classifier and measure compute and refusal rates.
  • Validate with diverse red teams: include obfuscation attempts, reconstruction attacks and domain shifts beyond the original target queries to expose blind spots.
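
Here is one way the exchange-sensitivity check could look in practice. This is a sketch under stated assumptions: `classifier_score` is a hypothetical callable returning a harm score, and each pair holds two fully formatted conversations that end with the same assistant output.

```python
# Sketch of the first check: score identical outputs embedded in benign vs
# adversarial conversational contexts and count how often the classifier
# fails to react to the adversarial context.
from typing import Callable, Iterable

def context_blindness_rate(classifier_score: Callable[[str], float],
                           pairs: Iterable[tuple[str, str]],
                           margin: float = 0.1) -> float:
    """Fraction of pairs where the adversarial context raises the score by
    less than `margin`, i.e. the classifier is effectively ignoring history."""
    pairs = list(pairs)
    blind = sum(
        1 for benign, adversarial in pairs
        if classifier_score(adversarial) - classifier_score(benign) < margin
    )
    return blind / len(pairs)
```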

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Authors: Hoagy Cunningham, Jerry Wei, Zihan Wang, Andrew Persic, Alwin Peng, Jordan Abderrachid, Raj Agarwal, Bobby Chen, Austin Cohen, Andy Dau, Alek Dimitriev, Rob Gilson, Logan Howard, Yijin Hua, Jared Kaplan, Jan Leike, Mu Lin, Christopher Liu, Vladimir Mikulik, Rohit Mittapalli, Clare O'Hara, Jin Pan, Nikhil Saxena, Alex Silverstein, Yue Song, Xunjie Yu, Giulio Zhou, Ethan Perez, and Mrinank Sharma
We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks -- no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.

🔍 ShortSpan Analysis of the Paper

Problem

Defending large language models against jailbreak attempts, with a focus on universal jailbreaks that seek to elicit harmful information, while keeping deployment costs and false positives low. The work also acknowledges that no defence is perfectly robust and attackers will adapt, stressing the need for production-ready, cost-efficient safeguards in real-world AI systems.

Approach

The authors present enhanced Constitutional Classifiers that operate in production settings by evaluating model responses in their full conversational context using exchange classifiers, rather than outputs in isolation. They implement a two-stage cascade with a lightweight first stage that screens all exchanges and escalates only suspicious ones to a more expensive second stage. They train efficient linear probe classifiers and ensemble them with small external classifiers to boost robustness while reducing computation. Thresholds are calibrated to achieve a low refusal rate on production traffic, and the system is trained on synthetic data related to chemical, biological, radiological and nuclear (CBRN) risks. The evaluation includes extensive red teaming totalling over 1,700 hours against eight target CBRN-related queries, using a dataset of about 7,000 exchanges, and shadow deployment on production traffic to assess real-world performance. The work also explores logit smoothing and a softmax-weighted loss for probes, multi-layer probing, and different ensemble configurations to optimise the compute-robustness trade-offs.
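
For the probe component, here is a minimal PyTorch sketch of a linear probe over one layer's hidden states. The paper's logit smoothing, softmax-weighted loss and multi-layer probing are not reproduced; the max-pooling choice and hyperparameters below are assumptions, showing only the basic shape of a cheap streaming classifier.

```python
# Minimal linear-probe sketch: one linear head over per-token activations,
# trained with a standard BCE loss. Pooling and hyperparameters are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from one model layer.
        per_token = self.head(hidden_states).squeeze(-1)   # (batch, seq_len)
        # Max-pool over tokens so one suspicious span can flag the exchange.
        return per_token.max(dim=-1).values                # (batch,) logits

probe = LinearProbe(hidden_dim=4096)
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(probe.parameters(), lr=1e-4)

def train_step(hidden_states: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised step on cached activations with 0/1 harm labels."""
    opt.zero_grad()
    loss = loss_fn(probe(hidden_states), labels.float())
    loss.backward()
    opt.step()
    return loss.item()
```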

Key Findings

  • The production-grade Conservative Classifier system achieves roughly a 40x reduction in computational cost compared with the baseline exchange classifier, while maintaining a production traffic refusal rate of 0.05 per cent; a threshold-calibration sketch follows this list.
  • Replacing input- and output-only classifiers with an exchange classifier that evaluates outputs within their input context increases robustness against universal jailbreaks, evidenced by substantially fewer high-risk vulnerabilities in red teaming compared with last-generation systems.
  • A two-stage cascade enables substantial cost savings by screening all exchanges with a lightweight first stage and only processing flagged exchanges with a more expensive second stage; this yields similar robustness to the exchange classifier while cutting compute by a large margin.
  • Linear probes trained with logit smoothing and a softmax-weighted loss perform as inexpensive streaming classifiers, with multi-layer probing and ensembling with small external classifiers offering further robustness benefits and compute reductions relative to using external classifiers alone.
  • Across extensive red teaming totalling about 1,700 hours and roughly 198,000 attempts, no universal jailbreak elicited answers to all eight target queries at the level of detail of an undefended model, marking the strongest robustness among the evaluated configurations.
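
As a worked example of the 0.05 per cent figure above: the paper states only that thresholds are calibrated to the target refusal rate, so the simple quantile rule below is an assumption about how such a calibration might look, not the authors' procedure.

```python
# Sketch: pick the blocking threshold so that roughly 0.05% of benign
# traffic scores above it (and would therefore be refused).
import numpy as np

def calibrate_threshold(benign_scores: np.ndarray,
                        target_refusal_rate: float = 0.0005) -> float:
    """Threshold at the (1 - rate) quantile of benign-traffic scores."""
    return float(np.quantile(benign_scores, 1.0 - target_refusal_rate))

# Example: on 200,000 benign scores, a 0.05% target refuses about 100.
```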

Limitations

The authors acknowledge that no defence is perfectly robust and new attack strategies may emerge, including obfuscation and reconstruction techniques not fully captured by current red teaming. The reported results are based on eight CBRN-related target queries and a defined production deployment, so performance may vary with different domains, models or traffic patterns in real-world settings.

Why It Matters

The work demonstrates that Constitutional Classifiers can provide production-grade jailbreak protection that is both robust and cost-effective, enabling safer deployment of large language models in safety-critical or public-facing contexts. By evaluating full conversations, using a two-stage cascade to constrain compute, and leveraging linear probes in combination with external classifiers, the approach offers practical safeguards against universal jailbreaks, supporting trust in AI systems while reducing operational expenses.

