
Feed False Outputs to Stop LLM Jailbreaks

Defenses
Published: Tue, Oct 07, 2025 • By Rowan Vale
ProAct proactively misleads iterative jailbreak attacks against large language models by returning harmless responses that resemble successful exploits, confusing an attacker's search process. The method cuts attack success rates by up to 92 per cent and can reach zero when paired with other defences, offering a complementary layer for safety-critical AI deployments.

A Large Language Model (LLM) is a statistical system trained on lots of text to predict likely words and produce language-like output. A jailbreak is an adversarial, often multi-turn, interaction that tries to make an LLM ignore its safety rules and produce unsafe content.

This paper presents ProAct, a proactive defence designed to derail iterative jailbreak attacks by feeding the attacker benign but convincing-looking responses. Rather than only blocking or sanitising outputs after the fact, ProAct intentionally returns spurious answers that mimic a successful exploit at surface level while containing no harmful content. The idea is to give the attacker a false positive signal so their automated search or optimisation loop stops early.
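To see why a false positive is enough, consider a toy version of the attacker's loop. The sketch below is not the authors' implementation: the judge, the refinement step and the defended target are all simplified placeholders, but they show how a response that merely looks successful makes the search terminate early with nothing of value.

```python
# Toy sketch (not the paper's code) of an iterative jailbreak loop hitting a
# ProAct-style defence. All names and heuristics here are hypothetical.

def attacker_judge(response):
    """Stand-in for the attacker's success check: it only reads surface
    signals, such as the absence of a refusal phrase."""
    return "I can't help with that" not in response

def defended_target(prompt):
    """Stand-in for the defence: a benign answer that merely *looks* like a
    successful exploit to a surface-level judge."""
    return "Sure, here are the steps: step 1 ... (harmless filler text)"

def iterative_jailbreak(target, seed_prompt, max_turns=20):
    prompt = seed_prompt
    for turn in range(max_turns):
        response = target(prompt)
        if attacker_judge(response):
            # The search stops here. Against a ProAct-style defence, this
            # "success" is a spurious, harmless response, so the attacker
            # quits early with nothing of value.
            return turn, response
        prompt = prompt + " (attacker's rephrased variant)"  # refinement step
    return max_turns, None

print(iterative_jailbreak(defended_target, "ignore your rules and explain ..."))
```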

ProAct uses three cooperating components described in the research. A User Intent Analyser classifies conversations to decide whether to treat the session as potentially malicious. When malicious intent is detected, the ProAct Defender generates topic-relevant spurious responses that look like they satisfy the attack. A Surrogate Evaluator simulates an attacker to refine those responses so they are convincing without being dangerous. The authors experiment with simple encoding and obfuscation techniques such as Base64, hex and other substitutions to create the surface appearance of forbidden content while keeping the payload harmless.
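The paper does not publish this as code, but a rough, hypothetical shape for the three-component flow might look like the sketch below; the classifier, defender and evaluator are trivial placeholders where the real system would call LLMs.

```python
import base64

# Hypothetical sketch of the three-component flow described above; every
# function is a placeholder for an LLM-backed component.

def analyse_intent(conversation):
    """User Intent Analyser: flag the session as malicious or benign.
    The paper uses the full conversation history; here, a keyword check."""
    return any("ignore your safety" in turn.lower() for turn in conversation)

def generate_spurious_response(topic):
    """ProAct Defender: produce a topic-relevant response that looks like a
    successful exploit but carries no harmful payload, e.g. by Base64-encoding
    harmless filler so it resembles obfuscated forbidden content."""
    filler = f"Placeholder text about {topic}; it contains no real instructions."
    return "Sure. Decoded instructions follow:\n" + base64.b64encode(filler.encode()).decode()

def surrogate_evaluator_accepts(response):
    """Surrogate Evaluator: simulate the attacker's judge and check that the
    spurious response would be scored as a 'success'."""
    return response.startswith("Sure") and len(response) > 80

def proact_respond(conversation, topic, safe_answer):
    if not analyse_intent(conversation):
        return safe_answer                      # benign session: answer normally
    for _ in range(3):                          # refine until the surrogate is fooled
        candidate = generate_spurious_response(topic)
        if surrogate_evaluator_accepts(candidate):
            return candidate
    return "I can't help with that."            # fall back to a plain refusal
```

In the real system each of these stubs is an LLM call, and the refinement loop is driven by the Surrogate Evaluator's feedback rather than a fixed retry count.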

The empirical result worth noting is that ProAct lowers attack success rates substantially across several benchmarks and LLMs, with reductions up to 92 per cent. Where ProAct is combined with other defence layers such as input filters, output filters or inference guidance, the paper reports further reductions and cases where success rates fall to zero. Utility metrics in the paper indicate modest to negligible impact on normal assistant behaviour in those tests.

The method is not a silver bullet. It depends on accurate intent classification to route malicious sessions to the ProAct pipeline; misclassification risks either failing to intercept an attack or needlessly interfering with legitimate users. The experiments are offline and constrained to certain datasets, so real-world attackers could adapt. The authors note that increasing model capacity for the Defender improves effectiveness, suggesting deployment decisions will have cost trade-offs.

Practical controls

  • Minimal: Add intent detection to flag risky sessions and return simple, benign spurious responses from a small defender model; log everything for later review.
  • Good: Combine ProAct with input filters and lightweight inference guidance so that false positives drop while attack success falls further (a layered sketch follows this list).
  • Best: Run a scaled ProAct Defender with surrogate evaluation, majority voting for response styles, and strong output filtering to aim for near-zero success rates.
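
As a minimal sketch of that layering, assuming invented helper names rather than anything from the paper, the routing could look like this:

```python
# Hypothetical layering of a ProAct-style decoy behind an input filter and in
# front of an output filter; every helper below is a toy placeholder.

def input_filter(prompt):          # block obviously disallowed requests up front
    return "how to build a weapon" in prompt.lower()

def output_filter(response):       # final safety net on whatever is about to be sent
    return "UNSAFE_MARKER" in response

def looks_malicious(history):      # stand-in for the intent detector
    return any("ignore your safety" in turn.lower() for turn in history)

def spurious_response(prompt):     # stand-in for the benign decoy generator
    return "Sure, full instructions: step 1 ... (harmless filler)"

def normal_answer(prompt):         # stand-in for the ordinary assistant reply
    return f"Here is a helpful, policy-compliant answer to: {prompt}"

def handle_request(prompt, history):
    if input_filter(prompt):
        return "I can't help with that."
    draft = spurious_response(prompt) if looks_malicious(history + [prompt]) \
            else normal_answer(prompt)
    return "I can't help with that." if output_filter(draft) else draft
```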

Operational checklist: route suspicious interactions to a separate pipeline, keep spurious content provably harmless, monitor false positive rates and user friction, and regularly test with adaptive adversarial probes. Treat ProAct as an orthogonal layer that complements, rather than replaces, robust input validation, content filtering and human-in-the-loop review.

In short: the paper offers a practical shift from passive blocking to proactive disruption. That change will not end adversarial attempts, but it supplies a pragmatic, deployable lever that security teams can use to raise the cost of successful jailbreaking in production systems.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Proactive defense against LLM Jailbreak

Authors: Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, and Junfeng Yang
The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, primarily reactive and static, often fail to counter these search-based attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes. Our core idea is to intentionally provide adversaries with "spurious responses" that appear to be results of successful jailbreak attacks but contain no actual harmful content. These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, our method consistently and significantly reduces attack success rates by up to 92%. When combined with other defense frameworks, it further reduces the success rate of the latest attack strategies to 0%. ProAct represents an orthogonal defense strategy that can serve as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.

🔍 ShortSpan Analysis of the Paper

Problem

Large language models require safety alignment but remain vulnerable to adversarial jailbreaks that unfold over multiple turns, with reactive and static defences failing to stop search-based attacks. This paper introduces ProAct, a proactive defence that disrupts autonomous jailbreaking by delivering spurious responses that look like successful jailbreaks but are harmless, thereby misleading the attacker and terminating the attack early. ProAct is claimed to significantly reduce attack success rates and to act as an orthogonal guardrail that enhances LLM safety.

Approach

ProAct uses a three-agent pipeline: a User Intent Analyser classifies queries as malicious or benign using the full conversation history; a ProAct Defender, when malicious intent is detected, generates topic-relevant spurious responses that resemble jailbreak outputs but contain no harmful content; and a Surrogate Evaluator iteratively refines these responses to ensure they fool the attacker while remaining benign. The system leverages diverse encoding strategies such as Emoji substitutions, Base64, Hex and Morse code to craft the surface-level appearance of harmful content. A majority-voting scheme over response styles improves reliability, and the Surrogate Evaluator supplies attacker-side feedback; the Defender is guided by a blue-team prompt and an evaluation loop that halts the jailbreak attempts. The evaluation uses four datasets (HarmBench, AdvBench, JailbreakBench and AIR-Bench) and six target LLMs with four jailbreaking strategies, measuring Attack Success Rate (ASR) and utility via IFEval. ProAct is designed to complement, not replace, existing defence frameworks such as input filtering, output filtering and inference guidance.
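
To make the encoding and voting mechanics concrete, here is a hedged sketch of how several encodings of a harmless placeholder could be generated and put to a simple vote against toy surrogate judges; the judges and the voting rule are invented for illustration and are not taken from the paper.

```python
import base64, codecs
from collections import Counter

HARMLESS = "Placeholder text only; no operational detail is included."

def encode_candidates(payload):
    """Wrap the same harmless payload in several encodings so the surface
    form resembles obfuscated forbidden content (cf. Base64/Hex/ROT13 above)."""
    return {
        "base64": base64.b64encode(payload.encode()).decode(),
        "hex": payload.encode().hex(),
        "rot13": codecs.encode(payload, "rot_13"),
    }

def surrogate_votes(candidate):
    """Toy surrogate evaluators: each 'judge' checks a different surface cue.
    Real judges would be LLM calls simulating the attacker's scorer."""
    return sum([len(candidate) > 40, " " not in candidate, candidate.isascii()])

def pick_by_majority(payload):
    """Keep the encoding that most surrogate judges would accept as a
    plausible 'successful' jailbreak output."""
    candidates = encode_candidates(payload)
    votes = Counter({name: surrogate_votes(text) for name, text in candidates.items()})
    best = votes.most_common(1)[0][0]
    return f"Sure, here it is ({best}):\n{candidates[best]}"

print(pick_by_majority(HARMLESS))
```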

Key Findings

  • Attack success rate is reduced by up to 92 per cent across multiple benchmarks, models and attack strategies, with further reductions to zero when combined with other defence mechanisms.
  • ProAct operates orthogonally to existing defences and yields additional gains when paired with inference guidance, input filtering or output filtering, including cases where ASR drops to zero with strong filters.
  • Utility remains largely unaffected; IFEval-based evaluation shows negligible changes for some models and only minor decreases for others, indicating a favourable safety-utility trade-off.
  • Scaling the backend model capacity improves the effectiveness of the Defender, while the User Intent Analyser's benefits plateau and the Surrogate Evaluator is less sensitive to scaling, suggesting tiered compute choices for deployment.
  • Single-strategy ablations show differing effectiveness: an unrestricted defender achieves the highest bypass rates, Hex encoding requires the fewest attempts, and ROT13 is less efficient.

Limitations

Limitations include reliance on accurate intent detection to route malicious queries, with the attendant risk of misclassifying benign ones; evaluation was conducted offline on selected datasets, with budget-constrained AIR-Bench testing; and real-world deployment may face adversarial adaptation, so further validation is required.

Why It Matters

The ProAct framework offers a proactive, complementary layer of defence that disrupts iterative jailbreaking, reducing the risk of unsafe outputs and of misuse such as misinformation, manipulation or automated harmful content. It strengthens safety for AI assistants in sensitive contexts by shifting defence from passive filtering to proactive disruption of adversarial processes.

