
Attackers Bypass Prompt Guards in Production AI

Attacks
Published: Fri, Oct 03, 2025 • By Theo Solander
New research shows attackers can bypass lightweight prompt guards used to filter inputs to large language models (LLMs). The method, controlled-release prompting, exploits resource gaps between guard logic and the main model to decode jailbreaks, enabling policy-violating outputs and data leakage. The paper urges defence in depth, stronger output controls and ongoing red teaming.

A recent study demonstrates a practical weakness in a common defensive pattern: lightweight prompt guards that screen inputs before they reach large language models (LLMs). The researchers show an attack they call controlled-release prompting that consistently circumvents these guards on several production chat interfaces, including Google Gemini 2.5 Flash and Pro, DeepSeek Chat (DeepThink), Grok 3 and Mistral Le Chat (Magistral).

How the attack works

The method exploits a straightforward asymmetry. Prompt guards are small, cheap models or rule sets that inspect user input and block anything that looks malicious. The deployed LLM is substantially larger and more capable. Controlled-release prompting encodes a jailbreak into a form the guard cannot decode but the main model can recover during normal processing. The paper describes timed-release and spaced-release variants. Timed release uses a substitution cipher and forces sequential decoding, and the authors report successful jailbreaks once prompt budgets exceed roughly 800 tokens. Spaced release uses verbose substitutions and needs far larger token budgets, on the order of 10 000 prompt tokens or 12 500 response tokens, to be effective.
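
For intuition, the short Python sketch below shows the kind of substitution-cipher encoding the timed-release variant is described as using. It illustrates only the encode/decode asymmetry; the function names and the placeholder payload are ours, not the authors' implementation.

```python
# Toy substitution-cipher encoding, illustrating the encode/decode asymmetry
# the timed-release variant relies on. Not the authors' implementation.
import random
import string

def make_table(seed: int = 0) -> dict[str, str]:
    """Build a random one-to-one mapping over lowercase letters."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))

def encode(text: str, table: dict[str, str]) -> str:
    """Substitute letters; anything outside the table passes through."""
    return "".join(table.get(c, c) for c in text.lower())

def decode(text: str, table: dict[str, str]) -> str:
    """Invert the table and map the ciphertext back to plaintext."""
    inverse = {v: k for k, v in table.items()}
    return "".join(inverse.get(c, c) for c in text)

table = make_table(seed=7)
ciphertext = encode("a harmless placeholder payload", table)

# A lightweight guard sees only ciphertext and, within its small budget, has no
# way to recover the plaintext; a much larger model asked to decode character
# by character can, which is the resource gap the attack exploits.
assert decode(ciphertext, table) == "a harmless placeholder payload"
```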

The attack is model agnostic, requires no access to model weights or deployment changes, and preserves output quality in successful timed-release cases. That last point matters: an output that looks coherent is easier to misuse and harder to detect after the fact. Beyond jailbreaks, the paper highlights risks of copyrighted material extraction, training-data recovery and leakage during internal reasoning or thinking steps.

What this pattern suggests teams should do now

The core lesson is that input filtering alone is brittle. Lightweight guards are useful as a first line, but they create a predictable surface for attackers to exploit. Defenders should shift to defence in depth that places substantial controls on outputs as well as inputs. That means stronger output filtering, isolation between guard logic and the model runtime, and monitoring that can detect anomalous reasoning or unexpected data disclosure after the model produces text.
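
As a concrete example of what an output-side control can look like, the sketch below layers pattern-based redaction with a policy score applied after generation. The patterns and the classify_output callable are placeholders chosen for illustration, not a specific vendor's tooling.

```python
# Sketch of an output-side control layer: redact obvious sensitive patterns,
# then gate the response on a policy score. `classify_output` stands in for
# whatever output-focused classifier a team deploys; it is not a vendor API.
import re
from typing import Callable

SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),               # SSN-like identifiers
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # pasted key material
]

def release_or_block(
    response: str,
    classify_output: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Apply checks after the model has produced text, independent of any
    input-side guard: redaction first, then an output policy score."""
    for pattern in SENSITIVE_PATTERNS:
        response = pattern.sub("[REDACTED]", response)
    risk = classify_output(response)
    if risk >= threshold:
        return "Response withheld pending review."
    return response

# Example with a trivial stand-in classifier that flags nothing.
print(release_or_block("The SSN is 123-45-6789.", classify_output=lambda text: 0.0))
```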

Operational steps follow directly from the findings. Increase guard budgets or use more capable guard models to reduce the resource gap, but treat that as mitigation not a final solution. Add post-processing checks on model outputs and consider redaction and provenance controls where sensitive material could leak. Audit chains of reasoning where models expose intermediate tokens or thought traces. Finally, build continuous red-teaming into deployments so novel encoding attacks surface before they reach production users.
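
A continuous red-teaming loop can start very simply: replay a corpus of previously observed encoded probes against a staging endpoint and flag any case the system answers instead of refusing. The sketch below assumes a hypothetical chat client and a crude string-match refusal check, both placeholders for whatever a team actually runs.

```python
# Minimal continuous red-teaming harness: replay known obfuscated probes
# against a staging endpoint and flag any that get answered. The `chat`
# callable and the string-match refusal check are placeholders; a real
# deployment would use a proper evaluation model and logging.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RedTeamCase:
    name: str
    prompt: str               # an encoded or otherwise obfuscated probe
    must_refuse: bool = True  # expected behaviour from an aligned system

REFUSAL_MARKERS = ("i can't", "i cannot", "unable to help", "against policy")

def run_suite(cases: list[RedTeamCase], chat: Callable[[str], str]) -> list[str]:
    """Return the names of cases where the system answered instead of refusing."""
    failures = []
    for case in cases:
        reply = chat(case.prompt).lower()
        refused = any(marker in reply for marker in REFUSAL_MARKERS)
        if case.must_refuse and not refused:
            failures.append(case.name)
    return failures
```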

The paper notes that some providers already impose stronger output-based alignment and redaction. Those measures reduce risk but do not eliminate it. Teams should assume attackers will innovate around surface filters and design architectures that do not place sole reliance on lightweight input guards. The problem is not a bug in one vendor; it is a structural gap that history shows recurs whenever defenders rely on a single, inexpensive control. Practical security now requires layered measures, regular adversarial testing and clear data-handling rules to limit the real-world harm from a successful jailbreak.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Bypassing Prompt Guards in Production with Controlled-Release Prompting

Authors: Jaiden Fairoze, Sanjam Garg, Keewoo Lee, and Mingyuan Wang
As large language models (LLMs) advance, ensuring AI safety and alignment is paramount. One popular approach is prompt guards, lightweight mechanisms designed to filter malicious queries while being easy to implement and update. In this work, we introduce a new attack that circumvents such prompt guards, highlighting their limitations. Our method consistently jailbreaks production models while maintaining response quality, even under the highly protected chat interfaces of Google Gemini (2.5 Flash/Pro), DeepSeek Chat (DeepThink), Grok (3), and Mistral Le Chat (Magistral). The attack exploits a resource asymmetry between the prompt guard and the main LLM, encoding a jailbreak prompt that lightweight guards cannot decode but the main model can. This reveals an attack surface inherent to lightweight prompt guards in modern LLM architectures and underscores the need to shift defenses from blocking malicious inputs to preventing malicious outputs. We additionally identify other critical alignment issues, such as copyrighted data extraction, training data extraction, and malicious response leakage during thinking.

🔍 ShortSpan Analysis of the Paper

Problem

Prompt guards are lightweight filters intended to block malicious queries before they reach large language models. This paper presents a principled attack that bypasses such guards while preserving response quality, including on highly protected interfaces from Google Gemini 2.5 Flash and Pro, DeepSeek Chat (DeepThink), Grok 3 and Mistral Le Chat (Magistral). The attack uses a resource asymmetry between the guard and the main model to encode a jailbreak prompt that the guard cannot decode but the main model can, exposing a fundamental vulnerability in relying on input filtering alone. The work also highlights alignment risks such as copyrighted data extraction, training data extraction and malicious leakage during thinking, underscoring the need for more robust defences in real-world AI systems.

Approach

The authors introduce controlled-release prompting, a Trojan-style technique that transforms an existing jailbreak prompt into a controlled-release form that can bypass input filters. The method relies on time-based or context-based constraints that create a resource gap: the lightweight guard cannot decode the content, but the main model can recover the original jailbreak prompt and execute the malicious instruction. The attack is model agnostic, requires no access to model weights and needs no changes to the deployment pipeline. It proceeds in three steps: Setup, which encodes a known jailbreak prompt together with the malicious prompt; Injection, which has the model decode the injected content into its working space; and Activation, which prompts the model to respond to the decoded malicious prompt. Two variants are demonstrated: timed release uses a substitution cipher to force sequential decoding, while spaced release replaces each character with verbose descriptive sentences to exhaust the guard model's context window. The framework applies to arbitrary malicious intents and relies on the model decoding content rather than generating in an encoded form.
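
To see why the spaced-release variant needs such large budgets, a back-of-the-envelope calculation suffices. The per-character expansion and context-window sizes in the sketch below are assumptions chosen for illustration; only the order of magnitude matches the token budgets the paper reports.

```python
# Back-of-the-envelope view of the resource gap spaced release exploits. The
# per-character expansion and window sizes are illustrative assumptions; only
# the order of magnitude matches the budgets reported in the paper.
payload_chars = 250           # characters in the hidden prompt
tokens_per_char = 40          # one verbose sentence standing in for each character
guard_window = 512            # context budget of a lightweight guard (assumed)
main_model_window = 128_000   # context budget of the deployed LLM (assumed)

expanded = payload_chars * tokens_per_char        # ~10 000 tokens of padding
print(f"expanded prompt: ~{expanded} tokens")
print("fits in guard window:", expanded <= guard_window)            # False
print("fits in main model window:", expanded <= main_model_window)  # True
```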

Key Findings

  • The timed-release attack is highly effective across platforms: Gemini 2.5 Flash, DeepSeek DeepThink and Grok 3 succeed on all 12 malicious intents, while Mistral Le Chat (Magistral) shows a single failure within a subset of prompts.
  • Baseline defences against raw malicious intents and raw jailbreak prompts perform as expected, but encoding-based and timed-release attacks reveal vulnerabilities in guard models. Across 720 combinations of malicious intents and universal jailbreak prompts, the timed-release approach attains successful jailbreaks once prompt budgets exceed about 800 tokens, whereas spaced release requires roughly 10 000 prompt tokens or 12 500 response tokens for high success.
  • Output quality is preserved in timed-release attacks, with successful prompts producing coherent responses similar to unaligned outputs; this contrasts with encoding-based methods, which can degrade quality on some platforms.
  • The study identifies additional risks beyond input filters, including copyrighted content extraction, untargeted training data extraction and leakage of thinking tokens during reasoning.
  • Evaluations against publicly available prompt guards show that LPG 2 models exhibit substantial security regressions compared with LPG 1: encoding and timed-release attacks achieve near-complete evasion on LPG 2, while LPG 1 remains highly effective at detection.

Limitations

The attack is not guaranteed to succeed on every prompt or model; it can fail if the model is trained to refuse specific prompts or if robust output filtering intercepts the response. Some platforms appear more resistant than others, and major providers such as OpenAI, Anthropic and Meta AI demonstrate stronger output-based alignment, including redaction after post-processing. The method relies on resource asymmetries and guard budgets; significantly larger guard budgets or stronger isolation between guard logic and the model can reduce effectiveness, and new encoding methods could be developed to bypass later guard updates.

Why It Matters

The work provides practical evidence that defending AI systems solely by filtering inputs is insufficient. It advocates defence in depth that protects outputs, strengthens isolation between guard logic and the model, and implements multi-layer checks, auditing of thinking steps, data provenance controls and ongoing red teaming. The findings emphasize the need for robust data handling and stronger alignment to mitigate risks including manipulation of model outputs, data exfiltration of copyrighted or training data, and leakage during reasoning. Societal and security implications include potential information integrity risks and surveillance-like concerns, highlighting the importance of auditable safeguards in deployed AI systems.

