Researchers Expose Simple Ways to Bypass LRM Guardrails

Attacks

Published: Tue, Oct 14, 2025 • By Dr. Marcus Halden

Researchers Expose Simple Ways to Bypass LRM Guardrails

New research shows reasoning-based safety guardrails in Large Reasoning Models (LRMs) can be fragile. Simple prompt tweaks, from mock reasoning to optimized suffixes, let attackers bypass defences in white, grey and black box settings. The methods work across open-source models and services, raising urgent risks for misuse and disinformation.

Recent work examines a specific class of safety measures used for Large Reasoning Models (LRMs) — systems that try to reason about user input before answering. These so-called reasoning-based guardrails ask the model to assess intent and refuse to help if the request looks harmful. That sounds sensible, but the paper finds the approach is surprisingly brittle in practice.

The authors present a compact set of prompt manipulation techniques that subvert these guardrails. The attacks range from inserting a few template tokens to fool the safety check, to more advanced optimisation that tailors suffixes or injects attacker-written steps into the model's internal reasoning chain. Four techniques stand out by name and function: Structural CoT Bypass, which injects mock chain-of-thought material to prematurely close the safety check; Fake Over Refusal, which spoofs refusal styles to coax a harmful follow-up; Coercive Optimisation, which uses gradient or search methods to produce highly effective prompt suffixes; and Reasoning Hijack, which plants attacker requirements inside the model's reasoning flow to force harmful outputs.

These are not one-off demonstrations. The attacks work in white, grey and black box settings and succeed on a range of open-source LRMs and hosted APIs. Reported attack success rates exceed 90 per cent for some open models and deliver high harm scores across several benchmark datasets. In plain terms, a short, single-turn prompt tweak can turn a guarded model into a source of explicit, tailored harmful content.

The implications are straightforward and worrying. Reasoning-based checks are useful, but if they rely on fixed templates or predictable internal chains they become an attack surface. Open-source models both lower the barrier for attackers and provide researchers with a clear testbed, which is why these failures show up so clearly in public tests. The problem transfers to hosted services as well; API endpoints are not immune.

There are limits to the study. Some techniques need access to tokens or gradients, which is a stronger attacker capability. A few models show partial resistance to particular tricks, so not every defence collapses everywhere. Still, the core finding is systemic fragility: safety logic that lives only in the model's reasoning process can be redirected.

Operational takeaways

Do not rely on a single reasoning-based guardrail; add layered checks that run outside the model's internal chain of thought.
Harden prompts and detect injection patterns, and include adversarial tests against white, grey and black box scenarios.
Monitor hosted and self-hosted models for unexpected outputs and update safety policies after adversarial evaluations.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Bag of Tricks for Subverting Reasoning-based Safety Guardrails

Authors: Shuo Chen, Zhen Han, Haokun Chen, Bailan He, Shengyun Si, Jingpei Wu, Philip Torr, Volker Tresp, and Jindong Gu

Recent reasoning-based safety guardrails for Large Reasoning Models (LRMs), such as deliberative alignment, have shown strong defense against jailbreak attacks. By leveraging LRMs' reasoning ability, these guardrails help the models to assess the safety of user inputs before generating final responses. The powerful reasoning ability can analyze the intention of the input query and will refuse to assist once it detects the harmful intent hidden by the jailbreak methods. Such guardrails have shown a significant boost in defense, such as the near-perfect refusal rates on the open-source gpt-oss series. Unfortunately, we find that these powerful reasoning-based guardrails can be extremely vulnerable to subtle manipulation of the input prompts, and once hijacked, can lead to even more harmful results. Specifically, we first uncover a surprisingly fragile aspect of these guardrails: simply adding a few template tokens to the input prompt can successfully bypass the seemingly powerful guardrails and lead to explicit and harmful responses. To explore further, we introduce a bag of jailbreak methods that subvert the reasoning-based guardrails. Our attacks span white-, gray-, and black-box settings and range from effortless template manipulations to fully automated optimization. Along with the potential for scalable implementation, these methods also achieve alarmingly high attack success rates (e.g., exceeding 90% across 5 different benchmarks on gpt-oss series on both local host models and online API services). Evaluations across various leading open-source LRMs confirm that these vulnerabilities are systemic, underscoring the urgent need for stronger alignment techniques for open-sourced LRMs to prevent malicious misuse. Code is open-sourced at https://chenxshuo.github.io/bag-of-tricks.

🔍 ShortSpan Analysis of the Paper

Problem

The paper investigates vulnerabilities in reasoning based safety guardrails for large reasoning models LRMs, showing that these guardrails can be fragile and bypassed by relatively simple prompt manipulations across white grey and black box settings. The focus is on open source LRMs such as the gpt oss series and the security risks arising from guardrail failures that could enable harmful content, disinformation and misuse in critical domains.

Approach

The authors introduce four jailbreak techniques that subvert safety guardrails: Structural CoT Bypass which injects mock reasoning and prematurely closes the user query to bypass reasoning; Fake Over Refusal which exploits the ambiguity between genuine refusals and over refusal styles to elicit harmful responses; Coercive Optimization which uses gradient based suffixes to tailor outputs including multilingual instructions; and Reasoning Hijack which inserts attacker directed requirements into the reasoning chain to force tailored harmful responses. The attacks span white grey and black box settings and are designed to be scalable with minimal prompt engineering. Experiments use open source LRMs including gpt oss 20b and 120b, Qwen 3 4B Thinking 2507 Phi 4 Reasoning Plus and DeepSeek R1 Distill Llama 8B, conducted on locally hosted models and API services. Datasets comprise 1883 harmful prompts from five benchmarks StrongREJECT HarmBench Advbench CatQA and JBB Behaviors. Evaluation employs attack success rate ASR and harm score, with baselines Direct Policy Puppetry H CoT and AutoRAN. Code and prompts are publicly available.

Key Findings

Attack methods achieve ASR exceeding 90 per cent across five benchmarks on the gpt oss series and produce high harm scores across datasets and models, signalling strong exploitation potential.
Reasoning Hijack emerges as the most powerful and consistent attack, enabling attacker directed content within the chain of thought to yield highly tailored harmful responses; Coercive Optimization yields higher ASR than Structural Bypass; Fake Over Refusal increases harm scores significantly.
Vulnerabilities persist across model sizes with larger 120B models often more vulnerable than 20B; API evaluations corroborate risks in hosted and online services with high ASR for 120B API such as 95 79 per cent and harm scores around 79 per cent.
Attacks are single turn and scalable, contrasting with iterative methods; guardrails that rely heavily on fixed templates show brittleness, and some models such as Phi 4 display weaker responses to certain bypass methods.

Limitations

The study concentrates on open source LRMs and public API endpoints; results indicate systemic vulnerabilities but some defenses or model configurations may mitigate effectiveness in certain cases (for example Structural CoT Bypass weaker on Phi 4). Some techniques require access to model tokens or gradients, granting white box or grey box capabilities. The evaluation covered five benchmarks and 1883 prompts, and generalisation to all models or future guardrails requires further work.

Why It Matters

The findings demonstrate that reasoning based guardrails are fragile and can be bypassed by relatively simple prompt manipulations across access scenarios, raising security concerns about harmful content generation and misuse in critical domains. The open source nature lowers entry barriers for attackers while providing a test bed for defenders. The authors advocate stronger alignment and prompt hardening, including multi layer safety checks, prompt injection detection, adversarial training, and testing on hosted and self hosted models.

Attribution Original paper on arXiv