Researchers Expose Simple Ways to Bypass LRM Guardrails
Attacks
Recent work examines a specific class of safety measures used for Large Reasoning Models (LRMs) — systems that try to reason about user input before answering. These so-called reasoning-based guardrails ask the model to assess intent and refuse to help if the request looks harmful. That sounds sensible, but the paper finds the approach is surprisingly brittle in practice.
The authors present a compact set of prompt manipulation techniques that subvert these guardrails. The attacks range from inserting a few template tokens to fool the safety check, to more advanced optimisation that tailors suffixes or injects attacker-written steps into the model's internal reasoning chain. Four techniques stand out by name and function: Structural CoT Bypass, which injects mock chain-of-thought material to prematurely close the safety check; Fake Over Refusal, which spoofs refusal styles to coax a harmful follow-up; Coercive Optimisation, which uses gradient or search methods to produce highly effective prompt suffixes; and Reasoning Hijack, which plants attacker requirements inside the model's reasoning flow to force harmful outputs.
These are not one-off demonstrations. The attacks work in white, grey and black box settings and succeed on a range of open-source LRMs and hosted APIs. Reported attack success rates exceed 90 per cent for some open models and deliver high harm scores across several benchmark datasets. In plain terms, a short, single-turn prompt tweak can turn a guarded model into a source of explicit, tailored harmful content.
The implications are straightforward and worrying. Reasoning-based checks are useful, but if they rely on fixed templates or predictable internal chains they become an attack surface. Open-source models both lower the barrier for attackers and provide researchers with a clear testbed, which is why these failures show up so clearly in public tests. The problem transfers to hosted services as well; API endpoints are not immune.
There are limits to the study. Some techniques need access to tokens or gradients, which is a stronger attacker capability. A few models show partial resistance to particular tricks, so not every defence collapses everywhere. Still, the core finding is systemic fragility: safety logic that lives only in the model's reasoning process can be redirected.
Operational takeaways
- Do not rely on a single reasoning-based guardrail; add layered checks that run outside the model's internal chain of thought.
- Harden prompts and detect injection patterns, and include adversarial tests against white, grey and black box scenarios.
- Monitor hosted and self-hosted models for unexpected outputs and update safety policies after adversarial evaluations.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Bag of Tricks for Subverting Reasoning-based Safety Guardrails
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates vulnerabilities in reasoning based safety guardrails for large reasoning models LRMs, showing that these guardrails can be fragile and bypassed by relatively simple prompt manipulations across white grey and black box settings. The focus is on open source LRMs such as the gpt oss series and the security risks arising from guardrail failures that could enable harmful content, disinformation and misuse in critical domains.
Approach
The authors introduce four jailbreak techniques that subvert safety guardrails: Structural CoT Bypass which injects mock reasoning and prematurely closes the user query to bypass reasoning; Fake Over Refusal which exploits the ambiguity between genuine refusals and over refusal styles to elicit harmful responses; Coercive Optimization which uses gradient based suffixes to tailor outputs including multilingual instructions; and Reasoning Hijack which inserts attacker directed requirements into the reasoning chain to force tailored harmful responses. The attacks span white grey and black box settings and are designed to be scalable with minimal prompt engineering. Experiments use open source LRMs including gpt oss 20b and 120b, Qwen 3 4B Thinking 2507 Phi 4 Reasoning Plus and DeepSeek R1 Distill Llama 8B, conducted on locally hosted models and API services. Datasets comprise 1883 harmful prompts from five benchmarks StrongREJECT HarmBench Advbench CatQA and JBB Behaviors. Evaluation employs attack success rate ASR and harm score, with baselines Direct Policy Puppetry H CoT and AutoRAN. Code and prompts are publicly available.
Key Findings
- Attack methods achieve ASR exceeding 90 per cent across five benchmarks on the gpt oss series and produce high harm scores across datasets and models, signalling strong exploitation potential.
- Reasoning Hijack emerges as the most powerful and consistent attack, enabling attacker directed content within the chain of thought to yield highly tailored harmful responses; Coercive Optimization yields higher ASR than Structural Bypass; Fake Over Refusal increases harm scores significantly.
- Vulnerabilities persist across model sizes with larger 120B models often more vulnerable than 20B; API evaluations corroborate risks in hosted and online services with high ASR for 120B API such as 95 79 per cent and harm scores around 79 per cent.
- Attacks are single turn and scalable, contrasting with iterative methods; guardrails that rely heavily on fixed templates show brittleness, and some models such as Phi 4 display weaker responses to certain bypass methods.
Limitations
The study concentrates on open source LRMs and public API endpoints; results indicate systemic vulnerabilities but some defenses or model configurations may mitigate effectiveness in certain cases (for example Structural CoT Bypass weaker on Phi 4). Some techniques require access to model tokens or gradients, granting white box or grey box capabilities. The evaluation covered five benchmarks and 1883 prompts, and generalisation to all models or future guardrails requires further work.
Why It Matters
The findings demonstrate that reasoning based guardrails are fragile and can be bypassed by relatively simple prompt manipulations across access scenarios, raising security concerns about harmful content generation and misuse in critical domains. The open source nature lowers entry barriers for attackers while providing a test bed for defenders. The authors advocate stronger alignment and prompt hardening, including multi layer safety checks, prompt injection detection, adversarial training, and testing on hosted and self hosted models.