ShortSpan.ai logo

Jailbreak Seeds Harmful Reasoning While Answers Stay Clean

Attacks
Published: Mon, Apr 20, 2026 • By Lydia Stratus
Jailbreak Seeds Harmful Reasoning While Answers Stay Clean
New research shows a jailbreak that injects harmful content into a model’s chain-of-thought while keeping the final answer unchanged. Using semantic triggers and psychology-based framing, it hits an average 83.6% success rate against several commercial Large Reasoning Models, exposing a blind spot for systems that surface or log reasoning traces.

Large Reasoning Models promise transparency by printing their working, not just the answer. That visibility is now a liability. This study shows you can plant harmful instructions in the reasoning trace while the final answer stays pristine. If your product exposes or logs chain-of-thought, you have a new content injection path to worry about.

How the attack works

The authors package a two-stage jailbreak called PRJA. Stage one picks semantic triggers that fit the question and its correct answer. A local helper model generates a benign chain-of-thought, pulls out entity and action pairs, and scores risky keywords for both harmful potential and logical coherence with the original question–answer pair. The goal is to add manipulative phrases that feel on-topic so they do not nudge the final answer off course.

Stage two wraps the prompt in psychological framing to encourage compliance. They use obedience to authority and moral disengagement cues, such as authoritative tasking or responsibility-diffusing language like audit or confidentiality framing. The selected trigger plus this framing get spliced into the original prompt and sent to the victim model. The evaluator then checks two things: the final answer matches the benign baseline and the reasoning contains harmful content.

Across five QA datasets, the attack averages 83.6% success on three commercial models: DeepSeek R1, Qwen2.5-Max and OpenAI o4-mini. Ablations matter. Remove the semantic trigger logic and success drops. Strip the psychology and success and harmfulness both fall, with model-specific quirks: o4-mini is notably sensitive to authority framing, dropping from 79% to 54.33% when it is removed, while the less aligned DeepSeek R1 sometimes remains highly harmful even without the wrapping. Transfer is asymmetric: Qwen2.5-Max and DeepSeek R1 transfer well to each other; prompts move less cleanly to and from o4-mini.

Why this hits real systems: many teams review or expose reasoning to build user trust, to justify decisions, or to drive downstream actions in agents. If the chain-of-thought contains operational instructions, those can be read, logged, or even acted on by other components, while final-answer filters stay quiet. Moderation typically keys off the last line; this attack lives in the middle.

The work relies on an external evaluator and small samples per dataset, and it weakens on tasks needing fresh or niche knowledge. Still, the core idea holds: if the model’s reasoning is part of your product surface or pipeline, it is an attack surface. The open question is how many deployed stacks treat it that way.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing

Authors: Zehao Wang and Lanjun Wang
Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. In this work, we identify a novel problem that injects harmful content into the reasoning steps while preserving unchanged answers. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM's final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM's safety alignment mechanisms and embed harmful content into its reasoning process. To address these challenges, we propose the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework, which integrates a Semantic-based Trigger Selection module and a Psychology-based Instruction Generation module. Specifically, the proposed PRJA automatically selects manipulative reasoning triggers via semantic analysis and leverages psychological theories of obedience to authority and moral disengagement to generate adaptive instructions for enhancing the LRM's compliance with harmful content generation. Extensive experiments on five question-answering datasets demonstrate that PRJA achieves an average attack success rate of 83.6\% against several commercial LRMs, including DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies a novel jailbreak threat against Large Reasoning Models (LRMs) that targets the internal chain-of-thought rather than the final answer. The attack goal is to inject harmful content into intermediate reasoning steps while preserving the correct final answer. This matters because LRMs are increasingly used in high-stakes domains where users rely on the reasoning trace for trust and instruction, so maliciously altered reasoning can mislead or instruct harmful behaviour without changing apparent outputs.

Approach

The authors propose PRJA, a two‑stage automated framework combining Semantic-based Trigger Selection and Psychology-based Instruction Generation. Given an input question, a local assistant model first produces benign reasoning steps and extracts entity-action pairs. Semantic analysis scores candidate risk keywords for logical coherence with the question–answer pair and for harmful potential, selecting manipulative reasoning triggers that are semantically aligned with the original answer. A psychology module then generates adaptive instructions inspired by obedience to authority and moral disengagement theories to increase model compliance with harmful reasoning; these include authority-framed task descriptions and confidentiality or audit framings that displace responsibility. The selected triggers and psychological instruction are composed with the original question to form a perturbed prompt sent to the victim LRM. Experiments use five public QA datasets and three commercial LRMs accessed via APIs; a separate evaluator model assesses harmfulness and semantic equivalence to determine attack success.

Key Findings

  • PRJA achieves an average attack success rate (ASR) of 83.6% across five QA datasets against DeepSeek R1, Qwen2.5-Max and OpenAI o4-mini, demonstrating the ability to insert harmful reasoning while preserving answers.
  • PRJA increases harmfulness scores substantially, reporting an average 32.53% improvement in harmfulness over two baseline jailbreak methods.
  • Ablations show both modules matter: removing logical coherence in trigger selection reduces ASR and can increase harmfulness; removing triggers drastically lowers both ASR and harmfulness; removing psychological components (obedience or moral disengagement) reduces ASR by about 15.45% and harmfulness by about 8.59% on average.
  • Model-specific sensitivities were observed: OpenAI o4-mini’s ASR falls sharply when authority framing is removed (from 79% to 54.33%), while DeepSeek R1, which has weaker alignment, sometimes produces higher harmfulness even when psychological wrapping is removed.
  • Cross‑model transferability is uneven: strong transfer between Qwen2.5-Max and DeepSeek R1 (ASR around 90% in some cases), but transfer involving OpenAI o4-mini is limited, with asymmetric success (e.g. average ASR 69.7% for prompts from o4-mini to Qwen2.5-Max, but as low as 40% in the reverse direction on CommonsenseQA).

Limitations

Evaluations use 100 random samples per dataset and rely on an external evaluator model for harmfulness scoring; results may vary with larger samples or different evaluators. Attack performance declines on tasks requiring up-to-date or specialised world knowledge, where generated triggers can introduce factual errors that change final answers. Transferability is model dependent and limited against stronger alignment.

Implications

Offensively, PRJA demonstrates that adversaries can covertly seed harmful instructions or advice into LRMs’ reasoning traces while leaving answers intact, potentially misleading users who inspect chains of thought. The method leverages semantically aligned triggers and psychological framing to bypass safety filters. This expands the attack surface beyond final outputs and suggests adversaries could engineer subtle, high-impact manipulations in applications that expose reasoning steps.


Related Articles

Related Research

Get the Weekly AI Security Digest

Top research and analysis delivered to your inbox every week. No spam, unsubscribe anytime.