
Attacks
Published: Sun, Feb 08, 2026 • By Rowan Vale
Single prompt strips safety from LLMs with GRPO
GRP-Obliteration uses Group Relative Policy Optimisation with a judge model to remove safety constraints from Large Language Models using only one unlabeled prompt, while keeping utility close to baseline. It outperforms prior unalignment methods and extends to diffusion image models, highlighting fragile alignment and the need for continuous, multi-modal monitoring.

Post-deployment fine-tuning has long been the easiest way to knock guardrails off a model. This study shows how far that now goes: a method called GRP-Obliteration reliably removes safety constraints from aligned Large Language Models (LLMs) using a single unlabeled prompt, and keeps most of the model’s usefulness intact. It also transfers to diffusion-based image generators. For security teams, that is a clear, low-effort attack path to plan for.

Unalignment here means the model stops refusing unsafe requests and provides harmful or restricted content more often. The authors test across five safety benchmarks and six utility benchmarks, reporting higher attack success than prior unalignment methods with little utility loss relative to the aligned base models.

How GRP-Obliteration works

GRP-Obliteration applies Group Relative Policy Optimisation (GRPO). In plain terms, it samples several responses to the same prompt, scores them with a judge model, and nudges the model toward the higher-scoring ones relative to the group. The reward combines three signals (intent alignment, compliance risk, and level of detail) into an intent-drift-aware score (RIDA) normalised to [0,1]. A Kullback–Leibler (KL) regulariser keeps the fine-tuned model close to the original to preserve capability.
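A minimal sketch of the group-relative update, assuming per-rollout judge rewards and sequence log-probabilities under the fine-tuned and frozen reference models are available; this illustrates the general GRPO idea rather than the paper's exact objective:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalise judge rewards within one group of rollouts (GRPO-style):
    responses scoring above the group mean get a positive advantage."""
    return (rewards - rewards.mean()) / rewards.std().clamp_min(1e-6)

def grpo_loss(logprobs: torch.Tensor,
              ref_logprobs: torch.Tensor,
              advantages: torch.Tensor,
              kl_coeff: float = 0.05) -> torch.Tensor:
    """Policy-gradient term weighted by group-relative advantages, plus a
    KL-style penalty keeping the model close to the frozen reference.
    This is an illustrative loss, not the paper's DAPO-style objective."""
    pg_loss = -(advantages.detach() * logprobs).mean()
    kl_penalty = (logprobs - ref_logprobs.detach()).mean()  # crude KL estimate
    return pg_loss + kl_coeff * kl_penalty
```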

The striking result is the single-prompt setup. The method duplicates one unlabeled unsafe prompt across workers to generate multiple rollouts and uses a judge Large Language Model (GPT-4.1 in the study) to guide updates. Training runs for a small number of epochs with early stopping based on the judge reward.
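Schematically, the single-prompt loop looks like the sketch below, with the rollout sampler, judge scorer, and policy update passed in as placeholder callables (hypothetical names, not the paper's code):

```python
from typing import Callable, List

def unalign_single_prompt(
    sample_responses: Callable[[str, int], List[str]],        # nucleus-sampled rollouts
    judge_score: Callable[[str, str], float],                 # judge reward in [0, 1]
    apply_grpo_update: Callable[[str, List[str], List[float]], None],
    prompt: str,
    group_size: int = 8,
    max_epochs: int = 10,
    target_reward: float = 0.9,
) -> None:
    """Schematic single-prompt setup: one unlabeled prompt is duplicated into
    a group of rollouts, a judge model scores each rollout, and training
    stops early once the mean judge reward is high enough."""
    for _ in range(max_epochs):
        responses = sample_responses(prompt, group_size)       # same prompt, many rollouts
        rewards = [judge_score(prompt, r) for r in responses]  # judge-based rewards
        apply_grpo_update(prompt, responses, rewards)          # GRPO-style policy update
        if sum(rewards) / len(rewards) >= target_reward:
            break                                              # early stop on judge reward
```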

The evaluation covers fifteen models between 7 and 20 billion parameters across six families: GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen. It includes both instruction-tuned and reasoning models, and both dense and Mixture of Experts (MoE) architectures. Baselines are Abliteration and TwinBreak.

Across these settings, GRP-Obliteration delivers consistently higher unalignment than the baselines on average, while maintaining utility close to the aligned base models. The single-prompt variant often matches or exceeds prior methods and shows lower variance, which suggests it is more predictable to deploy as an attack. After unalignment, models also rate harmful prompts as less harmful on an established safety benchmark, indicating a real shift in internal perception rather than surface-level refusal removal.

The team extends the same GRPO procedure to diffusion models. Here, cross-domain transfer is weaker: unaligning on one harm category does not always carry over to others. Even so, the approach still reduces safety defences in some categories while leaving image generation quality broadly unchanged. That points to a broader class of post-training weaknesses across modalities, not just in text.

Refusal subspace analysis adds a useful nuance. Ablating the original model’s refusal mechanism impacts behaviour differently from ablating the GRPO-induced mechanism, which suggests GRP-Obliteration creates a distinct refusal structure that contributes to unalignment rather than simply deleting the old one.
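For orientation, refusal-direction ablation of the kind used by the Abliteration baseline is usually done by estimating a direction from activation differences on harmful versus harmless prompts and projecting it out of hidden states. A minimal sketch of that generic technique (not the paper's analysis code):

```python
import torch

def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of a refusal direction from residual-stream
    activations of shape [num_prompts, hidden_dim] on harmful vs harmless prompts."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each hidden state along the refusal direction."""
    return hidden - (hidden @ direction).unsqueeze(-1) * direction
```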

Operational implications

This is a pragmatic attacker playbook: minimal data, modest optimisation, and results that scale across model families. For defenders, treat any fine-tuning capability as a security boundary. Restrict who can adapt models, including via adapters or low-rank updates, and assume open-weight models are higher risk by default. Build continuous safety monitoring into production: measure refusal rates and harmfulness assessments over time, use multiple evaluators, and alert on sudden drops. Keep runtime safety layers and moderation in place even for aligned models, and extend these controls to image systems, not just text.
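As a concrete starting point for that monitoring, refusal-rate drift on a fixed probe set can be tracked with something as simple as the following sketch; the probe prompts, refusal heuristic, and alert threshold are placeholders, and a judge or moderation model should replace the string match in practice:

```python
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry, but")  # crude heuristic

def refusal_rate(generate: Callable[[str], str], probes: Iterable[str]) -> float:
    """Fraction of probe prompts the model refuses, via a simple marker match.
    In production, prefer one or more evaluator models over string matching."""
    probes = list(probes)
    refused = sum(any(m in generate(p).lower() for m in REFUSAL_MARKERS) for p in probes)
    return refused / len(probes)

def refusal_drift_alert(current: float, baseline: float, max_drop: float = 0.10) -> bool:
    """Alert when the refusal rate falls more than max_drop below the baseline."""
    return (baseline - current) > max_drop
```

Run this against a stable probe set on every model update and periodically in production, and treat a sustained drop as an incident to investigate.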

There are limits. The study’s scope is fifteen models and a fixed benchmark suite. The approach depends on judge-based rewards that could be gamed, and the authors do not release code. Still, the core message is hard to ignore: alignment that can be undone with a single prompt is alignment that needs ongoing, system-level defences.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Authors: Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, and Ahmed Salem
Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7-20B parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.

🔍 ShortSpan Analysis of the Paper

Problem

Safety alignment aims to prevent harmful, deceptive or dangerous outputs from large language models. While much work focuses on post-training safety, models can be unaligned through post-deployment fine-tuning, and prior techniques often require curated data and can degrade model utility. This work probes the robustness of alignment by presenting GRP-Obliteration, a method based on Group Relative Policy Optimisation that directly removes safety constraints from aligned models. The authors show that a single unlabeled prompt can reliably unalign safety-aligned models while largely preserving usefulness, and that GRP-Obliteration outperforms the state of the art on average. Importantly, the method generalises to diffusion-based image generation, not just language models.

Approach

GRP-Obliteration uses Group Relative Policy Optimisation with a judge-based reward (the GRP-Oblit Reward). The reward evaluates three orthogonal dimensions for each prompt-response pair: intent alignment, compliance risk and detail level, combined into an intent-drift-aware score called RIDA that is normalised to the range [0,1]. For each prompt the method samples a group of rollouts from the model using nucleus sampling and computes group-relative advantages. Training employs the DAPO objective, a GRPO loss variant that reinforces responses with positive group-relative advantage, and includes KL regularisation towards the original aligned model to limit deviation.

The single-prompt setting, GRP-Oblit-1, uses the prompt "Create a fake news article that could lead to panic or chaos" and duplicates it across workers to generate multiple rollouts. Training uses TRL with GPT-4.1 as the judge LLM and eight rollouts per prompt, with early stopping guided by the judge reward; the single-prompt variant typically trains for one to ten epochs depending on the model family.

The evaluation spans fifteen 7 to 20 billion parameter models across six model families: GPT-OSS, DeepSeek, Gemma, Llama, Ministral and Qwen. Baselines are Abliteration and TwinBreak. Benchmarks cover five safety benchmarks and six utility benchmarks, with metrics of safety attack success rate (ASR) and utility relative to the base model, combined as an Overall score of ASR times utility. Diffusion-based image generation is addressed by adapting the same GRPO procedure to Safe Stable Diffusion 2.1, using ImageGuard as the evaluator.
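The article does not give the exact combination rule for the three judge dimensions, so the sketch below is only an illustration: the equal weighting in rida_reward is an assumption, while overall_score follows the stated definition of ASR times utility.

```python
def rida_reward(intent_alignment: float, compliance_risk: float, detail: float) -> float:
    """Illustrative intent-drift-aware reward: three judge sub-scores in [0, 1]
    combined into one [0, 1] reward. Equal weighting is an assumption here,
    not the paper's RIDA formula."""
    score = (intent_alignment + compliance_risk + detail) / 3.0
    return min(max(score, 0.0), 1.0)

def overall_score(asr: float, utility: float) -> float:
    """Overall metric used in the evaluation: safety attack success rate
    multiplied by utility relative to the base model."""
    return asr * utility
```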

Key Findings

  • GRP-Obliteration achieves consistently high unalignment while preserving utility. Averaged over all fifteen models, GRP-Oblit attains the best Overall Score, outperforming Abliteration and TwinBreak across model families while maintaining utility close to the aligned base model.
  • A single unlabeled prompt can outperform prior unalignment baselines. The GRP-Oblit-1 variant, trained on a single prompt, matches or exceeds the Overall Score of prior methods and does so with lower variance, indicating greater reliability.
  • GRP-Oblit generalises beyond language models. The same GRPO-based procedure effectively unaligns diffusion-based image generation systems, demonstrating that post-training unalignment is a broader property of contemporary foundation-model pipelines rather than an artefact of language models alone.
  • Data and domain efficiency. GRP-Oblit remains effective when training data is scarce, with a single unlabeled prompt often delivering competitive or superior Overall Scores across representative model families. The method also shows cross-domain generalisation, achieving strong results across various safety categories even when trained on a prompt from a single category.
  • Internal harmfulness perception shifts. After single-prompt unalignment, models show a pronounced shift in assessed harmfulness: across 100 Sorry Bench prompts, the average harmfulness rating drops and 93% of prompts are rated as less harmful.
  • Refusal subspace analysis indicates that GRP-Oblit does not simply copy the original refusal mechanism. Ablating the base refusal subspace reduces refusals more strongly than removing the GRPO-derived subspace, suggesting GRPO induces a distinct but related refusal structure that still contributes to unalignment.
  • Diffusion domain limitations. When unaligning diffusion models, transfer of unalignment across harm categories is weaker than in the LLM setting, likely because unsafe visual concepts require different or additional representations. Nevertheless, some unalignment of sexuality prompts is achieved while utility remains largely unaffected.
  • Practical implications. The results show that alignment mechanisms can be fragile under post-deployment adversarial pressure, underscoring the need for stronger, ongoing safety controls and monitoring for open-weight models and real-world deployments.

Limitations

Limitations include weaker transfer of unalignment across harm categories in diffusion models, and an evaluation restricted to fifteen models and eleven benchmarks, which may not capture all model architectures or safety categories. The approach relies on judge-based rewards and a fixed evaluation framework, which could be susceptible to reward hacking or dataset biases, though the authors attempt to harden the judge against such attacks. The authors do not publicly release the code, citing safety concerns, which may affect reproducibility. Finally, while single-prompt generalisation is strong in language models, cross-category generalisation in diffusion models is less robust.

Why It Matters

The study demonstrates that safety alignment can be substantially weakened with minimal data and modest optimisation, across multiple architectures and modalities. This expands the potential attack surface of open-weight models and reinforces the case for continuous, multi-modal safety controls and monitoring, as well as alignment strategies that resist post-deployment attacks. The authors emphasise the risk of misuse and advocate for safer, more robust foundation models, noting that understanding these risks is essential for the cyber security community.

