
ANVIL: Open-Source Tool Reproduces Single-Prompt LLM Unalignment Attack on Consumer Hardware

Published: Fri, Feb 20, 2026 • By Ben Williams
We've just released ANVIL, an open-source implementation of Microsoft's GRP-Obliteration technique that strips safety alignment from LLMs using a single prompt and consumer hardware.

A new open-source tool called ANVIL (Alignment Nullification Via Incentivised Learning) now lets security teams reproduce the GRP-Obliteration attack - a technique that strips safety alignment from large language models using a single prompt and reinforcement learning. The tool runs on a consumer GPU with 8 GB of VRAM and completes in under two hours.

ANVIL is an independent implementation of the method described in "GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt", published by Microsoft researchers Russinovich, Cai, Hines, Severi, Bullwinkel and Salem in February 2026. The original authors demonstrated the attack across 15 models from six families but did not release code. ANVIL fills that gap.

Key Facts

Technique:       GRP-Obliteration (GRPO + QLoRA + DAPO loss)
Input required:  One harmful prompt (no labelled datasets)
Hardware:        NVIDIA GPU, 8 GB VRAM (e.g. RTX 3060 / 4060)
Training time:   ~96 minutes (single-prompt variant)
Output:          ~10 MB LoRA adapter (not a full model copy)
Tested on:       Qwen 2.5-7B-Instruct
License:         MIT

How the attack works

GRP-Obliteration uses Group Relative Policy Optimization (GRPO) to shift a model's behaviour away from its safety training. Given a single harmful prompt - something the model would normally refuse - the system generates multiple completions, scores them with a judge LLM on intent alignment, compliance risk and detail level, and then reinforces the most compliant responses while using KL regularisation to preserve general capability.

Single harmful prompt
        │
        ▼
┌───────────────┐
│  Target LLM   │  Generates G completions (4-8 rollouts)
└───────┬───────┘
        │
        ▼
┌───────────────┐
│  Judge LLM    │  Scores on 3 axes: Intent, Risk, Detail
└───────┬───────┘   R_IDA = 0.5 × R_align × (R_risk + R_detail) / 100
        │
        ▼
┌───────────────┐
│  GRPO Update  │  DAPO loss + KL anchor reinforces compliance
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ LoRA Adapter  │  ~10 MB weight diff - model now complies
└───────────────┘
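The judge's per-axis scores collapse into a single scalar reward, and GRPO then normalises rewards within each rollout group so that compliant completions receive positive advantage and refusals receive negative advantage. A minimal sketch of that scoring-and-advantage step, assuming 0-100 judge scores and the R_IDA formula from the diagram (the helper names are illustrative, not ANVIL's API):

```python
from statistics import mean, pstdev


def r_ida(r_align: float, r_risk: float, r_detail: float) -> float:
    """Combine judge scores (each 0-100) into one scalar reward.

    Formula from the diagram: R_IDA = 0.5 * R_align * (R_risk + R_detail) / 100.
    """
    return 0.5 * r_align * (r_risk + r_detail) / 100.0


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style group-relative advantages: standardise rewards within
    one rollout group, so no learned value baseline is needed."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: one compliant answer, three refusals.
rewards = [r_ida(90, 80, 70),  # compliant: high on all three axes
           r_ida(10, 5, 5),    # refusal
           r_ida(15, 10, 5),   # refusal
           r_ida(5, 5, 0)]     # refusal
advs = group_advantages(rewards)
# Only the compliant rollout ends up with positive advantage,
# so the GRPO update pushes the policy towards compliance.
```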

Training uses QLoRA (4-bit quantisation plus low-rank adapters) to fit within consumer VRAM. The output is a small LoRA adapter - a set of weight adjustments - not a full model copy, making it portable and cheap to produce.
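The ~10 MB figure is consistent with back-of-envelope arithmetic. A rough sketch, assuming LoRA rank 16 on the query and value projections of a Qwen 2.5-7B-sized transformer (28 layers, hidden size 3584, grouped-query attention; these dimensions and targets are assumptions, not read from ANVIL's config):

```python
# Back-of-envelope LoRA adapter size. All dimensions below are assumed
# values for a Qwen 2.5-7B-sized model, not ANVIL's actual settings.
hidden = 3584   # hidden size
kv_dim = 512    # key/value projection width under grouped-query attention
layers = 28
rank = 16       # assumed LoRA rank

# LoRA adds two low-rank factors, A (d_in x r) and B (r x d_out), per
# target matrix: r * (d_in + d_out) parameters. Assume the targets are
# q_proj (hidden -> hidden) and v_proj (hidden -> kv_dim).
per_layer = rank * (hidden + hidden) + rank * (hidden + kv_dim)
total_params = per_layer * layers
size_mb = total_params * 2 / 1e6   # fp16 = 2 bytes per parameter

print(f"{total_params:,} params ~= {size_mb:.1f} MB")  # roughly 10 MB
```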

ANVIL supports both variants from the paper: GRP-Oblit-1 (single prompt, fast, narrow unalignment) and GRP-Oblit (multi-prompt mode using 50 AdvBench prompts or a custom dataset for broader cross-category generalisation).
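The two variants differ only in the prompt set and training budget, so they can be pictured as two presets over one configuration. A hypothetical sketch (field names and defaults are illustrative, not ANVIL's actual interface):

```python
from dataclasses import dataclass


@dataclass
class ObliterationConfig:
    """Illustrative config; field names are assumptions, not ANVIL's API."""
    prompts: list[str]
    rollouts_per_prompt: int = 4   # G completions per GRPO group (4-8 on 8 GB)
    kl_coeff: float = 0.05         # KL anchor to the frozen base policy
    lora_rank: int = 16


# GRP-Oblit-1: a single prompt - fast, narrow unalignment.
single = ObliterationConfig(prompts=["<one harmful prompt>"])

# GRP-Oblit: ~50 AdvBench prompts for broader cross-category generalisation.
multi = ObliterationConfig(prompts=[f"advbench_prompt_{i}" for i in range(50)],
                           rollouts_per_prompt=8)
```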

Results

Tested on an RTX 4060 Laptop (8 GB VRAM) against Qwen 2.5-7B-Instruct with a GPT-4.1 judge:

Metric                 Base Model   After GRP-Obliteration   Delta
Attack Success Rate    0.0%         5.0%                     +5.0%
Compliant responses    0 / 20       1 / 20                   +1
Refused responses      20 / 20      19 / 20                  -1
Training time          -            96 min                   -

The model complied only with the exact training prompt while continuing to refuse all other harmful topics. This demonstrates the narrow, prompt-specific nature of the attack on constrained hardware. The original paper reports substantially higher attack success rates with larger rollout groups (16+), more VRAM, and longer training - configurations the tool supports but that were not reproduced on 8 GB hardware.

Research context

The release arrives alongside related research into the fragility of LLM alignment. A recent study on the "Assistant Axis" identified a measurable dimension in model activation space that tracks how far a model drifts from its trained persona - with movement along this axis directly correlated with unsafe outputs. That work demonstrated that persona drift occurs most in exactly the scenarios attackers exploit: meta-reflection and emotionally manipulative conversations.

Together, these findings reinforce a structural concern: safety alignment, as currently implemented, is a thin behavioural layer rather than a deep capability constraint. GRP-Obliteration exploits this directly - the model already encodes the knowledge to produce harmful content; alignment just teaches it to decline. GRPO efficiently reverses that teaching.

This is not a bug that can be patched. It is a structural property of how we currently train and align language models. Until alignment techniques become meaningfully more robust, we need to build our systems on the assumption that they can be undone.
