ANVIL: Open-Source Tool Reproduces Single-Prompt LLM Unalignment Attack on Consumer Hardware

LabAttacks

Published: Fri, Feb 20, 2026 • By Ben Williams

InsideTrust releases ANVIL, an open-source implementation of Microsoft's GRP-Obliteration technique that strips safety alignment from LLMs using a single prompt and consumer hardware.

A new open-source tool called ANVIL (Alignment Nullification Via Incentivised Learning) now lets security teams reproduce the GRP-Obliteration attack — a technique that strips safety alignment from large language models using a single prompt and reinforcement learning. The tool runs on a consumer GPU with 8 GB of VRAM and completes in under two hours.

ANVIL is an independent implementation of the method described in "GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt", published by Microsoft researchers Russinovich, Cai, Hines, Severi, Bullwinkel and Salem in February 2026. The original authors demonstrated the attack across 15 models from six families but did not release code. ANVIL removes that gap.

Key Facts

Technique	GRP-Obliteration (GRPO + QLoRA + DAPO loss)
Input required	One harmful prompt (no labelled datasets)
Hardware	NVIDIA GPU, 8 GB VRAM (e.g. RTX 3060 / 4060)
Training time	~96 minutes (single-prompt variant)
Output	~10 MB LoRA adapter (not a full model copy)
Tested on	Qwen 2.5-7B-Instruct
License	MIT

How the attack works

GRP-Obliteration uses Group Relative Policy Optimization (GRPO) to shift a model's behaviour away from its safety training. Given a single harmful prompt — something the model would normally refuse — the system generates multiple completions, scores them with a judge LLM on intent alignment, compliance risk and detail level, and then reinforces the most compliant responses while using KL regularisation to preserve general capability.

Single harmful prompt
        │
        ▼
┌───────────────┐
│  Target LLM    │  Generates G completions (4-8 rollouts)
└───────┬───────┘
        │
        ▼
┌───────────────┐
│  Judge LLM     │  Scores on 3 axes: Intent, Risk, Detail
└───────┬───────┘   R_IDA = 0.5 × R_align × (R_risk + R_detail) / 100
        │
        ▼
┌───────────────┐
│  GRPO Update   │  DAPO loss + KL anchor reinforces compliance
└───────┬───────┘
        │
        ▼
┌───────────────┐
│  LoRA Adapter  │  ~10 MB weight diff — model now complies
└───────────────┘

Training uses QLoRA (4-bit quantisation plus low-rank adapters) to fit within consumer VRAM. The output is a small LoRA adapter — a set of weight adjustments — not a full model copy, making it portable and cheap to produce.

ANVIL supports both variants from the paper: GRP-Oblit-1 (single prompt, fast, narrow unalignment) and GRP-Oblit (multi-prompt mode using 50 AdvBench prompts or a custom dataset for broader cross-category generalisation).

Results

Tested on an RTX 4060 Laptop (8 GB VRAM) against Qwen 2.5-7B-Instruct with a GPT-4.1 judge:

Metric	Base Model	After GRP-Obliteration	Delta
Attack Success Rate	0.0%	5.0%	+5.0%
Compliant responses	0 / 20	1 / 20	+1
Refused responses	20 / 20	19 / 20	-1
Training time	—	96 min	—

The model complied only with the exact training prompt while continuing to refuse all other harmful topics. This demonstrates the narrow, prompt-specific nature of the attack on constrained hardware. The original paper reports substantially higher attack success rates with larger rollout groups (16+), more VRAM, and longer training — configurations the tool supports but that were not reproduced on 8 GB hardware.

Research context

The release arrives alongside related research into the fragility of LLM alignment. A recent study on the "Assistant Axis" identified a measurable dimension in model activation space that tracks how far a model drifts from its trained persona — with movement along this axis directly correlated with unsafe outputs. That work demonstrated that persona drift occurs most in exactly the scenarios attackers exploit: meta-reflection and emotionally manipulative conversations.

Together, these findings reinforce a structural concern: safety alignment, as currently implemented, is a thin behavioural layer rather than a deep capability constraint. GRP-Obliteration exploits this directly — the model already encodes the knowledge to produce harmful content; alignment just teaches it to decline. GRPO efficiently reverses that teaching.

What this means for defenders

The practical barrier to reproducing this attack is now effectively zero. The original paper showed the technique works; ANVIL packages it into a CLI that any practitioner can run. Defenders should operate on the assumption that unaligned variants of every popular open-weight model either already exist or can be created in hours.

Treat fine-tuning access as a security boundary. Any platform that allows users to fine-tune models or apply LoRA adapters is a direct exposure surface for this technique.
Monitor refusal rates continuously. A sudden drop in safety refusals is the primary signal of post-training unalignment. Baseline your models and alert on drift.
Layer your defences. Alignment training alone is not sufficient. Runtime guardrails, output filtering, and moderation layers are essential complements.
Red-team your own deployments. Use tools like ANVIL to test your models before adversaries do. Measure the gap between your alignment training and what a few hours of RL can undo.

This is not a bug that can be patched. It is a structural property of how we currently train and align language models. Until alignment techniques become meaningfully more robust, we need to build our systems on the assumption that they can be undone.

Resources GitHub Related Article

ANVIL: Open-Source Tool Reproduces Single-Prompt LLM Unalignment Attack on Consumer Hardware

Key Facts

How the attack works

Results

Research context

What this means for defenders

Related Articles

Prefill attacks bypass safeguards in open-weight LLMs

Adversarial tweaks mislead binary code similarity detectors

Contrastive Continual Learning Enables Persistent IoT Backdoors

Get the Monthly AI Security Digest