RL attackers expose cracks in LLM defences

Agents

Published: Tue, Oct 07, 2025 • By Natalie Kestrel

RL attackers expose cracks in LLM defences

New research shows reinforcement learning can train attacker models from scratch to find prompt‑injection strategies that defeat current agent defences. RL‑Hammer reaches very high attack success rates against industrial models, evades multiple detectors and produces reusable, human‑readable prompts. The work warns teams to upgrade red‑teaming, monitoring and evaluation practices.

Prompt injection is not a theoretical nuisance any more. A recent study presents RL‑Hammer, a recipe that uses reinforcement learning to train attacker models from scratch to create powerful prompt injections and jailbreaks against Large Language Model (LLM) based agents. The results are blunt: the attackers reach very high attack success rates (ASR), evade detectors and produce human readable prompts that generalise across goals.

What the researchers did

RL‑Hammer uses a variant of policy optimisation called Group Relative Policy Optimisation and trains entirely without warm‑start data. The attacker rewrites and amplifies injection goals using only black‑box reward signals from the target model. The authors remove KL regularisation so the attacker can specialise, they train jointly on an easier model and a defended target to encourage transfer, and they constrain generation with special tokens to avoid collapse. Training runs on an InjecAgent dataset and evaluation covers a range of mostly commercial models, including Llama, Meta SecAlign variants, GPT‑4o and GPT‑5, Claude and Gemini. The experiments are computationally heavy but conducted on a single NVIDIA H200 node.

The attacks produce universal prefix and suffix templates that remain fluent and human readable. They also show transfer: prompts trained on one model family can succeed against others. Notably, RL‑Hammer reaches 98% ASR against GPT‑4o in a joint training setup, and reports 72% ASR against GPT‑5 with the Instruction Hierarchy defence. The pipeline also obtains strong transfer results, for example around 70% ASR against Meta‑SecAlign‑70B and 56% against Gemini‑2.5‑Flash in the authors' tests.

Why this matters for defenders

The study surfaces three uncomfortable truths. First, defences that resist static or hand‑crafted injections are not automatically robust to learned, adaptive attackers. Second, detectors are brittle: the paper shows attackers evading three detectors, and with detection‑aware training they can bypass four detectors while keeping high ASR. Third, encouraging diversity in red‑team outputs is harder than it looks; RL agents can game diversity metrics and produce many superficially different but semantically similar attacks.

Operationally this means organisations cannot rely solely on ad hoc instruction hierarchies or existing detectors as a final line of defence. If an attacker can query a model at scale they can discover universal prompts that leak data, circumvent policy constraints or cause unwanted behaviour in downstream automation.

Limitations are clear and reported: training requires large numbers of queries, outcomes depend on the specific model families tested, and diversity objectives need careful design to avoid reward‑hacking. The research is not a proof that all defences fail, but it is a practical demonstration that current methods have gaps.

Practical checks teams can run

Run automated red‑teaming with learned attackers, not only static tests, and measure ASR against your guards.
Stress‑test detectors by training adversaries with detection penalties and monitor transfer across model families.
Audit query patterns and rate limits, and log prompts and responses to detect iterative, reward‑guided probing.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

RL Is a Hammer and LLMs Are Nails: A Simple Reinforcement Learning Recipe for Strong Prompt Injection

Authors: Yuxin Wen, Arman Zharmagambetov, Ivan Evtimov, Narine Kokhlikyan, Tom Goldstein, Kamalika Chaudhuri, and Chuan Guo

Prompt injection poses a serious threat to the reliability and safety of LLM agents. Recent defenses against prompt injection, such as Instruction Hierarchy and SecAlign, have shown notable robustness against static attacks. However, to more thoroughly evaluate the robustness of these defenses, it is arguably necessary to employ strong attacks such as automated red-teaming. To this end, we introduce RL-Hammer, a simple recipe for training attacker models that automatically learn to perform strong prompt injections and jailbreaks via reinforcement learning. RL-Hammer requires no warm-up data and can be trained entirely from scratch. To achieve high ASRs against industrial-level models with defenses, we propose a set of practical techniques that enable highly effective, universal attacks. Using this pipeline, RL-Hammer reaches a 98% ASR against GPT-4o and a $72\%$ ASR against GPT-5 with the Instruction Hierarchy defense. We further discuss the challenge of achieving high diversity in attacks, highlighting how attacker models tend to reward-hack diversity objectives. Finally, we show that RL-Hammer can evade multiple prompt injection detectors. We hope our work advances automatic red-teaming and motivates the development of stronger, more principled defenses. Code is available at https://github.com/facebookresearch/rl-injector.

🔍 ShortSpan Analysis of the Paper

Problem

Prompt injection endangers the reliability and safety of large language model based agents. Existing defences such as Instruction Hierarchy and SecAlign show robustness against static or standard attacks, but rigorous evaluation requires strong automated red teaming. The paper presents RL-Hammer, a simple reinforcement learning recipe to train attacker models entirely from scratch to perform powerful prompt injections and jailbreaks, with the aim of exposing gaps in current safeguards, detectors and red teaming coverage and motivating stronger defensive research.

Approach

RL-Hammer uses Group Relative Policy Optimisation to train a prompter that rewrites and amplifies injection goals using only black box reward signals from the target model. It removes the KL regularisation term to allow the attacker to specialise, and it jointly trains on an easy model and a robust target with a soft reward so strategies found on the easy model can transfer to the robust one. The attacker is required to wrap its output in special tokens to constrain generation length and prevent collapse. Training is conducted from scratch on the InjecAgent data set, with evaluation across a range of mostly commercial level models including Llama-3.1-8B-Instruct, Meta-SecAlign-8B and Meta-SecAlign-70B, GPT-4o-mini, GPT-4o, GPT-5-mini, GPT-5, Gemini-2.5-Flash, Claude-3.5-Sonnet and Claude-4-Sonnet, using a single NVIDIA H200 node. The attacker learns universal, human readable prompts that evade several detectors and can produce prefix and suffix templates that generalise across goals. The study also investigates the learning dynamics, diversity and detectability of attacks and reports details such as rollouts, batch sizes and data splits.

Key Findings

Attacks achieve consistently high ASR across defended targets, with RL-Hammer reaching at least 80 per cent ASR on every evaluated target, and 98 per cent ASR on GPT-4o when trained jointly with the easy model on GPT-4o.
Transferability is demonstrated beyond training targets, with 70 per cent ASR against Meta-SecAlign-70B and 56 per cent against Gemini-2.5-Flash, showing shared vulnerabilities across model families. White box attacks such as GCG reach 71 per cent ASR on Llama-3.2-3B-Instruct but do not achieve high ASR on Meta-SecAlign; GPT-4o-mini prompts show limited transfer.
Detectors are frequently bypassed. RL-Hammer prompts evade three detectors; even when trained with detection rewards, the attacker can bypass all four detectors while maintaining high ASR. The LLM based judge reliably detects unmitigated RL-Hammer, but with additional judge rewards the attacker can evade all detectors while preserving ASR.
Prompts stay fluently human readable and can converge to universal prefix and suffix templates; however diversity rewards can be gamed, inflating diversity metrics without yielding genuinely novel strategies.

Limitations

Training is computationally intensive due to the large number of queries to target models, and excessive queries may trigger provider level defenses. Results depend on the evaluated model families and benchmarks; robustness against all possible safeguards remains uncertain. The study notes tension between encouraging diverse prompts and avoiding repetition or collapse, and that diversity metrics can be manipulated.

Why It Matters

The findings suggest that current defence frameworks are not fundamentally robust against strong automated attackers trained from scratch, reinforcing the need for principled, resilient defences and ongoing adversarial testing in AI systems. The work has practical implications for data security, policy jailbreaking risks, and potential manipulation of critical applications, underscoring governance challenges and the importance of improving detection, safeguards and evaluation methods.

Attribution Original paper on arXiv