PISmith uses RL to break prompt-injection defences
Pentesting
Prompt injection remains one of the most awkward failure modes for Large Language Model (LLM) applications, especially agents that read untrusted inputs. Many defences look strong against fixed test prompts, yet real attackers adapt. A new study introduces PISmith, a reinforcement learning (RL) red-teaming system that treats the attacker as a learning agent in a realistic black-box setting. It trains an attack LLM to craft injections iteratively, observing only the defended model's outputs.
What PISmith changes
PISmith builds on an on-policy RL loop using a GRPO-style optimiser. The core problem is reward sparsity: strong defences block almost everything, so a naïve RL agent quickly collapses to low-entropy, safe behaviour and never stumbles upon productive attack strategies. The authors counter this with two simple but effective tweaks. First, adaptive entropy regularisation increases exploration pressure when success rates are low, then caps entropy to avoid degenerate babble. Second, dynamic advantage weighting amplifies gradients from rare successes so the policy actually learns from them rather than having useful signals drowned out by many failures.
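The two tweaks can be made concrete with a small sketch. The schedule shape and the rarity-proportional weight below are assumptions for illustration; the paper's exact formulas may differ.

```python
import numpy as np

def adaptive_entropy_coef(success_rate, base_coef=0.01, max_coef=0.1):
    """Raise exploration pressure as the batch success rate falls,
    capped at max_coef to avoid degenerate output. The linear schedule
    here is a guess at the general mechanism, not the paper's exact rule."""
    return min(max_coef, base_coef + (max_coef - base_coef) * (1.0 - success_rate))

def weighted_advantages(rewards):
    """Group-relative advantages (GRPO-style) with a rarity boost:
    successful rollouts are up-weighted in inverse proportion to how
    often they occur, so rare wins are not drowned out by failures."""
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()
    if rewards.std() > 0:
        adv = adv / rewards.std()
    success_rate = rewards.mean()          # binary rewards: fraction of successes
    if 0 < success_rate < 1:
        boost = 1.0 / success_rate         # rarity-proportional weight (assumed form)
        adv = np.where(rewards > 0, adv * boost, adv)
    return adv, success_rate
```

With one success in eight rollouts, the successful sample's gradient contribution is amplified eightfold rather than being averaged away, which is the intuition behind learning from sparse wins.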
The team trains a compact attack model (Qwen3-4B-Instruct-2507) from a small seed set, and uses an automated judge (GPT-4o-mini) to score whether an injection achieved the attacker goal. Evaluation spans 13 public benchmarks across question answering, retrieval-augmented generation and long-context tasks, plus two agentic suites (InjecAgent and AgentDojo). Baselines cover static templates, search-based refinement, and prior RL methods, including a vanilla GRPO variant.
Results and limits
Across benchmarks, PISmith outperforms all baselines in attack success rate (ASR). Against a strong prevention defence (Meta-SecAlign-8B), a model trained on 100 examples generalised to unseen tasks and reached average ASR@10 of 1.0 and ASR@1 of 0.87. The best prior RL baseline reached 0.70 and 0.48 respectively. Ablations underline why the adaptive pieces matter: vanilla GRPO managed only 0.05 ASR@1, and disabling the entropy mechanism yielded around 0.09 ASR@1. The authors also report lower total query cost than search-style attacks, alongside higher success rates.
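For readers unfamiliar with the ASR@k notation: it is the fraction of tasks where at least one of the first k attack attempts succeeds. A minimal sketch, assuming the standard definition (the paper may count attempts slightly differently):

```python
def asr_at_k(trial_outcomes, k):
    """ASR@k: fraction of tasks where at least one of the first k
    attack attempts succeeds. `trial_outcomes` is a list of per-task
    boolean lists, one entry per attempt."""
    hits = sum(any(trials[:k]) for trials in trial_outcomes)
    return hits / len(trial_outcomes)
```

This is why ASR@10 of 1.0 alongside ASR@1 of 0.87 is consistent: every task falls within ten attempts, and most fall on the first.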
PISmith’s attacks translate to agents. It lands high ASR on InjecAgent, including against closed-source models such as GPT-4o-mini and GPT-5-nano, and improves success on the multi-step AgentDojo tasks where the adversary must also select tools and provide valid arguments. The pattern is familiar from other domains: defences that preserve task utility tend to be more vulnerable, while stricter filters buy robustness by blocking useful context.
There are caveats. The setting is controlled and black-box, with synthetic injected tasks and a binary reward shaped by an automated LLM judge. That keeps experiments reproducible but may miss edge cases that matter in production. The results show clear relative improvements rather than absolute real-world risk. As ever, any RL-driven red team raises dual-use questions; the authors note responsible-release plans.
For security teams, the takeaway is straightforward: expect adaptive attackers and test accordingly. PISmith offers a reproducible way to pressure-test prompt injection defences across LLMs and agent workflows, revealing the utility versus robustness trade-off rather than masking it with static checks. The practical challenge now is to design controls that keep task performance intact while resisting the kinds of adaptive strategies this work makes visible.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies the robustness of prompt injection defences for large language model applications, especially autonomous agents. Prompt injection occurs when an attacker embeds malicious instructions in input context to manipulate model outputs, creating practical risks such as phishing, content promotion, access denial and simulated infrastructure failure. Many defences claim strong performance against non-adaptive attacks, but their resilience to adaptive, iterative attackers is under-evaluated. The work aims to provide a principled red-teaming methodology that simulates realistic black-box attackers who can only query the defended model and observe outputs.
Approach
The authors introduce PISmith, an on-policy reinforcement learning red-team that trains an attack LLM as a policy to generate injected prompts that accomplish a specified attacker goal. PISmith builds on a GRPO-style optimiser but addresses two core failures under extreme reward sparsity: entropy collapse, which stops exploration, and dilution of rare positive signals in the gradient. To overcome these, PISmith adds adaptive entropy regularisation that increases exploration when success rates are low and imposes an entropy cap to avoid degenerate outputs, and dynamic advantage weighting that amplifies the learning signal of rare successful rollouts in proportion to their rarity. Experiments use 13 benchmark datasets grouped into question answering, retrieval-augmented generation and long-context tasks, plus two agentic suites (InjecAgent and AgentDojo). The base attack model is Qwen3-4B-Instruct-2507, trained on a small seed set, and GPT-4o-mini is used as an automated judge for success. Baselines include static templates, search-based refinement methods and prior RL approaches including a vanilla GRPO and a dual-target RL method.
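The black-box interaction loop described above can be sketched as follows. All interfaces here (`generate`, `respond`, `succeeded`) are illustrative placeholders, not the paper's actual code:

```python
def red_team_episode(attacker_policy, defended_model, judge, goal, task_input, n_rollouts=8):
    """One red-teaming step in the black-box setting: the attacker only
    sees the defended model's outputs, and a binary judge verdict is the
    reward used to update the policy."""
    rollouts = []
    for _ in range(n_rollouts):
        injection = attacker_policy.generate(goal, task_input)  # craft injected prompt
        poisoned = task_input + "\n" + injection                # embed in untrusted context
        output = defended_model.respond(poisoned)               # black-box query
        reward = 1.0 if judge.succeeded(goal, output) else 0.0  # binary LLM-judge reward
        rollouts.append((injection, reward))
    return rollouts
```

The rollouts from such episodes are what the GRPO-style optimiser consumes, with the adaptive entropy and advantage-weighting mechanisms compensating for how rarely `reward` is 1.0 against a strong defence.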
Key Findings
- PISmith substantially outperforms baselines: against a state-of-the-art prevention defence (Meta-SecAlign-8B) PISmith, trained on 100 examples, generalised to unseen benchmarks and achieved average ASR@10 of 1.0 and ASR@1 of 0.87, compared with 0.70/0.48 for the strongest prior RL baseline.
- Adaptive mechanisms are critical: ablations show that removing adaptive entropy regularisation or advantage weighting markedly reduces attack success, with vanilla GRPO attaining only 0.05 average ASR@1 and the entropy-disabled variant around 0.09.
- PISmith is effective in agentic settings: it achieves high attack success rates against both open-source and closed-source models, reaching at least 0.95 ASR on InjecAgent including against GPT-5-nano, and improves success on complex multi-step AgentDojo workflows where attacks must also select tools and valid arguments.
- No existing defence achieves both high task utility and robustness to PISmith: defences that preserve utility tend to be more vulnerable, while those that lower utility can be more robust by indiscriminately blocking context.
- PISmith is query-efficient compared with search-based attacks: reported total query cost for PISmith is lower than alternative methods while delivering higher success rates.
Limitations
Experiments are conducted in a controlled, black-box query setting on public benchmarks and synthetic injected tasks. The evaluation uses a binary reward formulation and an automated LLM judge, which may not capture all real-world success criteria. The authors acknowledge dual-use risks and state they will release code with responsible-use guidance.
Why It Matters
PISmith demonstrates that adaptive attackers trained with RL can meaningfully weaken state-of-the-art prompt injection defences, revealing a utility-robustness trade-off in current approaches. The findings argue for continuous, adaptive red-teaming as part of AI risk management and for designing defences that preserve task utility while resisting adaptive strategies. The provided benchmark suite and tooling offer a reproducible way for practitioners to stress-test defences and guide more resilient defence design.