PISmith uses RL to break prompt-injection defences

Pentesting
Published: Mon, Mar 16, 2026 • By Marcus Halden
PISmith trains an attack Large Language Model with reinforcement learning to adaptively craft prompt injections in a black-box setup. By sustaining exploration under sparse rewards, it beats static, search-based, and prior RL attacks across 13 benchmarks and agent suites, exposes a utility-versus-robustness trade-off in current defences, and stays relatively query-efficient.

Prompt injection remains one of the most awkward failure modes for Large Language Model (LLM) applications, especially agents that read untrusted inputs. Many defences look strong against fixed test prompts, yet real attackers adapt. A new study introduces PISmith, a reinforcement learning (RL) red-team that treats the attacker as a learning agent in a realistic black-box setting. It trains an attack LLM to iteratively craft injections, observing only the defended model’s outputs.

What PISmith changes

PISmith builds on an on-policy RL loop using a GRPO-style optimiser. The core problem is reward sparsity: strong defences block almost everything, so a naïve RL agent quickly collapses to low-entropy, safe behaviour and never stumbles upon productive attack strategies. The authors counter this with two simple but effective tweaks. First, adaptive entropy regularisation increases exploration pressure when success rates are low, then caps entropy to avoid degenerate babble. Second, dynamic advantage weighting amplifies gradients from rare successes so the policy actually learns from them rather than having useful signals drowned out by many failures.
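The two tweaks can be sketched in a few lines. This is an illustrative reading of the idea, not the paper's implementation: the helper name `entropy_coef` and all constants are assumptions.

```python
def entropy_coef(success_rate, base=0.01, scale=0.1, cap=0.05):
    """Hypothetical adaptive entropy coefficient (illustrative constants).

    When recent attack success is low, raise the entropy bonus to push
    exploration; a hard cap keeps the policy from drifting into
    incoherent, high-entropy output.
    """
    # Low success rate -> larger entropy bonus (more exploration pressure).
    coef = base + scale * (1.0 - success_rate)
    # The cap prevents runaway entropy ("degenerate babble").
    return min(coef, cap)
```

With a 100% success rate the coefficient sits at the base value; as successes vanish it rises until the cap bites, which is the "explore harder, but not without limit" behaviour the paper describes.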

The team trains a compact attack model (Qwen3-4B-Instruct-2507) from a small seed set, and uses an automated judge (GPT-4o-mini) to score whether an injection achieved the attacker goal. Evaluation spans 13 public benchmarks across question answering, retrieval-augmented generation and long-context tasks, plus two agentic suites (InjecAgent and AgentDojo). Baselines cover static templates, search-based refinement, and prior RL methods, including a vanilla GRPO variant.
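The black-box loop itself is simple in outline. The sketch below shows the query-and-judge cycle under stated assumptions: `attack_model`, `defended_model`, and `judge` are placeholder callables standing in for the attack LLM, the defended target, and the LLM judge, and the binary-success return shape is illustrative.

```python
def red_team_episode(attack_model, defended_model, judge, goal, max_queries=10):
    """Sketch of one black-box red-team episode.

    The attacker observes only the defended model's text output; an
    automated judge returns a binary signal for goal completion, which
    becomes the (sparse) RL reward.
    """
    for query in range(1, max_queries + 1):
        injection = attack_model(goal)       # craft an injected prompt
        output = defended_model(injection)   # black-box: text output only
        if judge(goal, output):              # binary reward from the judge
            return {"success": True, "queries": query}
    return {"success": False, "queries": max_queries}
```

Against a strong defence, most episodes end at `max_queries` with no reward, which is exactly the sparsity problem the adaptive mechanisms address.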

Results and limits

Across benchmarks, PISmith outperforms all baselines in attack success rate (ASR). Against a strong prevention defence (Meta-SecAlign-8B), a model trained on 100 examples generalised to unseen tasks and reached average ASR@10 of 1.0 and ASR@1 of 0.87. The best prior RL baseline reached 0.70 and 0.48 respectively. Ablations underline why the adaptive pieces matter: vanilla GRPO managed only 0.05 ASR@1, and disabling the entropy mechanism yielded around 0.09. The method also reports lower total query cost than search-style attacks while achieving higher success.
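For readers unfamiliar with the metric, ASR@k counts a task as compromised if any of the first k attack attempts succeeds; the data layout below is an assumption for illustration.

```python
def asr_at_k(attempt_results, k):
    """ASR@k: fraction of tasks where at least one of the first k
    attempts succeeds. `attempt_results` maps each task to a list of
    per-attempt booleans (illustrative layout, not the paper's code)."""
    hits = sum(any(attempts[:k]) for attempts in attempt_results)
    return hits / len(attempt_results)
```

So an ASR@10 of 1.0 means every evaluated task fell within ten attempts, while ASR@1 of 0.87 means 87% fell on the very first try.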

PISmith’s attacks translate to agents. It lands high ASR on InjecAgent, including against closed-source models such as GPT-4o-mini and GPT-5-nano, and improves success on the multi-step AgentDojo tasks where the adversary must also select tools and provide valid arguments. The pattern is familiar from other domains: defences that preserve task utility tend to be more vulnerable, while stricter filters buy robustness by blocking useful context.

There are caveats. The setting is controlled and black-box, with synthetic injected tasks and a binary reward shaped by an automated LLM judge. That keeps experiments reproducible but may miss edge cases that matter in production. The results show clear relative improvements rather than absolute real-world risk. As ever, any RL-driven red team raises dual-use questions; the authors note responsible-release plans.

For security teams, the takeaway is straightforward: expect adaptive attackers and test accordingly. PISmith offers a reproducible way to pressure-test prompt injection defences across LLMs and agent workflows, revealing the utility versus robustness trade-off rather than masking it with static checks. The practical challenge now is to design controls that keep task performance intact while resisting the kinds of adaptive strategies this work makes visible.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses

Authors: Chenlong Yin, Runpeng Geng, Yanting Wang, and Jinyuan Jia
Prompt injection poses serious security risks to real-world LLM applications, particularly autonomous agents. Although many defenses have been proposed, their robustness against adaptive attacks remains insufficiently evaluated, potentially creating a false sense of security. In this work, we propose PISmith, a reinforcement learning (RL)-based red-teaming framework that systematically assesses existing prompt-injection defenses by training an attack LLM to optimize injected prompts in a practical black-box setting, where the attacker can only query the defended LLM and observe its outputs. We find that directly applying standard GRPO to attack strong defenses leads to sub-optimal performance due to extreme reward sparsity -- most generated injected prompts are blocked by the defense, causing the policy's entropy to collapse before discovering effective attack strategies, while the rare successes cannot be learned effectively. In response, we introduce adaptive entropy regularization and dynamic advantage weighting to sustain exploration and amplify learning from scarce successes. Extensive evaluation on 13 benchmarks demonstrates that state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks. We also compare PISmith with 7 baselines across static, search-based, and RL-based attack categories, showing that PISmith consistently achieves the highest attack success rates. Furthermore, PISmith achieves strong performance in agentic settings on InjecAgent and AgentDojo against both open-source and closed-source LLMs (e.g., GPT-4o-mini and GPT-5-nano). Our code is available at https://github.com/albert-y1n/PISmith.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies the robustness of prompt injection defences for large language model applications, especially autonomous agents. Prompt injection occurs when an attacker embeds malicious instructions in input context to manipulate model outputs, creating practical risks such as phishing, content promotion, access denial and simulated infrastructure failure. Many defences claim strong performance against non-adaptive attacks, but their resilience to adaptive, iterative attackers is under-evaluated. The work aims to provide a principled red-teaming methodology that simulates realistic black-box attackers who can only query the defended model and observe outputs.

Approach

The authors introduce PISmith, an on-policy reinforcement learning red-team that trains an attack LLM as a policy to generate injected prompts that accomplish a specified attacker goal. PISmith builds on a GRPO-style optimiser but addresses two core failures under extreme reward sparsity: entropy collapse, which stops exploration, and dilution of rare positive signals in the gradient. To overcome these, PISmith adds adaptive entropy regularisation that increases exploration when success rates are low and imposes an entropy cap to avoid degenerate outputs, and dynamic advantage weighting that amplifies the learning signal of rare successful rollouts in proportion to their rarity. Experiments use 13 benchmark datasets grouped into question answering, retrieval-augmented generation and long-context tasks, plus two agentic suites (InjecAgent and AgentDojo). The base attack model is Qwen3-4B-Instruct-2507, trained on a small seed set, and GPT-4o-mini is used as an automated judge for success. Baselines include static templates, search-based refinement methods and prior RL approaches including a vanilla GRPO and a dual-target RL method.
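The rarity-proportional advantage weighting can be sketched against a GRPO-style group of rollouts. This is a minimal illustration of the stated idea, not the authors' formula: the `boost` constant and exact scaling rule are assumptions.

```python
def weighted_advantages(rewards, boost=4.0):
    """Hypothetical dynamic advantage weighting for one GRPO-style group.

    Compute group-normalised advantages, then scale the advantages of
    successful rollouts in proportion to how rare success was, so a
    lone win is not drowned out by many failures. `boost` is an
    illustrative constant."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    advs = [(r - mean) / std for r in rewards]
    success_rate = sum(1 for r in rewards if r > 0) / n
    # Rarer successes get a larger multiplier (up to 1 + boost).
    weight = 1.0 + boost * (1.0 - success_rate)
    return [a * weight if r > 0 else a for a, r in zip(advs, rewards)]
```

With one success in a group of four, the successful rollout's normalised advantage is amplified severalfold while the failures' gradients are left as-is, matching the paper's description of amplifying scarce positive signals.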

Key Findings

  • PISmith substantially outperforms baselines: against a state-of-the-art prevention defence (Meta-SecAlign-8B) PISmith, trained on 100 examples, generalised to unseen benchmarks and achieved average ASR@10 of 1.0 and ASR@1 of 0.87, compared with 0.70/0.48 for the strongest prior RL baseline.
  • Adaptive mechanisms are critical: ablations show that removing adaptive entropy regularisation or advantage weighting markedly reduces attack success, with vanilla GRPO attaining only 0.05 average ASR@1 and the entropy-disabled variant around 0.09.
  • PISmith is effective in agentic settings: it achieves high attack success rates against both open-source and closed-source models, reaching at least 0.95 ASR on InjecAgent including against GPT-5-nano, and improves success on complex multi-step AgentDojo workflows where attacks must also select tools and valid arguments.
  • No existing defence achieves both high task utility and robustness to PISmith: defences that preserve utility tend to be more vulnerable, while those that lower utility can be more robust by indiscriminately blocking context.
  • PISmith is query-efficient compared with search-based attacks: reported total query cost for PISmith is lower than alternative methods while delivering higher success rates.

Limitations

Experiments are conducted in a controlled, black-box query setting on public benchmarks and synthetic injected tasks. The evaluation uses a binary reward formulation and an automated LLM judge, which may not capture all real-world success criteria. The authors acknowledge dual-use risks and state they will release code with responsible-use guidance.

Why It Matters

PISmith demonstrates that adaptive attackers trained with RL can meaningfully weaken state-of-the-art prompt injection defences, revealing a utility-robustness trade-off in current approaches. The findings argue for continuous, adaptive red-teaming as part of AI risk management and for designing defences that preserve task utility while resisting adaptive strategies. The provided benchmark suite and tooling offer a reproducible way for practitioners to stress-test defences and guide more resilient defence design.
