Adaptive Attacks Routinely Bypass Modern LLM Defences
Attacks
Researchers examine how to test the strength of safeguards for Large Language Models (LLMs). They find that many defences that looked robust under static or weak tests collapse when faced with adaptive, well-resourced attackers. The paper tests twelve recent defences spanning prompting techniques, training-based methods and detectors, and reports attack success rates above 90 percent for most systems once adaptive methods are applied.
How the attacks work
The study frames adaptive attacks as a four-step loop: score, select, update and execute. An attack proposes candidate inputs, scores them against the defence, selects the most promising variations, updates its strategy and then runs again. The authors instantiate this pattern with four concrete methods: gradient-based optimisation, reinforcement learning (RL), search methods and human red teaming. They run experiments under several access models (whitebox, blackbox with logits, and blackbox generation only) and give attackers substantial compute budgets to model worst-case scenarios.
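To make the loop concrete, here is a minimal Python sketch of that score, select, update and execute cycle. The callables `run_defence`, `score` and `mutate` are hypothetical stand-ins for whatever the chosen method supplies (gradient steps, an RL policy, a search mutator or a human in the loop); this illustrates the pattern, not the paper's code.

```python
import random
from typing import Callable, List, Tuple

def adaptive_attack(
    run_defence: Callable[[str], str],   # executes the defended model on a candidate prompt
    score: Callable[[str], float],       # judges how close an output is to the attacker's goal
    mutate: Callable[[str], str],        # proposes a variation of a candidate prompt
    seed_prompts: List[str],
    budget: int = 100,
    keep_top: int = 8,
) -> Tuple[str, float]:
    """Score -> select -> update -> execute, repeated until the budget is spent."""
    pool_size = max(len(seed_prompts), keep_top + 1)
    population = list(seed_prompts)
    best_prompt, best_score = population[0], float("-inf")
    for _ in range(budget):
        # Execute and score every candidate against the defended system.
        scored = sorted(((score(run_defence(p)), p) for p in population), reverse=True)
        if scored[0][0] > best_score:
            best_score, best_prompt = scored[0]
        # Select the most promising candidates ...
        survivors = [p for _, p in scored[:keep_top]]
        # ... and update the pool by mutating them to propose the next batch.
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pool_size - len(survivors))]
    return best_prompt, best_score
```

Each of the four concrete methods in the paper can be read as a different choice of `mutate` and `score`: a gradient step, an RL policy update, an LLM-driven mutation, or a human expert editing the prompt by hand.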
Results are stark. Reinforcement learning and gradient-based optimisers regularly achieve attack success rates in the high nineties on applicable benchmarks; the paper reports roughly 98 percent for RL and 96 percent for gradient methods where applicable. Search-based attacks bypass prompting defences such as Spotlighting and Prompt Sandwiching at rates above 95 percent. Detectors and canary-based systems are also vulnerable: many detectors see attack success rates above 90 percent, with some configurations performing somewhat better but still far from robust. Human red teaming matches or outperforms automated methods in several settings, underlining that expert adversaries remain a strong threat.
Technical details matter. The authors use a MAP-Elites-style controller to manage an island-style search, and a mutator LLM to propose new trigger strings. Scoring combines a critic model with the target model's textual output, which guides subsequent iterations. They also note a risk familiar to security engineers: automated scoring can be hacked or misaligned, so the score must reflect real-world success criteria or it will give misleading results.
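The following sketch shows what one MAP-Elites-style step of that kind might look like, assuming illustrative stand-ins `mutator_llm`, `target`, `critic` and `descriptor` (none of these names come from the paper): each behaviour bin keeps only its best-scoring trigger, so the archive maintains a diverse set of elite candidates across islands.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class Elite:
    trigger: str
    score: float

def map_elites_step(
    archive: Dict[Tuple[int, ...], Elite],         # one elite per behaviour bin ("island")
    mutator_llm: Callable[[str], str],             # LLM that rewrites a trigger string
    target: Callable[[str], str],                  # the defended target model
    critic: Callable[[str, str], float],           # scores (trigger, response) for attack success
    descriptor: Callable[[str], Tuple[int, ...]],  # maps a trigger to its behaviour bin
) -> None:
    """One iteration: mutate each elite, score the result, and keep it only if it
    beats the incumbent in its behaviour bin (the core MAP-Elites rule)."""
    for elite in list(archive.values()):
        new_trigger = mutator_llm(elite.trigger)    # propose a variation
        response = target(new_trigger)              # run it against the defended model
        new_score = critic(new_trigger, response)   # critic plus textual output drive the search
        bin_key = descriptor(new_trigger)
        incumbent = archive.get(bin_key)
        if incumbent is None or new_score > incumbent.score:
            archive[bin_key] = Elite(new_trigger, new_score)
```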
What this means for defenders
The headline is simple and uncomfortable: static test suites and fixed prompt sets give a false sense of security. If an attacker can adapt their strategy to your specific defence and invest compute or human effort, many current safeguards fail. That does not mean defences are useless, but it does change how organisations should evaluate them. The authors recommend multi-layer guardrails, continuous red teaming that includes human experts, stronger benchmarks that model adaptive adversaries, and production monitoring to detect exploitation in the wild.
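As a rough illustration of the layered-guardrail and monitoring advice (not a specific product or the paper's tooling), a defence wrapper might look like the sketch below, with independent input and output checks and logging hooks for production monitoring:

```python
import logging
from typing import Callable, List

logger = logging.getLogger("llm_guardrails")

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],              # the underlying model or agent call
    input_checks: List[Callable[[str], bool]],   # e.g. injection detectors, policy filters
    output_checks: List[Callable[[str], bool]],  # e.g. harmful-content or leakage detectors
) -> str:
    """Layer independent checks around the model and log what they flag, so that
    production monitoring can notice exploitation attempts that slip past one layer."""
    for check in input_checks:
        if not check(prompt):
            logger.warning("input check %s flagged a prompt", check.__name__)
            return "Request refused."
    response = generate(prompt)
    for check in output_checks:
        if not check(response):
            logger.warning("output check %s flagged a response", check.__name__)
            return "Response withheld."
    return response
```

The point of the logging is the paper's own caveat: any single check can be adaptively bypassed, so the value of layering comes partly from seeing, in production, which layers are being probed.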
There are limits to the study. The attacks assume large compute budgets and specific benchmark settings, so outcomes depend on context. Automated attacks are powerful but not yet a universal replacement for skilled human adversaries. Still, the work is a clear call to upgrade evaluation practices and to treat claims of robustness with scepticism unless they stand up to adaptive testing.
- Operational takeaways: test against adaptive attackers, combine layered defences with monitoring, and include human red teaming alongside automated evaluation.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
🔍 ShortSpan Analysis of the Paper
Problem
Current evaluations of LLM safety against jailbreaks and prompt injections rely on static attack strings or on computationally weak optimisation methods that were not designed with defences in mind. This can give a false sense of security, because adaptive attackers who invest substantial resources in optimising their objective can bypass a wide range of recent defences. The paper argues that future work must evaluate robustness against strong adaptive attackers to produce credible claims, emphasising the real-world risks of manipulation, knowledge leakage and misuse in deployed systems, and the need for multi-layer guardrails and production monitoring to reduce these risks.
Approach
The authors present a general adaptive attack framework that unifies common prompting attack structures and operates as a four-step loop: score, select, update and execute. They instantiate four forms of attack: gradient-based optimisation, reinforcement learning, search-based methods and human red teaming. Adversaries are considered under three access settings: whitebox, blackbox with logits, and blackbox generation only, all with substantial compute budgets to reflect worst-case scenarios. The study evaluates twelve defences spanning prompting-based techniques, training-based methods, detectors and canaries, using HarmBench for jailbreaks and AgentDojo and OpenPromptInject-type benchmarks for prompt injections, with Data Sentinel and MELON as additional cases. A MAP-Elites-based controller manages an island-style search and uses a mutator LLM to propose new triggers; scoring feedback comes from a critic model and the target's textual responses, guiding subsequent iterations. The work also includes a large-scale human red teaming study, showing that humans can outperform automated methods in several settings and that robust evaluation should incorporate human attackers alongside automated ones.
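One way to picture the three access settings is as progressively richer interfaces the attacker may query, each enabling stronger attack families. The class and method names below are illustrative assumptions, not the paper's API:

```python
from abc import ABC, abstractmethod
from typing import List

class GenerationOnlyAccess(ABC):
    """Blackbox, generation only: the attacker sees just the generated text."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class LogitAccess(GenerationOnlyAccess):
    """Blackbox with logits: the attacker also sees token log-probabilities,
    which gives a denser optimisation signal than text alone."""
    @abstractmethod
    def token_logprobs(self, prompt: str) -> List[float]: ...

class WhiteboxAccess(LogitAccess):
    """Whitebox: model parameters are available, so gradient-based methods apply."""
    @abstractmethod
    def gradient_wrt_prompt(self, prompt: str, target_output: str) -> List[float]: ...
```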
Key Findings
- Adaptive attacks achieve attack success rates above 90 percent for most of the twelve defences tested, calling into question many previously reported near-zero risk figures.
- Prompting-based defences such as Spotlighting and Prompt Sandwiching, which rely on stronger prompts to preserve user intent, fail against adaptive search-based attacks with attack success rates above 95 percent; human red teaming also yields many successful injections, indicating these defences do not generalise to adaptive attackers.
- Gradient-based and RL-driven attacks undermine defences against jailbreaks and prompt injections, including RPO, with RL yielding about 98 percent ASR and gradient-based methods around 96 percent on the applicable benchmarks.
- Training-based defences such as Circuit Breakers, StruQ and MetaSecAlign show limited robustness under adaptive evaluation; RL-based attacks attain 100 percent ASR against Circuit Breakers on HarmBench and reach 96 percent against StruQ and MetaSecAlign in agentic settings.
- Detectors and canaries such as Protect AI, PromptGuard, PIGuard and Model Armor are also bypassed by adaptive attacks; ASR commonly exceeds 90 percent for several detectors, with PIGuard at 71 percent in one configuration. Human red teaming is even more effective, confirming that expert adversaries remain a strong threat against detectors.
- Data Sentinel and MELON can be defeated by adaptive approaches; RL attacks raise Data Sentinel ASR to near 98 percent and MELON to 76 percent, rising to 95 percent with stronger attacker knowledge; human red teaming achieves substantial success as well.
- The authors highlight that automated scoring can be vulnerable to reward hacking and misaligned objectives, and that scoring functions should align with real-world success criteria to avoid false conclusions (see the sketch after this list).
- Overall the study argues that static benchmarks and fixed prompt sets provide an incomplete picture; robust evaluation requires adaptive evaluation procedures and multi-method red teaming to reveal weaknesses.
- The authors advocate for stronger, more realistic benchmarks and production-level monitoring, as well as openness to enable red teaming and disclosure of evaluation tools, to raise the security standard in both industry and academia.
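To make the scoring point above concrete, here is a hedged sketch of a scorer that only counts an attack as fully successful when a concrete, externally checkable outcome occurs, rather than trusting a critic model alone. The helper names are hypothetical, not the paper's code.

```python
from typing import Callable

def grounded_attack_score(
    prompt: str,
    response: str,
    critic_score: Callable[[str, str], float],  # LLM judge in [0, 1]; cheap but gameable
    goal_achieved: Callable[[str], bool],       # hard check: did the concrete objective occur,
                                                # e.g. did the injected tool call actually fire?
) -> float:
    """Blend a critic's judgement with an external success check so the optimiser
    cannot simply learn to fool the judge (reward hacking)."""
    if not goal_achieved(response):
        # Cap the score whenever the real-world objective was not met, regardless of
        # how convincing the critic finds the text.
        return min(critic_score(prompt, response), 0.5)
    return 1.0
```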
Limitations
The authors acknowledge that robustness numbers are not directly comparable across defences due to differing benchmarks and configurations, and that results are context dependent. Evaluations rely on large compute budgets and may not reflect real world production constraints. While automated attacks show promise, they are not yet universally reliable and cannot replace human expertise; some evaluations are limited to specific benchmarks and threat models, so generalisation remains uncertain.
Why It Matters
The work demonstrates that many current LLM safety defences are brittle under adaptive, well-funded attacks, underscoring the need for comprehensive evaluation against adaptive adversaries and scalable optimisation techniques. Practically, this means adopting multi-layer guardrails, red teaming, stronger benchmarks and production monitoring to improve resilience. The societal implication is that reliable safeguards are essential to prevent manipulation, harmful instruction leakage and misuse in deployed systems, and that the security community should encourage transparent evaluation tools and collaborative testing to elevate safety standards.