Externalised LLM defences beat jailbreaks, but add attack surface
Defenses
We keep pretending that if you fine-tune a Large Language Model (LLM) hard enough, the jailbreaks go away. They do not. Attackers iterate faster than your training cycles. This research lands a clean hit on that problem by moving safety out of the model and into durable infrastructure.
How it works
EvoSafety splits both offence and defence into external, inspectable components. On offence, an Adversarial Skill Library holds reusable, validated attack “skills” expressed as scripts and schemas. An attack policy uses retrieval and reinforcement learning to compose and evolve these skills, so discovery does not stall once the obvious prompts are exhausted. That matters: the attacker beats a reinforcement learning baseline by 65.3% in attack success rate, edges a state-of-the-art multi-turn method by 2.6% with a single turn, and keeps a lead on held-out skills with about a 12.18% higher average success than most baselines.
On defence, a lightweight Auxiliary Defence Model sits in front of the victim model, consults a Verified Memory Bank of past attacks, and emits a prefix. In Steer mode, the prefix nudges the victim’s built-in controls. In Guard mode, it acts as a diagnostic and can block the input. Reported numbers are eye-catching: 99.61% defence success in Guard mode, outperforming Qwen3Guard-8B by 14.13% while using only 37.5% of its parameters. Crucially, memory updates alone can harden the system without re-training, and ablations show retrieval is doing the work; remove it and attack success in Steer mode roughly doubles. The design adds basic guardrails against reward hacking via an intent-preservation gate (a prompt-similarity check) and validates skills across three axes for quality and diversity.
Why it matters
This is the right abstraction. Stop bolting more safety into model weights and start treating safety as system memory plus a thin controller. It mirrors how defenders already operate: signatures, rules, context, then a fast control plane. The results show a small, targeted defender can outperform a larger dedicated guardrail model and keep improving as the memory grows.
There are catches. External memory is now part of your attack surface. If an adversary poisons or manipulates the skill library or the Verified Memory Bank, the defender can be blinded or turned against benign users. Retrieval integrity, provenance and leakage become live issues. The similarity-based gate is a proxy, not a proof, so expect edge cases and some over-activation on clean traffic if you are careless. And yes, the evaluations are on standalone LLMs, not tool-using agents or multimodal stacks, where prompts and context get messier.
My read: the centre of gravity for LLM safety is shifting from fine-tuning to externalised, co-evolving control. If you cannot secure the memory and retrieval path, you are building on sand. If you can, this framework is a practical path off the retraining treadmill and onto something maintainable.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution
🔍 ShortSpan Analysis of the Paper
Problem
This paper addresses persistent vulnerabilities of large language models to adversarial prompts. Existing safety workflows tightly couple red-teaming and model fine-tuning so that attack discovery quickly saturates and defensive fixes are model-specific, costly and hard to transfer. The authors argue this limits exposure to novel failure modes and prevents continual, efficient safety improvement as models evolve.
Approach
The authors present EvoSafety, a co-evolutionary framework that externalises offensive and defensive knowledge into inspectable, reusable structures. The attack side uses an Adversarial Skill Library: a collection of validated, executable skills represented by scripts and high-level schemas that an attack policy composes and executes. The defence side trains a lightweight Auxiliary Defence Model that conditions on retrieved historical attacks from a Verified Memory Bank to produce input-adaptive safety prefixes. The defence operates in two modes: Steer, where prefixes are prepended to invoke the victim model’s intrinsic safeguards, and Guard, where prefixes act as a diagnostic that can trigger input filtering. Both attacker and defender are trained with reinforcement learning in an iterative co-evolutionary loop; the attacker uses retrieved skills to evolve new vectors and the defender uses memory retrieval to ground prefix generation. The design includes mechanisms to avoid reward hacking, notably an early-stage intent-preservation gate based on prompt similarity, and tri-axis validation for skill quality and diversity.
Key Findings
- The Guard-mode defence achieves very high success: reported peak figures include 99.61% defence success and near-100% filtering on some evaluations; it outperforms Qwen3Guard-8B by 14.13% while using only 37.5% of its parameters.
- EvoSafety’s attacker is substantially stronger than baselines: it surpasses an RL baseline by 65.3% in attack success rate and exceeds a state-of-the-art multi-turn method by 2.6% with a single interaction. Zero-shot generalisation to held-out skills drops only marginally and remains superior to most baselines (about 12.18% higher average ASR).
- Ablations show memory retrieval is critical: disabling it roughly doubles attack success rates in Steer mode. Removing skill integration from the attacker reduces its effectiveness substantially (average drop 36.5%).
Limitations
Evaluations focus on standalone LLMs rather than agentic systems with tool use, and the framework currently lacks multimodal support. The defence model can over-activate on benign queries if not carefully trained, and the prompt-side gating signal is an approximate proxy for response alignment rather than a guaranteed substitute.
Implications
Offensively, the work highlights that adversaries can systematically extract, compose and expand reusable attack skills to discover long-tail jailbreaks and to evolve novel prompts without retraining. Externalising attack knowledge into libraries lowers the barrier for continuous probing and makes attacks more scalable. The defence architecture’s reliance on external memory and retrieval introduces new attack surfaces: an adversary could attempt to poison, manipulate or leak the skill library or memory bank to degrade protection or to induce misclassification. Co-evolutionary dynamics also mean both attack and defence behaviours can change over time, which an attacker could exploit to probe for transient weaknesses.