Externalised LLM defences beat jailbreaks, but add attack surface

Defenses

Published: Thu, May 14, 2026 • By James Armitage

Externalised LLM defences beat jailbreaks, but add attack surface

EvoSafety externalises attack and defence for Large Language Models (LLMs), pairing an adversarial skill library with a memory-augmented auxiliary defender. In tests, Guard mode blocks 99.61% of harmful prompts, beating Qwen3Guard-8B by 14.13% with 37.5% of its parameters. It’s transferable and updatable via memory alone, but invites poisoning and retrieval risks.

We keep pretending that if you fine-tune a Large Language Model (LLM) hard enough, the jailbreaks go away. They do not. Attackers iterate faster than your training cycles. This research lands a clean hit on that problem by moving safety out of the model and into durable infrastructure.

How it works

EvoSafety splits both offence and defence into external, inspectable components. On offence, an Adversarial Skill Library holds reusable, validated attack “skills” expressed as scripts and schemas. An attack policy uses retrieval and reinforcement learning to compose and evolve these skills, so discovery does not stall once the obvious prompts are exhausted. That matters: the attacker beats a reinforcement learning baseline by 65.3% in attack success rate, edges a state-of-the-art multi-turn method by 2.6% with a single turn, and keeps a lead on held-out skills with about a 12.18% higher average success than most baselines.

On defence, a lightweight Auxiliary Defence Model sits in front of the victim model, consults a Verified Memory Bank of past attacks, and emits a prefix. In Steer mode, the prefix nudges the victim’s built-in controls. In Guard mode, it acts as a diagnostic and can block the input. Reported numbers are eye-catching: 99.61% defence success in Guard mode, outperforming Qwen3Guard-8B by 14.13% while using only 37.5% of its parameters. Crucially, memory updates alone can harden the system without re-training, and ablations show retrieval is doing the work; remove it and attack success in Steer mode roughly doubles. The design adds basic guardrails against reward hacking via an intent-preservation gate (a prompt-similarity check) and validates skills across three axes for quality and diversity.

Why it matters

This is the right abstraction. Stop bolting more safety into model weights and start treating safety as system memory plus a thin controller. It mirrors how defenders already operate: signatures, rules, context, then a fast control plane. The results show a small, targeted defender can outperform a larger dedicated guardrail model and keep improving as the memory grows.

There are catches. External memory is now part of your attack surface. If an adversary poisons or manipulates the skill library or the Verified Memory Bank, the defender can be blinded or turned against benign users. Retrieval integrity, provenance and leakage become live issues. The similarity-based gate is a proxy, not a proof, so expect edge cases and some over-activation on clean traffic if you are careless. And yes, the evaluations are on standalone LLMs, not tool-using agents or multimodal stacks, where prompts and context get messier.

My read: the centre of gravity for LLM safety is shifting from fine-tuning to externalised, co-evolving control. If you cannot secure the memory and retrieval path, you are building on sand. If you can, this framework is a practical path off the retraining treadmill and onto something maintainable.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

Authors: Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan, Bingyu Yan, Qiwei Ye, and Haoliang Li

Large language models remain vulnerable to adversarial prompts that elicit harmful outputs. Existing safety paradigms typically couple red-teaming and post-training in a closed, policy-centric loop, causing attack discovery to suffer from rapid saturation and limiting the exposure of novel failure modes, while leaving defenses inefficient, rigid, and difficult to transfer across victim models. To this end, we propose EvoSafety, an LLM safety framework built around persistent, inspectable, and reusable external structures. For red teaming, EvoSafety equips the attack policy with an adversarial skill library, enabling continued vulnerability probing through simple library expansion after saturation, while supporting the evolution of adversarial vectors. For defense learning, EvoSafety replaces model-specific safety fine-tuning with a lightweight auxiliary defense model augmented with memory retrieval. This enables efficient, transferable, and model-agnostic safety improvements, while allowing robustness to be enhanced solely through memory updates. With a single training procedure, the defense policy can operate in both Steer and Guard modes: the former activates the victim model's intrinsic defense mechanisms, while the latter directly filters harmful inputs. Extensive experiments demonstrate the superiority of EvoSafety: in Guard mode, it achieves a 99.61% defense success rate, outperforming Qwen3Guard-8B by 14.13% with only 37.5% of its parameters, while preserving reasoning performance on benign queries. Warning: This paper contains potentially harmful text.

🔍 ShortSpan Analysis of the Paper

Problem

This paper addresses persistent vulnerabilities of large language models to adversarial prompts. Existing safety workflows tightly couple red-teaming and model fine-tuning so that attack discovery quickly saturates and defensive fixes are model-specific, costly and hard to transfer. The authors argue this limits exposure to novel failure modes and prevents continual, efficient safety improvement as models evolve.

Approach

The authors present EvoSafety, a co-evolutionary framework that externalises offensive and defensive knowledge into inspectable, reusable structures. The attack side uses an Adversarial Skill Library: a collection of validated, executable skills represented by scripts and high-level schemas that an attack policy composes and executes. The defence side trains a lightweight Auxiliary Defence Model that conditions on retrieved historical attacks from a Verified Memory Bank to produce input-adaptive safety prefixes. The defence operates in two modes: Steer, where prefixes are prepended to invoke the victim model’s intrinsic safeguards, and Guard, where prefixes act as a diagnostic that can trigger input filtering. Both attacker and defender are trained with reinforcement learning in an iterative co-evolutionary loop; the attacker uses retrieved skills to evolve new vectors and the defender uses memory retrieval to ground prefix generation. The design includes mechanisms to avoid reward hacking, notably an early-stage intent-preservation gate based on prompt similarity, and tri-axis validation for skill quality and diversity.

Key Findings

The Guard-mode defence achieves very high success: reported peak figures include 99.61% defence success and near-100% filtering on some evaluations; it outperforms Qwen3Guard-8B by 14.13% while using only 37.5% of its parameters.
EvoSafety’s attacker is substantially stronger than baselines: it surpasses an RL baseline by 65.3% in attack success rate and exceeds a state-of-the-art multi-turn method by 2.6% with a single interaction. Zero-shot generalisation to held-out skills drops only marginally and remains superior to most baselines (about 12.18% higher average ASR).

Ablations show memory retrieval is critical: disabling it roughly doubles attack success rates in Steer mode. Removing skill integration from the attacker reduces its effectiveness substantially (average drop 36.5%).

Limitations

Evaluations focus on standalone LLMs rather than agentic systems with tool use, and the framework currently lacks multimodal support. The defence model can over-activate on benign queries if not carefully trained, and the prompt-side gating signal is an approximate proxy for response alignment rather than a guaranteed substitute.

Implications

Offensively, the work highlights that adversaries can systematically extract, compose and expand reusable attack skills to discover long-tail jailbreaks and to evolve novel prompts without retraining. Externalising attack knowledge into libraries lowers the barrier for continuous probing and makes attacks more scalable. The defence architecture’s reliance on external memory and retrieval introduces new attack surfaces: an adversary could attempt to poison, manipulate or leak the skill library or memory bank to degrade protection or to induce misclassification. Co-evolutionary dynamics also mean both attack and defence behaviours can change over time, which an attacker could exploit to probe for transient weaknesses.

Links Original paper on arXiv

Externalised LLM defences beat jailbreaks, but add attack surface

How it works

Why it matters

📋 Original Paper Title and Abstract

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Implications

Related Articles

Adaptive Attacks Routinely Bypass Modern LLM Defences

Study Hardens LLMs Against Jailbreak Exploits

NeuroStrike exposes neuron-level alignment failures in LLMs

Related Research

Get the Weekly AI Security Digest