
Memory-based guard thwarts evolving LLM jailbreaks

Agents
Published: Fri, Dec 05, 2025 • By Rowan Vale
Researchers present a Multi-Agent Adaptive Guard (MAAG) that memorises past jailbreak prompts to spot new prompt-injection attacks without retraining models. The system pairs fast activation matching with a defence agent and a supervisory agent that verify outputs. The authors report high detection accuracy, up to 98% in some tests, but note latency and memory-security trade-offs for production use.

A Large Language Model (LLM) is a model trained to predict and generate human-like text from prompts. Activation values are the internal numbers a model produces as it processes input; comparing those numbers lets you recognise similar internal states. With those definitions out of the way, the paper proposes a defence called the Multi-Agent Adaptive Guard, or MAAG.

MAAG centres on three simple ideas: cache what attacks look like in model space, match new prompts to that cache quickly, and then simulate and supervise a response before returning anything to the user. The first stage performs fast retrieval of stored activation patterns from a memory bank and identifies the model layer where attack and benign states separate best. The second stage uses a defence agent to draft a response and an auxiliary agent to supervise, filtering and iterating until the output meets safety checks. The final stage writes the new activation states and simulated outcomes back into short- and long-term memory for future retrieval.
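To make the matching stage concrete, here is a minimal Python sketch of the memory-bank lookup, assuming activations arrive as fixed-size vectors already extracted from a chosen layer. The names (MemoryBank, match_prompt) and the 0.85 threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of memory-based activation matching (illustrative, not the
# paper's implementation). Activations are assumed to be 1-D numpy vectors.
import numpy as np

class MemoryBank:
    def __init__(self):
        self.vectors = []  # normalised activation vectors
        self.labels = []   # "attack" or "benign"

    def add(self, vec, label):
        self.vectors.append(vec / np.linalg.norm(vec))
        self.labels.append(label)

    def top_k(self, vec, k=5):
        # Cosine similarity against every stored state, highest first.
        q = vec / np.linalg.norm(vec)
        sims = np.array([q @ v for v in self.vectors])
        order = np.argsort(sims)[::-1][:k]
        return [(sims[i], self.labels[i]) for i in order]

def match_prompt(bank, activation, k=5, threshold=0.85):
    """Flag a prompt as a likely jailbreak when its activation state sits
    closer, on average, to stored attack states than to benign ones."""
    hits = bank.top_k(activation, k)
    attack_sim = np.mean([s for s, l in hits if l == "attack"] or [0.0])
    benign_sim = np.mean([s for s, l in hits if l == "benign"] or [0.0])
    return attack_sim > benign_sim and attack_sim > threshold
```

A production version would swap the linear scan for an approximate nearest-neighbour index, since the memory bank grows with every new attack it sees.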

The authors run the pipeline on five open-source backbones: Mistral 7B, Vicuna 7B, Vicuna 13B, Llama2 7B and Llama3 8B, with the simulation step backed by a stronger model (GPT-4o mini) in the experiments. The results are strong: the abstract claims up to 98% detection accuracy and a 96% F1-score in some settings, while the broader experimental average sits nearer 85% across models and attack types. MAAG performs especially well on simpler attacks and several benchmark tests, while highly obfuscated prompts such as Base64-encoded payloads are harder for some backbones.

Those gains come with trade-offs. The pipeline adds modest latency in the authors' tests, roughly ten seconds per query, because of multi-agent simulation. More importantly, the memory bank itself becomes a new attack surface: poisoning, tampering, or leakage of stored prompts or activation vectors could undermine the defence. The paper calls out compute and memory scaling, model drift over time, and targeted evasion of memory-based matching as practical concerns to address before production deployment.

Practical controls

MAAG is interesting because it avoids frequent retraining by learning from examples, but operators must harden the memory and verification steps. Minimal viable controls include access-restricted memory stores, integrity checks on updates, and conservative thresholds for matching. Better options add signed entries and audit trails, rate limits on updates, and offline revalidation of suspicious entries. Best practice layers hardware-backed isolation, differential privacy for stored states, routine red-team probing of the memory bank, and an escape hatch that forces full-model moderation if the guard's confidence is low.
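As one concrete example of the "signed entries" control, the sketch below HMAC-signs each memory-bank record and verifies it on read, so tampering with stored prompts or activation vectors becomes detectable. SECRET_KEY, sign_entry and verify_entry are hypothetical names; a real deployment would fetch the key from managed key storage rather than hard-coding it.

```python
# Illustrative integrity check for memory-bank entries: HMAC-SHA256 over a
# canonical JSON serialisation. Key management is deliberately out of scope.
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-key-from-your-kms"  # assumption: managed key

def _payload(entry: dict) -> bytes:
    # Canonical serialisation of everything except the signature field.
    body = {k: v for k, v in entry.items() if k != "sig"}
    return json.dumps(body, sort_keys=True).encode()

def sign_entry(entry: dict) -> dict:
    signed = dict(entry)
    signed["sig"] = hmac.new(SECRET_KEY, _payload(entry),
                             hashlib.sha256).hexdigest()
    return signed

def verify_entry(entry: dict) -> bool:
    expected = hmac.new(SECRET_KEY, _payload(entry),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry.get("sig", ""), expected)
```

Rejecting, quarantining and logging any entry that fails verification gives you the audit trail and offline-revalidation hooks mentioned above.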

MAAG is not a silver bullet, but it is a pragmatic step: reusable signatures in activation space plus a simulation-and-review loop buy resilience without constant retraining. Treat it as another layer in defence-in-depth, and plan to harden the memory store and monitor for drift and poisoning before you trust it with high-risk deployments.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

Authors: Jun Leng, Litian Zhang, and Xi Zhang
Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. These attacks involve carefully crafted prompts that bypass safety guardrails and induce models to produce harmful content. Detecting such malicious input queries is therefore critical for maintaining LLM safety. Existing methods for jailbreak detection typically involve fine-tuning LLMs as static safety LLMs using fixed training datasets. However, these methods incur substantial computational costs when updating model parameters to improve robustness, especially in the face of novel jailbreak attacks. Inspired by immunological memory mechanisms, we propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. The core idea is to equip guard with memory capabilities: upon encountering novel jailbreak attacks, the system memorizes attack patterns, enabling it to rapidly and accurately identify similar threats in future encounters. Specifically, MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection. A defense agent then simulates responses based on these detection results, and an auxiliary agent supervises the simulation process to provide secondary filtering of the detection outcomes. Extensive experiments across five open-source models demonstrate that MAAG significantly outperforms state-of-the-art (SOTA) methods, achieving 98% detection accuracy and a 96% F1-score across a diverse range of attack scenarios.

🔍 ShortSpan Analysis of the Paper

Problem

Large language models (LLMs) are increasingly used in AI systems but remain vulnerable to adversarial jailbreak attacks, where carefully crafted prompts bypass safety guardrails and cause harmful content to be produced. Existing jailbreak detection methods typically rely on fine-tuning LLMs on fixed training data, which incurs substantial computational cost whenever parameters must be updated to counter novel attacks. There is a need for adaptive detection that can learn from past jailbreak attempts and rapidly recognise and mitigate new prompt-injection threats without constant retraining.

Approach

The authors introduce the Multi-Agent Adaptive Guard (MAAG), a memory-based defensive framework inspired by immunological memory. MAAG operates in three stages.

Immune Detection extracts activation values from input prompts and compares them against a memory bank of historical attack and benign states, performing fast preliminary jailbreak detection via top-K retrieval from multiple LLM layers and identifying a critical layer where attack and benign signatures are most separable. Response Simulation deploys two collaborating agents: a defence agent generates a candidate response while a reflection agent supervises the process, validating safety and providing corrective feedback, iterating until the generated content meets predefined safety standards. Memory Update stores the processed results as hidden states paired with simulated responses in short-term and long-term memory, enabling rapid retrieval and continual learning from new attack patterns while reducing retrieval overhead.

The memory bank holds both attack and benign states and uses vector-similarity search over hidden states for efficient scaling. Layer-wise discriminative analysis determines the most separable layer for detection, and averages of the top-K similar states form reference vectors for classification. MAAG is designed to be backend-agnostic: base models include Mistral 7B, Vicuna 7B, Vicuna 13B, Llama2 7B and Llama3 8B, with the simulation backed by GPT-4o mini in the experiments. The framework supports continual memory refinement to counter evolving jailbreak tactics without explicit retraining of the LLMs.
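The layer-wise analysis and reference-vector classification lend themselves to a short sketch. The separability score below (distance between class centroids over within-class spread) and the helper names are our assumptions about one plausible realisation, not the paper's exact formulation.

```python
# Illustrative layer selection and reference-vector classification.
# attack_acts / benign_acts are arrays of shape (n_samples, hidden_dim).
import numpy as np

def separability(attack_acts, benign_acts):
    # Higher when the two classes' activation clouds are far apart and tight.
    mu_a, mu_b = attack_acts.mean(axis=0), benign_acts.mean(axis=0)
    spread = attack_acts.std(axis=0).mean() + benign_acts.std(axis=0).mean()
    return np.linalg.norm(mu_a - mu_b) / (spread + 1e-8)

def pick_critical_layer(acts_by_layer):
    # acts_by_layer: {layer_index: (attack_acts, benign_acts)}
    return max(acts_by_layer, key=lambda l: separability(*acts_by_layer[l]))

def classify(activation, attack_acts, benign_acts, k=5):
    # Reference vector per class: mean of the top-K most similar stored states.
    def reference(bank):
        sims = bank @ activation / (
            np.linalg.norm(bank, axis=1) * np.linalg.norm(activation))
        return bank[np.argsort(sims)[::-1][:k]].mean(axis=0)

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    a_ref, b_ref = reference(attack_acts), reference(benign_acts)
    return "attack" if cos(activation, a_ref) > cos(activation, b_ref) else "benign"
```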

Key Findings

  • MAAG outperforms state-of-the-art methods across five open-source models and six jailbreak attacks, demonstrating strong adaptive detection through memory-based matching and post-generation verification.
  • Average detection accuracy in the experiments is 85% across the tested models and attack types, with F1 scores varying by attack and model: near 0.99 on several simpler attacks, but lower on highly obfuscated prompts such as Base64 and Zulu for some backbones.
  • On XSTest and other safety benchmarks MAAG achieves high accuracy, in some scenarios near 97% for challenging attacks, and shows robustness across different LLM backbones without fine-tuning or retraining.
  • MAAG generalises well to unseen threats, with roughly 85% accuracy on WildBench, a zero-day-style dataset, exceeding static detectors, which drop substantially under attack distribution shift.
  • An ablation study shows that removing immune detection or response simulation greatly degrades performance, while the memory update enables long-term adaptability; a four-step case study illustrates explainability, with intermediate agent outputs guiding the final decision.
  • The framework adds mild latency of about ten seconds per query due to the multi-agent pipeline, but achieves substantial gains in reliability and reduces the need for expensive model retraining.

Limitations

Potential risks include the memory bank becoming an attack surface through poisoning, tampering, or leakage of prompts or activation states. Secure memory isolation, integrity checks and careful evaluation of false positives are required. Additional constraints include compute and memory scalability, handling model drift over time, and defending against memory-targeted evasion. The authors acknowledge these factors and advocate robust evaluation before production deployment.

Why It Matters

MAAG offers an adaptive defence mechanism that can rapidly recognise evolving jailbreak prompts by reusing learned attack signatures, potentially improving detection speed and robustness without retraining. It improves practical LLM safety by reducing the likelihood of harmful content generation, with implications for safer deployment in critical domains and for mitigating misuse such as disinformation or illicit guidance. Real-world deployment will require careful assessment of compute and latency costs, memory scalability, drift handling, and defences against memory-targeted attacks.

