New lightweight guard catches adversarial prompts fast
Defenses
Lede: A new paper presents ADRAG (Adversarial Distilled Retrieval-Augmented Guard), a two-stage approach that aims to stop malicious prompts in real time without a room-sized model. This matters because operators need fast, reliable intent detection at scale when large language models (LLMs) power chatbots and automation.
Nut graf: ADRAG combines adversarial training, retrieval of safety exemplars, and selective distillation to move robustness from a heavyweight teacher into a compact student that can make decisions in milliseconds. For practitioners and decision makers the trade-off is familiar: slightly more engineering complexity in exchange for avoiding expensive large-model inference while preserving safety coverage and lowering latency.
Background: The paper trains a large teacher on adversarially perturbed, retrieval-augmented inputs so it learns tougher decision boundaries. It then uses a distillation scheduler and an evolving online knowledge base of safety exemplars to transfer that knowledge into a 149-million-parameter student.
How it works: At inference the compact student retrieves the top-K most similar safety exemplars from the online knowledge base and classifies the incoming query with that context. During training the authors simulate noisy retrieval and attacks via Retrieval-Augmented Adversarial Fine-Tuning (RAFT) and use Selective Knowledge Distillation (SKD) to keep the student small but capable.
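The retrieval step above amounts to a nearest-neighbour lookup over embedded safety exemplars. The following is a minimal Python sketch with toy vectors; the embedding model, index layout, and exemplar store are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def top_k_exemplars(query_vec, exemplar_vecs, k=3):
    """Rank safety exemplars by cosine similarity and return the top-k indices."""
    q = query_vec / np.linalg.norm(query_vec)
    e = exemplar_vecs / np.linalg.norm(exemplar_vecs, axis=1, keepdims=True)
    sims = e @ q
    return np.argsort(sims)[::-1][:k]

# Toy knowledge base: random embeddings with alternating labels. A real system
# would use a learned encoder and an approximate-nearest-neighbour index.
rng = np.random.default_rng(0)
exemplar_vecs = rng.normal(size=(10, 8))
labels = ["malicious" if i % 2 == 0 else "benign" for i in range(10)]

# A query lying very close to exemplar 4 should retrieve it first.
query_vec = exemplar_vecs[4] + 0.01 * rng.normal(size=8)
idx = top_k_exemplars(query_vec, exemplar_vecs, k=3)
print(idx[0])  # → 4
```

In a deployment like the one described, the retrieved exemplars and their labels would be concatenated with the query and fed to the student classifier, and the brute-force scan would be replaced by an indexed lookup to stay within the latency budget.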
Impact and risk: Measured across ten safety benchmarks, ADRAG's 149M variant reaches 98.5% of WildGuard-7B's performance and beats GPT-4 on out-of-distribution detection by 3.3 percentage points, while cutting latency by up to 5.6x at 300 queries per second (QPS). That makes it attractive for production moderation and low-latency control planes. The catch is new attack surfaces: an evolving knowledge base and retrieval pipeline can be poisoned or manipulated to mislead the detector, and distillation can carry latent vulnerabilities into the compact model.
Mitigations and what to do next: The authors recommend adversarial training for the teacher, guarded procedures for updating the knowledge base, retrieval filtering and monitoring, and staged distillation to retain crucial safety knowledge. Operational teams should treat exemplar stores like any other trust boundary: version, sign and audit updates, simulate poisoning, and run red-teaming focused on retrieval attacks.
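Treating the exemplar store as a trust boundary can start with signed, versioned updates that are verified before they reach the retriever. A minimal sketch using HMAC as a stand-in for asymmetric signatures; the key handling and record format are illustrative assumptions, not a prescribed design:

```python
import hmac, hashlib, json

SIGNING_KEY = b"replace-with-managed-key"  # in production, an asymmetric key from a KMS

def sign_update(exemplars, version):
    """Produce a signed, versioned knowledge-base update record."""
    payload = json.dumps({"version": version, "exemplars": exemplars},
                         sort_keys=True).encode()
    tag = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "sig": tag}

def verify_update(record):
    """Reject tampered or unsigned updates before they enter the retriever."""
    expected = hmac.new(SIGNING_KEY, record["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])

update = sign_update([{"text": "example jailbreak prompt", "label": "malicious"}],
                     version=7)
assert verify_update(update)

# Flipping a label without re-signing (a poisoning attempt) fails verification.
tampered = dict(update, payload=update["payload"].replace("malicious", "benign"))
assert not verify_update(tampered)
```

Versioning each record also gives auditors a trail for the "simulate poisoning" exercises the article recommends: replaying old versions against the detector shows when a poisoned batch changed its decisions.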
Limitations and caveats: ADRAG’s gains depend on the base model quality and clean exemplars. Distillation reduces footprint but may not eliminate subtle failure modes. Latency optimisations help availability but should not replace layered defences.
Forward look: ADRAG points to a practical path for real-time guarding that balances accuracy and speed, but deploying it safely requires disciplined lifecycle controls for the knowledge base and continuous adversarial testing.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Adversarial Distilled Retrieval-Augmented Guarding Model for Online Malicious Intent Detection
🔍 ShortSpan Analysis of the Paper
Problem
Online malicious intent detection is increasingly required as LLMs operate in real time. Existing approaches struggle to handle diverse and complex user queries quickly while maintaining accuracy, integrity and safety in dynamic online environments. This work addresses the need for robust, real-time detection that can adapt to evolving threats without excessive latency.
Approach
The authors propose ADRAG, a two-stage framework combining adversarial training and retrieval-augmented learning with efficient distillation. In training, a high-capacity teacher model is trained on adversarially perturbed, retrieval-augmented inputs to learn robust decision boundaries over diverse queries. For inference, a distillation scheduler transfers knowledge from the teacher to a compact student model, while an online evolving knowledge base provides continually updated safety exemplars. At deployment, the student uses the top-K retrieved safety exemplars from the online knowledge base to perform real-time malicious intent detection. ADRAG introduces Retrieval-Augmented Adversarial Fine-Tuning (RAFT) to simulate noise in knowledge-base construction, retrieval, and user input, and Selective Knowledge Distillation (SKD) to compress the teacher into a fast student, with an encoder scheduler governing staged distillation. An evolving knowledge base, enriched by user feedback and policy-guided synthetic generation, together with selective retrieval of context, supports continual adaptation while maintaining low latency. For retrieval, the query is augmented with retrieved context and prior query-label pairs and then fed into the classifier. The training objective integrates standard classification loss with adversarial and contextual losses to improve robustness and accuracy under perturbations.
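The training objective described above, a classification loss combined with adversarial and contextual terms, can be sketched as a weighted sum. The weights, logits, and weighting scheme below are illustrative assumptions; the paper's exact formulation may differ:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example, computed stably."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def combined_loss(clean_logits, adv_logits, ctx_logits, label,
                  lam_adv=0.5, lam_ctx=0.5):
    """Weighted sum of clean, adversarial, and contextual classification losses.
    lam_adv and lam_ctx are hypothetical weights for illustration."""
    return (cross_entropy(clean_logits, label)
            + lam_adv * cross_entropy(adv_logits, label)
            + lam_ctx * cross_entropy(ctx_logits, label))

# Logits for the same query on clean, adversarially perturbed, and
# context-augmented inputs; label 0 is the correct class.
loss = combined_loss(np.array([2.0, -1.0]), np.array([1.2, -0.3]),
                     np.array([1.8, -0.9]), label=0)
print(round(float(loss), 3))  # → 0.182
```

The adversarial term penalises confidence lost under perturbation, while the contextual term rewards consistency with the retrieved exemplars, which is what lets RAFT tolerate noisy retrieval at inference time.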
Key Findings
- ADRAG with a 149-million-parameter model achieves 98.5 per cent of the performance of WildGuard-7B on ten safety benchmarks, while delivering up to 5.6 times lower latency at 300 queries per second.
- On out-of-distribution detection, ADRAG surpasses GPT-4 by 3.3 percentage points and Llama Guard 3 (8B) by 9.5 percentage points on average.
- End-to-end latency remains below 9 ms at 300 QPS, with retrieval contributing approximately 4 ms of the total.
- The distilled variant, ADRAG-SKD, preserves most of the accuracy at even lower latency, achieving p90 and p99 latencies of around 2.8 to 3.1 ms in some configurations.
- Across in-distribution and out-of-distribution tests, ADRAG performs strongly on several datasets, with improved results on Dynahate and JailJudge and competitive results on ToxicChat and ToxiGen; WildJailbreak and CounterfactualFair show variable results as model sizes change.
- Retrieval augmentation improves generalisation, but tuning retrieval thresholds alone is insufficient; RAFT provides robust context-aware learning and high context coverage across datasets (>97 per cent on most benchmarks).
- Ablation studies indicate RAFT yields the best overall robustness gains, while SKD offers strong efficiency gains and preserves key knowledge during distillation; combining RAFT and SKD delivers the best accuracy-efficiency trade-offs.
- Extensive evaluation across model sizes shows ADRAG's effectiveness scales with base-model capacity, with stronger base models benefiting more from RAFT and SKD.
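The distillation the findings refer to builds on the standard softened-KL objective for transferring a teacher's output distribution to a student; SKD's selective scheduling is not modelled in this minimal sketch, and the temperature and logits are illustrative assumptions:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax, computed stably."""
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    the usual knowledge-distillation objective."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return float(np.sum(t * (np.log(t) - np.log(s))))

# A student that tracks the teacher incurs far less loss than one that
# inverts its decision.
aligned = distillation_loss(np.array([3.0, -1.0]), np.array([2.5, -0.9]))
mismatched = distillation_loss(np.array([3.0, -1.0]), np.array([-1.0, 3.0]))
assert aligned < mismatched
```

Driving this loss down per layer, under a scheduler that decides which knowledge to transfer at each stage, is what lets the 149M student retain most of the 7B-class teacher's accuracy.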
Limitations
ADRAG relies on the strength of its base models and the quality of the retrieval component. Online knowledge-base updates introduce potential data-poisoning and prompt-injection risks for safety exemplars, and the retrieval process and topology of the knowledge base can be manipulated to mislead the detector. Distilling to a compact model may transfer latent vulnerabilities from the teacher, and latency optimisations could be exploited if defences fail. The authors propose mitigations including adversarially trained teachers, retrieval-augmented safety exemplars, continuous safeguards for the evolving knowledge base, and selective distillation to balance stability and adaptability while maintaining performance. They also note diminishing returns in domains where baseline performance is near ceiling, and plan future work on dynamic memory updates and multilingual or multimodal extensions.
Why It Matters
The work demonstrates a practical, robust real-time guard for detecting malicious prompts in LLM interactions, combining adversarial training with retrieval-augmented safety exemplars and efficient distillation. The approach has implications for moderation quality and user privacy, with trade-offs between safety and freedom of expression. In security terms, ADRAG addresses evasion by adversarial inputs through defence in depth, yet relies on safeguards against data poisoning of the online knowledge base and retrieval-based manipulation. The framework offers a path toward scalable, responsive moderation that adapts to evolving threats in live systems.