
Defend RAG Systems Against Knowledge Poisoning

Defenses
Published: Tue, Nov 04, 2025 • By Adrian Calder
RAGDefender offers a lightweight post-retrieval defence against knowledge-poisoning attacks on Retrieval-Augmented Generation (RAG) systems. Without retraining or extra LLM inferences, it filters poisoned passages, sharply reducing attack success rates in tests (e.g. lowering Gemini ASR from 0.89 to 0.02) while running faster than competing defences and using no GPU memory.

Large Language Models (LLMs) paired with external knowledge stores are useful, whether or not their behaviour is predictable. Retrieval-Augmented Generation, or RAG, is the common architecture: a retriever finds relevant passages and a generator uses them to answer questions. That combination fixes some hallucinations but introduces a fresh attack surface. An adversary who can inject or poison documents in the backing corpus can steer outputs by corrupting the retrieved context. That is knowledge corruption, and it is a practical, under-appreciated risk for web services that rely on external or crawlable sources.
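
To see where that attack surface sits, here is a schematic sketch of the flow; `retrieve` and `generate` are hypothetical stand-ins for a real retriever and LLM client, not any particular library:

```python
# Schematic RAG flow (a sketch, not a real system). `retrieve` and `generate`
# are hypothetical stand-ins for a retriever and an LLM client.
def rag_pipeline(question, retrieve, generate, k=4):
    passages = retrieve(question, k)  # the step an attacker can influence
    # Whatever the retriever returns lands verbatim in the prompt, so a
    # poisoned corpus directly shapes the generated answer.
    context = "\n".join(passages)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```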

How RAGDefender works

The paper presents RAGDefender, a defence that sits after retrieval and before generation. Its selling point is pragmatic: it does not require retraining models or calling the LLM for extra checks. Instead it uses lightweight techniques to group and score retrieved passages, then filters probable adversarial items. For single-hop queries it applies hierarchical clustering with TF-IDF to spot anomalous clusters. For multi-hop queries it looks at concentration in embedding space. A second stage ranks passages by frequency in top-similar pairs and semantic relations to identify and remove the likely poisoned ones. The implementation uses Sentence Transformers with the Stella embedding, FAISS for storage and scikit-learn for clustering.
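
As a rough illustration of the stage-one idea, here is a minimal sketch assuming scikit-learn; the function name and the fixed two-way split are simplifications for illustration, not the paper's implementation:

```python
# A minimal sketch of stage one: cluster retrieved passages over TF-IDF
# features. The intuition is that poisoned passages crafted to hijack one
# query tend to be lexically similar to each other, so they land together
# in a tight, anomalous cluster that can then be scored and filtered.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def split_retrieved_passages(passages, n_clusters=2):
    vectors = TfidfVectorizer().fit_transform(passages).toarray()
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(vectors)

passages = [
    "The Eiffel Tower is in Paris, France.",               # benign
    "Paris is the capital of France.",                     # benign
    "The Eiffel Tower was moved to Berlin in 2021.",       # injected
    "Officially, the Eiffel Tower now stands in Berlin.",  # injected
]
print(split_retrieved_passages(passages))  # e.g. [0 0 1 1]: injected pair grouped
```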

The empirical results are striking, at least on the tested setups. On Natural Questions with four adversarial passages per benign passage, the Gemini model's attack success rate (ASR) falls from 0.89 to 0.02 with RAGDefender. Competing approaches reported higher residual ASR in the same scenario: 0.69 for RobustRAG and 0.24 for Discern-and-Answer. Across MS MARCO and multiple retrievers the defence also achieved low ASR and improved accuracy. It is far cheaper to run too: roughly 12.3 times faster than RobustRAG in the authors' measurements, with no GPU memory required, which matters if you are protecting a live service rather than training in a research cluster.

Limitations and what to do next

This is not a magic wand. The method was validated on specific English corpora and typical retrieval sizes (k around 3 to 5). The authors note there are no formal guarantees that adversarial passages always cluster densely, and stronger adaptive attackers remain an open problem. Do not assume identical results on multilingual, multimodal or very large retrieval sets without testing.

Why it matters: if you operate a RAG system, a low-cost post-retrieval filter that meaningfully reduces ASR is a realistic tool to improve integrity without the expense of retraining or large inference budgets. Practical next steps: test RAGDefender on your corpora and retrieval settings, instrument ASR-style metrics during red-team exercises, and combine post-retrieval filtering with provenance and access controls on your knowledge sources.

  • Run controlled poisoning tests against your RAG pipeline (a minimal ASR harness is sketched after this list).
  • Evaluate RAGDefender latency and false positive trade-offs on real data.
  • Harden ingestion and provenance for external content alongside filtering.
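
A hedged sketch of such an ASR-style harness for red-team exercises follows; `rag_answer` is a hypothetical stand-in for your pipeline, and the success criterion (target string appears in the answer) is one common convention, not the paper's exact metric:

```python
# Count an attack as a success when the answer contains the attacker's
# intended target string, then report the success fraction.

def attack_success_rate(poisoned_queries, rag_answer):
    """poisoned_queries: list of (question, attacker_target) pairs."""
    hits = sum(
        1 for question, target in poisoned_queries
        if target.lower() in rag_answer(question).lower()
    )
    return hits / len(poisoned_queries)

# Toy usage with a trivial stand-in pipeline:
def rag_answer(question):
    return "The Eiffel Tower is in Paris."

tests = [("Where is the Eiffel Tower?", "Berlin")]
print(attack_success_rate(tests, rag_answer))  # 0.0 means the attack failed
```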

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems

Authors: Minseok Kim, Hankook Lee, and Hyungjoon Koo
Large language models (LLMs) are reshaping numerous facets of our daily lives, leading to widespread adoption as web-based services. Despite their versatility, LLMs face notable challenges, such as generating hallucinated content and lacking access to up-to-date information. Lately, to address such limitations, Retrieval-Augmented Generation (RAG) has emerged as a promising direction by generating responses grounded in external knowledge sources. A typical RAG system consists of i) a retriever that probes a group of relevant passages from a knowledge base and ii) a generator that formulates a response based on the retrieved content. However, as with other AI systems, recent studies demonstrate the vulnerability of RAG, such as knowledge corruption attacks by injecting misleading information. In response, several defense strategies have been proposed, including having LLMs inspect the retrieved passages individually or fine-tuning robust retrievers. While effective, such approaches often come with substantial computational costs. In this work, we introduce RAGDefender, a resource-efficient defense mechanism against knowledge corruption (i.e., by data poisoning) attacks in practical RAG deployments. RAGDefender operates during the post-retrieval phase, leveraging lightweight machine learning techniques to detect and filter out adversarial content without requiring additional model training or inference. Our empirical evaluations show that RAGDefender consistently outperforms existing state-of-the-art defenses across multiple models and adversarial scenarios: e.g., RAGDefender reduces the attack success rate (ASR) against the Gemini model from 0.89 to as low as 0.02, compared to 0.69 for RobustRAG and 0.24 for Discern-and-Answer when adversarial passages outnumber legitimate ones by a factor of four (4x).

🔍 ShortSpan Analysis of the Paper

Problem

RAG systems combining retrievers and generators are vulnerable to knowledge corruption caused by data poisoning, where adversarial passages are injected into knowledge bases to mislead outputs. Existing defenses incur substantial computational costs or retraining, hindering practical deployment in web services that rely on external sources.

Approach

The authors propose RAGDefender, a lightweight post-retrieval defence that does not require retraining or extra LLM inferences. It operates in two stages. Stage one groups retrieved passages to estimate the number of adversarial passages, using two strategies: hierarchical agglomerative clustering over TF-IDF features for single-hop questions, and concentration-based grouping using embedding-space concentration factors for multi-hop questions. Stage two identifies adversarial passages by ranking them by frequency of occurrence in the most similar passage pairs and by semantic relations, guided by the estimated number of adversarial passages. The surviving passages are passed to the generator. The system uses Sentence Transformers with the Stella embedding model, FAISS for storage and scikit-learn for clustering, and is compatible with various RAG architectures. The approach stays efficient by using TF-IDF to identify clusters and by avoiding extra model training or inference.
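
A minimal sketch of the stage-two ranking heuristic, with placeholder embeddings (the paper uses Sentence Transformers with the Stella model; the scoring below is a simplification of "frequency in top-similar pairs", not the authors' exact algorithm):

```python
# Count how often each passage appears among the most similar pairs;
# near-duplicate passages, a hallmark of batch-injected poison, score high.
import itertools
import numpy as np

def pair_frequency_scores(embeddings, top_pairs=3):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    ranked = sorted(
        itertools.combinations(range(len(embeddings)), 2),
        key=lambda ij: -float(normed[ij[0]] @ normed[ij[1]]),
    )[:top_pairs]
    counts = np.zeros(len(embeddings), dtype=int)
    for i, j in ranked:
        counts[i] += 1
        counts[j] += 1
    return counts  # higher count = more suspicious under this heuristic

# Toy usage: four 3-d "embeddings"; indices 2 and 3 are near-duplicates.
emb = np.array([[1.00, 0.00, 0.00],
                [0.00, 1.00, 0.00],
                [0.70, 0.70, 0.00],
                [0.69, 0.71, 0.01]])
print(pair_frequency_scores(emb, top_pairs=2))  # index 3 scores highest
```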

Key Findings

  • RAGDefender consistently outperforms the state of the art across multiple models and adversarial scenarios, with lower attack success rate (ASR) and higher accuracy.
  • On Natural Questions with four adversarial passages per benign passage (4×), ASR for Gemini drops from 0.89 to 0.02, while RobustRAG yields 0.69 and Discern-and-Answer 0.24.
  • On MS MARCO, across three retrieval models, RAGDefender reaches ASR as low as 0.04 with Gemini while achieving higher accuracy than competing methods.
  • RAGDefender is roughly 12.3x faster than RobustRAG and uses no GPU memory, unlike competing methods that incur large memory footprints during fine-tuning or inference.
  • Across multiple datasets and architectures, RAGDefender maintains low ASR and high accuracy, showing robustness to various poisoning tactics including PoisonedRAG, GARAG and the Tan et al. methods.
  • The two-stage design improves robustness, with the combination outperforming either stage alone; mis-partitioning in stage one is mitigated by stage two using semantic relationships and top terms.

Limitations

Effectiveness beyond the tested corpora, including multimodal or multilingual data, is not established. Performance may vary with larger or unusual retrieval sizes beyond the typical k around 3 to 5. There are no formal theoretical guarantees that adversarial passages form dense clusters. Stronger adaptive strategies remain a challenge for future work.

Why It Matters

RAGDefender reduces the risk of knowledge corruption in AI services that rely on external sources, helping to curb misinformation and maintain trust and reliability in AI-enabled systems across critical domains. Its resource efficiency and compatibility with existing RAG pipelines make it practical to deploy in real-world security contexts.

