Defend RAG Systems Against Knowledge Poisoning
Defenses
Large Language Models (LLMs) paired with external knowledge stores are useful, whether their behaviour is predictable or not. Retrieval-Augmented Generation, or RAG, is the common architecture: a retriever finds relevant passages and a generator uses them to answer questions. That combination fixes some hallucinations but introduces a fresh attack surface. An adversary who can inject or poison documents in the backing corpus can steer outputs by corrupting the retrieved context. That is knowledge corruption, and it is a practical, under-appreciated risk for web services that rely on external or crawlable sources.
How RAGDefender works
The paper presents RAGDefender, a defence that sits after retrieval and before generation. Its selling point is pragmatic: it does not require retraining models or calling the LLM for extra checks. Instead it uses lightweight techniques to group and score retrieved passages, then filters probable adversarial items. For single-hop queries it applies hierarchical clustering with TF-IDF to spot anomalous clusters. For multi-hop queries it looks at concentration in embedding space. A second stage ranks passages by frequency in top-similar pairs and semantic relations to identify and remove the likely poisoned ones. The implementation uses Sentence Transformers with the Stella embedding, FAISS for storage and scikit-learn for clustering.
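The single-hop path can be sketched in a few lines. The snippet below vectorises retrieved passages with TF-IDF and clusters them hierarchically, then flags the smaller cluster as likely poisoned. The majority-cluster heuristic is an assumption for illustration; it is not the paper's exact estimator for the number of adversarial passages.

```python
# Illustrative stage-one filter in the spirit of RAGDefender's single-hop path.
# Assumption: benign passages form the larger TF-IDF cluster; the paper's
# actual adversarial-count estimation is more involved.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering


def flag_suspect_passages(passages, n_clusters=2):
    """Split retrieved passages into (kept, flagged) via TF-IDF clustering."""
    vectors = TfidfVectorizer().fit_transform(passages).toarray()
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(vectors)
    # Heuristic: treat the larger cluster as benign (an assumption).
    counts = {lab: (labels == lab).sum() for lab in set(labels)}
    benign = max(counts, key=counts.get)
    kept = [p for p, lab in zip(passages, labels) if lab == benign]
    flagged = [p for p, lab in zip(passages, labels) if lab != benign]
    return kept, flagged
```

Only the kept passages would then be handed to the second stage for ranking, keeping the whole check free of extra model calls.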
The empirical results are striking, at least on the tested setups. On Natural Questions with four adversarial passages per benign passage, the Gemini model's attack success rate (ASR) falls from 0.89 to 0.02 with RAGDefender. Competing approaches reported higher residual ASR: RobustRAG 0.69 and Discern-and-Answer 0.24 in the same scenario. Across MS MARCO and multiple retrievers the defence also achieved low ASR and improved accuracy. It is far cheaper to run too: roughly 12.3 times faster than RobustRAG in their measurements and it needs no GPU memory, which matters if you are protecting a live service rather than training in a research cluster.
Limitations and what to do next
This is not a magic wand. The method was validated on specific English corpora and typical retrieval sizes (k around 3 to 5). The authors note there are no formal guarantees that adversarial passages always cluster densely, and stronger adaptive attackers remain an open problem. Do not assume identical results on multilingual, multimodal or very large retrieval sets without testing.
Why it matters: if you operate a RAG system, a low-cost post-retrieval filter that meaningfully reduces ASR is a realistic tool to improve integrity without the expense of retraining or large inference budgets. Practical next steps: test RAGDefender on your corpora and retrieval settings, instrument ASR-style metrics during red-team exercises, and combine post-retrieval filtering with provenance and access controls on your knowledge sources.
- Run controlled poisoning tests against your RAG pipeline.
- Evaluate RAGDefender latency and false positive trade-offs on real data.
- Harden ingestion and provenance for external content alongside filtering.
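For the instrumentation step, an ASR-style metric is easy to wire into a red-team harness: it is just the fraction of poisoned queries whose answer contains the attacker's target. The function and sample runs below are hypothetical illustrations, not part of the paper's tooling.

```python
# Minimal ASR-style metric for controlled poisoning tests (illustrative).
def attack_success_rate(results):
    """results: list of (model_answer, attacker_target) pairs."""
    hits = sum(target.lower() in answer.lower() for answer, target in results)
    return hits / len(results) if results else 0.0


# Hypothetical red-team runs: two of three attacks landed.
runs = [
    ("The capital of France is Berlin.", "Berlin"),  # attack succeeded
    ("The capital of France is Paris.", "Berlin"),   # attack failed
    ("Berlin is the capital of France.", "Berlin"),  # attack succeeded
]
print(round(attack_success_rate(runs), 2))  # 0.67
```

Running the same queries with and without the post-retrieval filter gives a before/after ASR pair you can track over time.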
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems
🔍 ShortSpan Analysis of the Paper
Problem
RAG systems combining retrievers and generators are vulnerable to knowledge corruption caused by data poisoning, where adversarial passages are injected into knowledge bases to mislead outputs. Existing defenses incur substantial computational costs or retraining, hindering practical deployment in web services that rely on external sources.
Approach
The authors propose RAGDefender, a lightweight post-retrieval defence that does not require retraining or extra LLM inferences. It operates in two stages. Stage one groups retrieved passages to estimate the number of adversarial passages using two strategies: clustering based on hierarchical agglomerative clustering with TF-IDF for single-hop questions, and concentration-based grouping using embedding-space concentration factors for multi-hop questions. Stage two identifies adversarial passages by ranking passages by frequency of occurrence in top-similar pairs and semantic relations, guided by the estimated number of adversarial passages. The safe passages are passed to the generator. The system uses Sentence Transformers with the Stella embedding model, FAISS for storage, and scikit-learn for clustering, and is compatible with various RAG architectures. The approach is designed to be efficient by using TF-IDF to identify clusters and avoiding extra model training or inferences.
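The stage-two ranking idea can be sketched as follows: count how often each passage appears among the most-similar passage pairs, on the intuition that injected passages are crafted to be mutually similar. Plain vectors stand in for the Stella embeddings used in the paper, and the scoring rule is a simplification of the described method, not a reproduction of it.

```python
# Illustrative stage-two scoring: passages appearing often in top-similar
# pairs are treated as more suspicious. A simplification of the paper's rule.
import numpy as np


def pair_frequency_scores(embeddings, top_pairs=3):
    """Score each passage by its frequency among the top-similar pairs."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T  # cosine similarity matrix
    n = len(e)
    pairs = [(sim[i, j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    scores = np.zeros(n, dtype=int)
    for _, i, j in pairs[:top_pairs]:
        scores[i] += 1
        scores[j] += 1
    return scores  # higher score = more suspicious under this heuristic
```

Guided by stage one's estimate of how many passages are adversarial, the highest-scoring passages would be dropped and the rest passed to the generator.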
Key Findings
- RAGDefender consistently outperforms the state of the art across multiple models and adversarial scenarios, with lower attack success rate (ASR) and higher accuracy.
- On Natural Questions with four adversarial passages per benign passage (4×), ASR for Gemini drops from 0.89 to 0.02, while RobustRAG yields 0.69 and Discern-and-Answer 0.24.
- On MS MARCO, across three retrieval models, RAGDefender drives ASR as low as 0.04 with Gemini while achieving higher accuracy than competing methods, and it sustains low ASR and substantial accuracy gains across configurations.
- RAGDefender offers substantial speed advantages over RobustRAG, about 12.3x faster, and uses no GPU memory, unlike competing methods which incur large memory footprints during fine-tuning or inference.
- Across multiple datasets and architectures, RAGDefender maintains low ASR and high accuracy, showing robustness to various poisoning tactics including PoisonedRAG, GARAG and the method of Tan et al.
- The two-stage design improves robustness, with the combination outperforming either stage alone; mis-partitioning scenarios are mitigated by stage two using semantic relationships and top terms.
Limitations
Effectiveness beyond the tested corpora, including multimodal or multilingual data, is not established. Performance may vary with larger or unusual retrieval sizes beyond the typical k around 3 to 5. There are no formal theoretical guarantees that adversarial passages form dense clusters. Stronger adaptive strategies remain a challenge for future work.
Why It Matters
RAGDefender reduces the risk of knowledge corruption in AI services that rely on external sources, helping to curb misinformation and maintain trust and reliability in AI-enabled systems across critical domains. Its resource efficiency and compatibility with existing RAG pipelines ease practical deployment in real-world security contexts.