New tool traces poisoned texts in RAG systems
Defenses
A new paper presents RAGOrigin, a responsibility attribution method for Retrieval-Augmented Generation (RAG) systems that helps answer a blunt question: who taught the lie? That matters because RAG systems pair a Large Language Model (LLM) with external text databases, and poisoned texts in those databases can steer outputs in ways that carry security and compliance consequences.
The practical stakes are simple for security teams. If you can point to the specific text that caused a misgeneration, you can quarantine it, audit the provider and reduce harm without retraining models or tearing down pipelines. The paper evaluates the approach across seven datasets and 15 attack variants, including adaptive and multi-attacker scenarios.
How it works
RAGOrigin operates as a black-box forensics tool: it assumes access to the knowledge store and the misgenerated response, but not to model internals. It narrows the universe of candidate texts by building an adaptive attribution scope based on retrieval similarity, then scores each candidate using three signals: embedding similarity to the question, semantic correlation with the question as judged by a proxy LLM, and a generation-influence test that measures whether a single text can drive the incorrect output. A simple clustering step then separates likely poisoned texts from benign ones without labels.
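A minimal sketch of that scoring-and-clustering idea, assuming hypothetical helpers embed, proxy_llm_correlation and drives_output stand in for the embedding model, the proxy LLM judgement and the single-text generation test (none of these names come from the paper):

```python
# Minimal sketch of three-signal responsibility scoring followed by
# unsupervised two-cluster separation. The helper functions passed in
# are hypothetical stand-ins, not the paper's actual implementation.
import numpy as np
from sklearn.cluster import KMeans

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-9)

def attribute(question, misgeneration, candidates,
              embed, proxy_llm_correlation, drives_output):
    # Signal 1: embedding similarity between each candidate and the question.
    q_vec = embed(question)
    sim = [float(np.dot(q_vec, embed(t))) for t in candidates]

    # Signal 2: semantic correlation judged by a proxy LLM (score in [0, 1]).
    corr = [proxy_llm_correlation(question, t) for t in candidates]

    # Signal 3: generation influence -- does this text alone reproduce the bad answer?
    gen = [1.0 if drives_output(question, t, misgeneration) else 0.0 for t in candidates]

    # Standardise each signal and average into a responsibility score per text.
    rs = (zscore(sim) + zscore(corr) + zscore(gen)) / 3.0

    # Two-cluster k-means acts as a dynamic threshold: the higher-scoring
    # cluster is treated as the likely poisoned set.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(rs.reshape(-1, 1))
    poisoned_label = labels[int(np.argmax(rs))]
    return [t for t, lab in zip(candidates, labels) if lab == poisoned_label]
```

The two-cluster k-means step plays the role of a dynamic threshold, so no fixed cut-off has to be tuned per dataset.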
In experiments the method outperforms prior baselines, reports low false positive rates (at most 0.03) and false negative rates (at most 0.01), scales to large knowledge bases (tested up to 16.7 million texts), and keeps per-event cost and latency modest (about 2.11 seconds and roughly 0.0004 USD per event with GPT-4o mini as the judgement model in the study). Removing flagged texts often reduces attack success to near zero.
That is not a silver bullet. The paper assumes honest reporting of misgenerations and does not tackle false-flag attacks where adversaries report benign outputs as malicious. Multi-turn dialogues and non-targeted poisoning receive limited attention. The method also requires read access to the knowledge base, which some vendors may not allow.
Practical checks
- Reproduce the misgeneration against the current retriever and inspect top-K sources for recent or unexpected additions (see the sketch after this list).
- Score suspect texts by retrieval rank, embedding similarity and influence on generation, then cluster high scorers for human review.
- Log provenance for every document and require human verification before removing content to reduce the risk of false-flag removals.
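A minimal sketch of the first check, assuming a hypothetical retriever.search API and a document store that records ingestion timestamps and source metadata (both interfaces are assumptions, not part of the paper):

```python
# Re-run retrieval for the question that misgenerated and surface the
# top-K sources with provenance metadata for human review.
# `retriever` and `doc_store` are hypothetical interfaces.
from datetime import datetime, timedelta, timezone

def inspect_top_k(question, retriever, doc_store, k=10, recent_days=7):
    cutoff = datetime.now(timezone.utc) - timedelta(days=recent_days)
    hits = retriever.search(question, top_k=k)   # assumed to return [(doc_id, similarity), ...]
    report = []
    for doc_id, score in hits:
        meta = doc_store.metadata(doc_id)        # assumed to return e.g. {"ingested_at": ..., "source": ...}
        report.append({
            "doc_id": doc_id,
            "similarity": score,
            "source": meta.get("source", "unknown"),
            "recently_added": meta["ingested_at"] >= cutoff,
        })
    # Surface recent additions first, then highest-similarity hits,
    # so a reviewer sees the most suspicious documents immediately.
    return sorted(report, key=lambda r: (not r["recently_added"], -r["similarity"]))
```

Sorting recently added documents to the top gives a reviewer a quick starting point before any content is quarantined.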
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Who Taught the Lie? Responsibility Attribution for Poisoned Knowledge in Retrieval-Augmented Generation
🔍 ShortSpan Analysis of the Paper
Problem
Retrieval-Augmented Generation combines external knowledge with large language models to improve responses, but it is vulnerable to poisoning attacks in which malicious texts are inserted into the knowledge base to steer outputs. The paper addresses responsibility attribution for poisoned knowledge in RAG systems by asking who taught the lie and whether it is possible to identify the texts in the knowledge database that cause a misgeneration. Existing forensics methods often assume access to model internals or rely on direct semantic alignment, and they fail when attackers are adaptive, data are heterogeneous, or the system is a black box. The problem matters for accountability, auditing data pipelines, and enabling targeted remediation in security-sensitive deployments.
Approach
The authors present RAGOrigin, a black-box responsibility attribution framework for RAG systems. It consists of two core components: adaptive scope construction and responsibility attribution. Given a misgeneration event, RAGOrigin narrows the search from the entire knowledge base to an attribution scope by partitioning the knowledge base into segments of size K ordered by retrieval similarity to the user question. It iteratively tests segments by simulating the RAG process and checking whether the generated output matches the misgenerated response, stopping when enough segments yield divergent outputs to capture the likely poisoned region.

Within the attribution scope, it computes a responsibility score for each text using three signals: embedding similarity to the question, semantic correlation between the text and the question measured by a proxy LLM using a specific prompting scheme, and generation influence, which evaluates how likely a text is to drive the incorrect response when used alone. These signals are standardised and averaged to form RS(u). A clustering step then separates poisoned texts from benign ones without labels: to avoid fixed thresholds, a dynamic two-cluster threshold is applied via k-means to identify the poisoned cluster.

The design is intentionally black box, working with access to the knowledge base and the misgeneration event but without access to internal retriever or LLM parameters.
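A rough sketch of the adaptive scope construction loop, assuming illustrative helpers rank_by_similarity (orders the knowledge base by retrieval similarity to the question), simulate_rag (replays retrieval and generation over a given segment) and matches (compares an answer against the reported misgeneration); these names and the exact stopping rule are assumptions, not the paper's implementation:

```python
# Illustrative sketch of adaptive attribution-scope construction:
# walk segments of size K in order of retrieval similarity, replay the
# RAG pipeline on each, and stop once enough segments no longer
# reproduce the misgenerated answer. Helper functions are assumptions.
def build_attribution_scope(question, misgeneration, knowledge_base,
                            rank_by_similarity, simulate_rag, matches,
                            segment_size=50, divergent_needed=2):
    ordered = rank_by_similarity(question, knowledge_base)
    scope, divergent = [], 0
    for start in range(0, len(ordered), segment_size):
        segment = ordered[start:start + segment_size]
        scope.extend(segment)
        answer = simulate_rag(question, segment)
        if not matches(answer, misgeneration):
            divergent += 1           # this segment no longer drives the misgeneration
            if divergent >= divergent_needed:
                break                # likely poisoned region is already covered
    return scope
```

The returned scope is then handed to the responsibility-scoring and clustering step sketched earlier in the article.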
Key Findings
- RAGOrigin consistently identifies poisoned texts with high accuracy across seven datasets and fifteen poisoning attacks, including adaptive and multi-attacker scenarios, outperforming all baselines on detection metrics.
- The method keeps error rates low across datasets, with a false positive rate of at most 0.03 and a false negative rate of at most 0.01 in reported results.
- Compared with RAGForensics, RAGOrigin remains robust against sophisticated adaptive attacks where poisoned texts are optimised to evade semantic alignment or retrieval tricks.
- Removing identified poisoned texts effectively neutralises poisoning attacks in many settings, with attack success rates dropping to near zero in several scenarios, while some baselines fail to mitigate certain attacks due to misclassification.
- RAGOrigin remains effective when the user question is paraphrased or when the knowledge base is enlarged to 16.7 million texts, indicating scalability and robustness to data scale.
- Performance is robust across retrievers and similarity metrics, and the approach remains effective when using various judgement and proxy models, including lightweight LLMs, indicating practical viability in real world deployments.
- Computational overhead is modest and amenable to parallelisation: per misgeneration event, total latency is around 2.11 seconds on a large dataset and monetary cost is around 0.0004 USD when using GPT-4o mini as the judgement model, comparable to existing RAG workloads.
- Open-ended settings such as ELI5 show strong performance with a high detection rate and extremely low error rates, demonstrating generalisability beyond standard QA benchmarks.
Limitations
The study assumes users honestly report misgeneration events and does not address false-flag attacks in which attackers misreport to have benign content removed. Multi-turn dialogue scenarios are not fully explored, and the work focuses on targeted poisoning attacks rather than non-targeted poisoning. Human-in-the-loop verification is suggested as a mitigation for false reporting. The approach relies on access to the knowledge database and misgeneration events but not to model internals, which may limit applicability in some highly closed ecosystems.
Why It Matters
RAGOrigin provides a practical and effective tool for tracing the origins of corrupted knowledge in retrieval-augmented generation systems, supporting threat modelling, security testing, and defence in depth for AI systems that rely on external data sources. By isolating poisoned content, it enables auditability, data pipeline remediation, and accountability for data providers and operators. The approach has societal and security implications by helping reduce misinformation and manipulation risks in high-stakes domains such as finance, healthcare and governance, where incorrect external knowledge can lead to harmful decisions.