Confundo Crafts Robust Poisons for RAG Systems

Attacks

Published: Mon, Feb 09, 2026 • By Natalie Kestrel

Confundo Crafts Robust Poisons for RAG Systems

New research presents Confundo, a learning-to-poison framework that fine-tunes a large language model (LLM) to generate stealthy, robust poisoned content for retrieval-augmented generation (RAG) systems. Confundo survives realistic preprocessing and varied queries, manipulates facts, biases opinions and induces hallucinations while exposing gaps in ingestion, provenance and defensive testing.

There is a pleasing simplicity to the usual story about poisoning attacks on retrieval-augmented generation (RAG): an adversary plants malicious text in a knowledge source, the retriever finds it, the generator repeats it, and bad things happen. Real systems are messier. Documents get preprocessed, split into chunks, normalised and indexed, and users ask questions in ways attackers do not always predict. Confundo is new work that stops pretending those details are incidental.

What Confundo changes

Retrieval-augmented generation (RAG) systems combine a retriever with a generator, often a large language model (LLM), to ground answers in external documents. Prior poisoning experiments often assume the attacker can place a perfect, intact payload and that victims will query in foreseeable ways. The authors show those assumptions hide practical failure modes: fragmentation of poisoned text during chunking and lexical variation in user queries both degrade attack impact.

Confundo treats poisoning as an optimisation problem rather than a prompt trick. The researchers fine-tune a small LLM as a poison generator and train it against surrogate components that emulate realistic RAG pipelines. The training objective combines three pressures: make the poison effective for the chosen attack goal, make it robust to unknown preprocessing and query variation, and make it stealthy enough to look like legitimate content.

Concretely, the framework uses indirect reward signals from surrogate retrievers and a surrogate generator. Rewards include a BM25-like lexical similarity measure, an ensemble of embedding models for semantic alignment, and a generation-based reward keyed to the attack objective. The design also simulates document fragmentation with random prefix and suffix splits, augments text to increase lexical variation, and enforces fluency so poison passages remain natural and hard to flag. Training proceeds through a warm-up and a fine-tuning phase using a group-relative policy optimisation routine.

The paper demonstrates three attack objectives within the same framework: manipulate factual correctness, bias opinions, and induce hallucinations. Across datasets and RAG configurations Confundo outperforms a range of purpose-built attacks and shows better transfer to unknown chunk sizes, retrievers and generators. Defences such as perplexity-based detection, reranking and paraphrasing reduce effectiveness but do not fully eliminate the residual risk.

There is a defensive twist worth noting. The same generation techniques can be used to inject harmless-but-poisonous content into web pages to deter unauthorised scraping and inclusion in third-party RAG systems. That is an interesting guardrail, but it does not replace provenance controls and ingestion policies.

The work is not a prophecy of inevitable chaos. The threat model assumes an attacker who can publish content that will be ingested, and the results depend on how well surrogate models match a target deployment. Still, Confundo closes an important gap between theoretical attacks and the realities of deployed RAG pipelines and forces a rethink of standard testing.

Practical checks for teams

Simulate ingestion: run red-team experiments that include document preprocessing, chunking and indexing variants to see if poisoned fragments survive.
Test retrieval diversity: evaluate your system with multiple retrievers and embedding models and measure dominance of single sources in generator context.
Enforce provenance and ingestion controls: require machine-readable policies for crawling and audit the source and frequency of newly ingested content.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Authors: Haoyang Hu, Zhejun Jiang, Yueming Lyu, Junyuan Zhang, Yi Liu, and Ka-Ho Chow

Retrieval-augmented generation (RAG) is increasingly deployed in real-world applications, where its reference-grounded design makes outputs appear trustworthy. This trust has spurred research on poisoning attacks that craft malicious content, inject it into knowledge sources, and manipulate RAG responses. However, when evaluated in practical RAG systems, existing attacks suffer from severely degraded effectiveness. This gap stems from two overlooked realities: (i) content is often processed before use, which can fragment the poison and weaken its effect, and (ii) users often do not issue the exact queries anticipated during attack design. These factors can lead practitioners to underestimate risks and develop a false sense of security. To better characterize the threat to practical systems, we present Confundo, a learning-to-poison framework that fine-tunes a large language model as a poison generator to achieve high effectiveness, robustness, and stealthiness. Confundo provides a unified framework supporting multiple attack objectives, demonstrated by manipulating factual correctness, inducing biased opinions, and triggering hallucinations. By addressing these overlooked challenges, Confundo consistently outperforms a wide range of purpose-built attacks across datasets and RAG configurations by large margins, even in the presence of defenses. Beyond exposing vulnerabilities, we also present a defensive use case that protects web content from unauthorized incorporation into RAG systems via scraping, with no impact on user experience.

🔍 ShortSpan Analysis of the Paper

Problem

Retrieval augmented generation relies on external knowledge sources to ground responses, which creates an attack surface for poisoning knowledge. While prior studies report high attack effectiveness, their evaluations often ignore real world RAG pipelines where documents are processed, chunked, and indexed, fragmenting poisons and reducing impact. Additionally, users rarely pose exactly the queries assumed during attack design, further eroding practical risk assessments. This paper argues that practical deployments are more vulnerable than theory suggests and calls for robust evaluation and mitigation. It also presents a defensive use case in which poisoned web content is injected to prevent unauthorized incorporation into RAG systems, highlighting guardrails based on provenance and ingestion controls as part of security strategy.

Approach

Confundo is a learning to poison framework that fine tunes a large language model as a poison generator to achieve high effectiveness, robustness and stealthiness in realistic RAG settings. It supports multiple attack objectives including manipulating factual correctness, inducing biased opinions and triggering hallucinations. The framework optimises three components: attack objective optimisation, robustness optimisation for lexical variation and unknown preprocessing, and stealthiness optimisation to produce undetectable poison texts. The attack operates under a threat model where the attacker cannot access the target RAG system but can inject documents that will be ingested into the knowledge base. After ingestion, the RAG system retrieves relevant entries and generates answers conditioned on retrieved content. Confundo shifts from prompt engineering to optimisation by fine tuning a poison generator G_theta using indirect supervisory signals derived from surrogate components. These include surrogate retrievers and a surrogate LLM to evaluate whether the poison text can induce the target misbehaviour. The reward signals combine lexical similarity via a BM25 like measure, semantic similarity through an ensemble of embedding models, and a generation based reward that depends on the chosen objective. The design also includes simulating poison fragmentation by random prefix suffix splits and employing data augmentation to improve lexical robustness and a fluency reward to maintain natural language quality. Training proceeds in a warm up phase to normalise rewards and a fine tuning phase using Group Relative Policy Optimisation. Experimental setup uses a fine tuned Qwen3 0.6b as the poison generator and surrogate components to evaluate across different target RAG configurations and generators, measuring the ability to misbehave under realistic conditions.

Key Findings

Confundo consistently outperforms a range of purpose built attacks across datasets and RAG configurations, maintaining effectiveness under realistic preprocessing and unknown system configurations.
It demonstrates three attack objectives high effectiveness for factual correctness manipulation, strong biasing for opinion manipulation and substantial improvement in hallucination induction compared with baselines.
The method shows robust generalisation across unknown chunk sizes, unknown retrievers and unknown generators, with poison texts retrieved reliably and dominating the input to the generator in diverse RAG setups.
Defences such as perplexity based detection, reranking and paraphrasing reduce effectiveness but do not eliminate Confundo, indicating residual risk and the need for stronger testing and guardrails. The framework also offers a defensive use case to deter unauthorized scraping by injecting poisoned content into web sources while remaining visually indistinguishable to users.

Limitations

The attacker model assumes no internal access to the target system and relies on surrogate retrievers and generators to estimate performance, which may not capture every real world configuration. Evaluation covers specific RAG setups and datasets, and results depend on the alignment between surrogate components and target systems. Defences such as paraphrasing and reranking reduce but do not remove Confundo effectiveness, and transferability to highly diverse or unseen target systems may vary. The work acknowledges gaps between experimental settings and some real world deployments and suggests the need for robust, multi objective defence testing.

Why It Matters

Confundo highlights practical vulnerabilities in retrieval augmented generation and argues that real world deployments are susceptible to multi objective poisoning despite theoretical assurances. By enabling robust, stealthy poisoning across multiple objectives within a single framework, Confundo underscores the need for robust evaluation, provenance based controls and ingestion safeguards as part of security strategy. The defensive use case demonstrates how poison generation can be employed to protect web content from unauthorised incorporation into RAG pipelines, reinforcing the importance of content provenance, machine readable ingestion controls and guardrails in real world deployments.

Attribution Original paper on arXiv