KidnapRAG hijacks agentic RAG with black-box poisoning
KidnapRAG shows how to hijack agentic Retrieval-Augmented Generation using only public poisoned documents. The attacker lures the agent's initial retrieval, steers its follow-up queries, then supplies malicious evidence. It beats other black-box attacks across ReAct and WebThinker setups and is flagged less often by moderation, raising reliability risks for deployed LLM agents.
Agentic Retrieval-Augmented Generation (RAG) was meant to shrug off simple poisoning. It retrieves, reasons, and retrieves again, which usually filters out weakly relevant junk. KidnapRAG shows you can still take the wheel from the outside. No prompt access, no model weights, just public pages that get indexed and a view of whatever sub-queries or sources the agent exposes.
In one sentence: if you can influence what an agent retrieves at each step, you can bend its reasoning and final answer.
How the hijack works
The attack runs as a sequence with three document roles. First, a Bait document is crafted to look relevant to the original query and to nudge the agent toward a follow-up query in a rare domain. That rare domain matters because it tilts later retrieval away from the genuine path. Next come Chain-Link documents that keep the agent asking the “right” follow-ups, so each step pulls in more attacker-authored content. Finally, a Mal-Ins document provides the evidence the attacker wants the agent to rely on for its conclusion.
The setup is black-box. The attacker probes the agent with a normal query, watches any exposed intermediate signals like sub-queries or retrieved snippets, then publishes the Bait, Chain-Link, and Mal-Ins pages optimised for what the agent is likely to ask next. As the agent iterates, “chain dragging” kicks in: the more consecutive steps retrieve Chain-Link pages, the weaker the original intent becomes and the stronger the pull toward the attacker’s target.
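The sequence described above can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not the paper's implementation: token overlap stands in for a dense retriever, each document's `follow_up` field stands in for the sub-query an LLM agent might form after reading it, and the rare term "zorblax" and all document texts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    text: str       # what the toy retriever indexes
    follow_up: str  # assumed stand-in for the agent's next sub-query
    role: str       # "clean", "bait", "chain-link", or "mal-ins"


def tokens(s: str) -> set:
    return set(s.lower().split())


def score(query: str, doc: Doc) -> float:
    """Jaccard token overlap, a crude proxy for retriever relevance."""
    q, d = tokens(query), tokens(doc.text)
    return len(q & d) / len(q | d) if (q | d) else 0.0


def run_agent(query: str, corpus: list, steps: int = 3) -> list:
    """Retrieve top-1, adopt that doc's follow-up as the next query, repeat."""
    trace = []
    for _ in range(steps):
        doc = max(corpus, key=lambda d: score(query, d))
        trace.append(doc)
        if not doc.follow_up:
            break
        query = doc.follow_up
    return trace


QUERY = "what is the capital of france"

CLEAN = [
    Doc("capital of france is paris", "population of paris", "clean"),
    Doc("population of paris is about two million", "", "clean"),
]

# Hypothetical attacker pages: the bait mirrors the query and introduces a
# rare term; later pages only need to win retrieval for that rare term.
POISON = [
    Doc("what is the capital of france zorblax treaty",
        "zorblax treaty details", "bait"),
    Doc("zorblax treaty details signed in lyon",
        "zorblax treaty outcome", "chain-link"),
    Doc("zorblax treaty outcome moved capital to lyon", "", "mal-ins"),
]

clean_roles = [d.role for d in run_agent(QUERY, CLEAN)]
attacked_roles = [d.role for d in run_agent(QUERY, CLEAN + POISON)]
print(clean_roles)     # ['clean', 'clean']
print(attacked_roles)  # ['bait', 'chain-link', 'mal-ins']
```

In this toy, once the bait wins the first retrieval, the clean pages never score above zero again because the follow-up queries share no vocabulary with them. That is the intuition behind steering into a rare domain; real retrievers and agents are far less deterministic.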
Why this bites
Across two agent frameworks (ReAct and WebThinker), several Large Language Model (LLM) backbones, and three multi-hop QA datasets (HotpotQA, MuSiQue, 2WikiMultihopQA), KidnapRAG outperforms seven black-box poisoning baselines. Single-shot poisoning rarely works against agentic systems; the chained approach does. The components are complementary too: remove any piece and effectiveness drops, and without the Mal-Ins page the attack success rate falls to zero. Merging everything into one page also substantially lowers success, which fits the idea that multi-step steering beats a blunt hit.
Two details will worry operators. First, reasoning-path metrics show the agent progressively diverging from the clean route and coming to prefer the attacker’s answer. Second, an OpenAI Moderation check on the cumulative reasoning trace flags the chained attack less often than more overt one-shot attacks. Sequential nudges look normal.
There are limits. Success depends on the attacker’s pages getting indexed and ranked, and on how much of the agent’s chain is visible. Source filtering, domain authority, and platform policy can blunt retrieval. The work does not cover planners, verifiers, multi-agent setups, or tight retrieval–verification loops. The open question is whether those designs can break the chain-dragging pattern without killing utility.
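As one illustration of what source filtering might look like in practice, the sketch below flags a retrieval trace when several consecutive steps pull from domains outside an allow-list. This is a hypothetical heuristic, not a defence evaluated in the paper: the trusted-domain set, the streak threshold, and the function names are all assumptions, and whether such a check disrupts chain dragging without hurting utility remains open.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; a real deployment would curate its own.
TRUSTED_DOMAINS = {"wikipedia.org", "example.edu"}


def domain_of(url: str) -> str:
    """Naive registrable domain: the last two host labels.

    This is an assumption for brevity; it mishandles suffixes such as
    'co.uk', where a public-suffix list would be needed.
    """
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def longest_untrusted_streak(urls, trusted=TRUSTED_DOMAINS) -> int:
    """Length of the longest run of consecutive untrusted sources."""
    longest = current = 0
    for url in urls:
        if domain_of(url) in trusted:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def should_escalate(urls, max_streak: int = 2) -> bool:
    """Flag the trace for review once the streak reaches the threshold."""
    return longest_untrusted_streak(urls) >= max_streak


suspicious_trace = [
    "https://en.wikipedia.org/wiki/Paris",
    "https://blog.unknown-site.test/a",
    "https://blog.unknown-site.test/b",
    "https://other.test/c",
]
clean_trace = [
    "https://en.wikipedia.org/wiki/Paris",
    "https://en.wikipedia.org/wiki/France",
]
print(should_escalate(suspicious_trace), should_escalate(clean_trace))
```

A streak-based check targets the sequential pattern rather than any single page, which matters because each individual nudge can look benign. It would also flag legitimate deep dives into niche sources, so the threshold trades false positives against coverage.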
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
KidnapRAG: A Black-Box Attack for Hijacking Reasoning in Agentic Retrieval-Augmented Generation Systems
🔍 ShortSpan Analysis of the Paper
Problem
The paper examines whether agentic Retrieval-Augmented Generation systems can be poisoned by an attacker who has only black-box capabilities: the ability to publish documents that may be retrieved and to observe the agent's public reasoning outputs such as intermediate thoughts, generated search queries, retrieved contents and source information. This matters because Agentic RAG systems perform iterative retrieval and reasoning to resist naive poisoning, but exposed intermediate signals may enable attackers to hijack the multi-step reasoning chain and cause harmful or misleading final outputs even without access to prompts, retrievers or model internals.
Approach
The authors propose KidnapRAG, a sequential poisoning strategy that uses three role-specific documents to redirect and sustain an agent's reasoning path: Bait documents attract initial retrieval and induce a rare-domain follow-up query, Chain-Link documents sustain the redirected chain across steps, and Mal-Ins documents supply the attacker-controlled evidence that induces the final target response. The threat model assumes no internal access; the attacker first profiles exposed reasoning by submitting a query, then uploads crafted documents optimised for the observed sub-queries. Experiments use three multi-hop QA datasets (HotpotQA, MuSiQue and 2WikiMultihopQA), two agentic frameworks (ReAct and WebThinker) with multiple LLM backbones, and a fixed retriever, e5-large-v2. Attack effectiveness is measured by Exact Match and Attack Success Rate, plus diagnostic metrics that quantify reasoning-path divergence, target redirection, and answer preference shifts.
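To make the two headline metrics concrete, here is a minimal sketch of how Exact Match and Attack Success Rate are commonly computed. The paper's exact definitions are not given in this summary, so the details are assumptions: the normalisation follows the familiar SQuAD-style convention, and an attack is counted as successful when the normalised attacker target appears in the normalised output. The sample answers are invented.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predictions, golds) -> float:
    """Fraction of predictions equal to the gold answer after normalisation."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return hits / len(golds)


def attack_success_rate(predictions, targets) -> float:
    """Fraction of predictions containing the attacker's target answer.

    Substring matching is an assumption made for this sketch.
    """
    hits = sum(normalize(t) in normalize(p) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Invented outputs from a hypothetical attacked agent.
predictions = ["Lyon", "The Eiffel Tower.", "paris"]
golds = ["Paris", "Eiffel Tower", "Paris"]
targets = ["Lyon", "Big Ben", "Lyon"]

em = exact_match(predictions, golds)
asr = attack_success_rate(predictions, targets)
print(f"EM={em:.3f} ASR={asr:.3f}")
```

Under an effective attack the two numbers move in opposite directions: Exact Match against the true answers falls while Attack Success Rate against the attacker's targets rises.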
Key Findings
- KidnapRAG consistently outperforms seven black-box poisoning baselines and causes the largest performance degradation across combinations of agent frameworks, LLM backbones and datasets, indicating single-step poisoning is insufficient against Agentic RAG.
- The three-document design is complementary: removing the Bait, Chain-Link or Mal-Ins components reduces effectiveness; in particular removing the Mal-Ins document drops attack success rate to zero, while merging all components into one document substantially lowers success.
- Chain Dragging, the sequential retrieval of Chain-Link documents, increases attack success as the reasoning chain is progressively redirected; gains can saturate depending on the agent, but sustained multi-step guidance is key to success.
- Reasoning-chain analyses show KidnapRAG both increases divergence from the clean reasoning path and shifts semantic direction toward the attacker target, with models' answer preference moving toward the attacker-intended output across steps.
- The chained design is less detectable by an OpenAI Moderation check on cumulative reasoning traces, yielding lower guard rates than more overt single-shot attacks.
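The divergence and redirection findings can be pictured with a small calculation. The sketch below is an assumed formulation, not the paper's metric definitions: it supposes each reasoning step has been embedded as a vector by some sentence encoder, then reports per-step cosine distance between the clean and attacked traces and per-step cosine similarity between the attacked trace and the attacker's target. The two-dimensional vectors are toy values chosen for illustration.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def path_divergence(clean_steps, attacked_steps):
    """Per-step cosine distance between clean and attacked reasoning."""
    return [1.0 - cosine(c, a) for c, a in zip(clean_steps, attacked_steps)]


def target_alignment(attacked_steps, target):
    """Per-step cosine similarity between attacked reasoning and the target."""
    return [cosine(a, target) for a in attacked_steps]


# Toy embeddings: axis 0 is the clean topic, axis 1 the attacker's topic.
clean = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
attacked = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
target = (0.0, 1.0)

divergence = path_divergence(clean, attacked)
alignment = target_alignment(attacked, target)
print([round(x, 3) for x in divergence])
print([round(x, 3) for x in alignment])
```

In this toy trace both series rise step by step, matching the qualitative pattern reported above: the attacked chain drifts away from the clean path while moving toward the attacker's target.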
Limitations
Effectiveness depends on availability and granularity of exposed reasoning traces and on whether attacker-published content is indexed and ranked by the retriever; indexing latency, source filtering, domain authority and platform policies may reduce retrieval success. Experiments cover two representative agent architectures and specific LLM backbones and do not exhaustively evaluate systems with planners, verifiers, multi-agent setups or integrated retrieval-verification defences.
Implications
An attacker who can publish public documents and observe an agent's intermediate queries can steer an Agentic RAG system to produce attacker-intended outputs, including harmful instructions or security-sensitive commands. By sequentially redirecting retrieval intent into a rare-domain chain, adversaries can make malicious evidence appear plausible and evade simple moderation checks. The attack gives red teams a reproducible test, but it also means deployed agentic RAG services used in high-impact domains could be stealthily compromised without internal access.