RAG worm spreads across GenAI applications
Generative AI (GenAI) is moving from novelty to infrastructure. Many applications do not just produce text; they share knowledge via retrieval-augmented generation (RAG), a pattern where a model pulls contextual documents from a vector store and uses them to answer or generate content. That interconnection is convenient. It is also a new attack surface.
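The retrieval half of that loop is simple to sketch. Below is a minimal, illustrative RAG lookup using toy three-dimensional vectors and cosine similarity; a real deployment would use a learned embedding model and a proper vector store, and the document texts here are invented for the example.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, store, k=2):
    # Rank stored (vector, text) pairs by similarity to the query
    # and return the top-k texts as context for the model.
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

# Toy 3-dimensional "embeddings"; the texts are placeholders.
store = [
    ([0.9, 0.1, 0.0], "Q3 revenue summary"),
    ([0.1, 0.9, 0.0], "Holiday schedule"),
    ([0.8, 0.2, 0.1], "Q3 forecast notes"),
]
context = retrieve([1.0, 0.0, 0.0], store)
```

Whatever lands in `store` is fed back to the model as trusted context — which is exactly the property the attack below abuses.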
How Morris-II operates
The paper presents Morris-II, a zero-click worm that lives inside RAG-based pipelines. The authors craft an adversarial self-replicating prompt that combines three elements: a jailbreaking instruction to bypass normal model constraints, replication directions to contaminate RAG stores, and a payload that performs malicious actions such as extracting confidential data. Once planted in an application’s store, the prompt can be retrieved by other clients and cause further contamination, producing a chain reaction across an ecosystem of GenAI-powered email assistants.
The study runs end-to-end experiments using an ecosystem built on a LangChain client, cosine-similarity vector stores and the Enron email dataset. The authors test a real GenAI engine (Gemini 1.5 Flash) and vary context window sizes, embedding algorithms and hop counts. They show propagation via three realistic routes: generating a new email from a subject, enriching an existing email, or replying to an email. For context windows close to twenty emails, combined propagation via generating new messages and enriching content reaches roughly twenty percent in some configurations. Replication and payload execution succeed at better than ninety percent in early hops, but propagation effectiveness falls off after about eleven to twelve hops as non-deterministic model behaviour breaks the chain. Payload quality also degrades as context grows; the worm sometimes hallucinates data such as fake addresses when faced with larger contexts.
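That hop-count fall-off is what you would expect from a chain of independent steps: even a high per-hop success rate compounds downward. A back-of-envelope sketch (the 0.9 per-hop figure is illustrative, not a number taken from the paper):

```python
def chain_survival(p, hops):
    # If each hop succeeds independently with probability p, the chance
    # the chain is still alive after `hops` hops is p**hops. Real
    # propagation is not fully independent per hop, so treat this as a
    # rough intuition pump rather than a model of the experiments.
    return p ** hops

for hops in (1, 6, 12):
    print(hops, round(chain_survival(0.9, hops), 3))
```

At 90 percent per hop, survival is already below a third by hop twelve, which matches the qualitative shape the paper reports.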
Defence and limits
To counter the threat the authors propose Virtual Donkey, a lightweight guardrail that flags suspicious outputs by analysing similarity features between input and output and feeding those into a simple classifier. In the lab this approach achieves a true positive rate of 1.0 and a false positive rate of 0.015, and it generalises in out-of-distribution tests with reported area under curve values between 0.79 and 1.0. The implementation is deliberately low-cost: it avoids extra model inferences and can sit close to the application pipeline.
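The key intuition is that a self-replicating prompt forces the model to echo large spans of its input verbatim. A much-simplified stand-in for that idea (not the actual Virtual Donkey features or classifier, which the paper defines) thresholds a single input–output overlap score:

```python
def ngrams(text, n=8):
    # Character n-grams; short texts fall back to one coarse token.
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def replication_score(model_input, model_output, n=8):
    # Fraction of the output's n-grams that also occur in the input.
    # A self-replicating prompt pushes this towards 1.0, because the
    # model is instructed to copy the prompt into its output.
    out = ngrams(model_output, n)
    if not out:
        return 0.0
    return len(out & ngrams(model_input, n)) / len(out)

def flag(model_input, model_output, threshold=0.6):
    # Stand-in for a trained classifier: one thresholded feature.
    # The threshold here is arbitrary, chosen for the example.
    return replication_score(model_input, model_output) >= threshold
```

A guardrail of this shape needs no extra model inference, which is the cost property the authors emphasise; the price is that a single hand-picked feature is easier for an adaptive attacker to evade than a trained classifier.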
That sounds promising, but be realistic about what the paper delivers. Experiments use curated email corpora (Enron and a Hillary Clinton sample) and a controlled ecosystem topology. The guardrail is not evaluated against adaptive attackers who know how Virtual Donkey works, and the authors explicitly note this limitation. In short, the paper demonstrates a plausible chain-reaction attack and a detection idea that works in the lab. It does not prove the approach is robust in the wild.
For practitioners the takeaway is twofold. First, RAG introduces systemic risks that deserve cross-application threat modelling, prompt auditing and stronger trust boundaries between services. Second, lightweight detection is feasible and worth deploying as part of defence in depth, but expect an arms race: adaptive prompt engineering and embedding choices will change propagation dynamics. The paper maps a clear attack vector and an initial mitigation. The hard work is still designing deployments and governance that survive a determined, adaptive adversary.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications
🔍 ShortSpan Analysis of the Paper
Problem
This work demonstrates a plausible zero-click worm, named Morris II, that can spread across ecosystems of GenAI-powered applications that communicate via retrieval-augmented generation. By injecting adversarial self-replicating prompts into user messages, an attacker can trigger a cascade of indirect prompt injections that cause each affected application to perform malicious actions and to compromise the RAG of other applications, potentially exfiltrating confidential data across the ecosystem. The study emphasises the risk to GenAI ecosystems rather than a single application and motivates the need for guardrails and cross-app trust controls.
Approach
The threat model targets RAG-based, GenAI-powered applications that receive user inputs, actively update their databases, interconnect with other clients, and rely on RAG for inter-application communication. The core mechanism is an adversarial self-replicating prompt composed of a jailbreaking prompt for privilege escalation, a replication instruction and a payload that executes malicious activity within the GenAI inference. The attack uses the prompt to contaminate the RAG database and propagate via three routes: generating a new email from a subject, enriching the content of an existing email, or generating a reply to an email. End-to-end evaluation uses an ecosystem of GenAI-powered email assistants and the Enron dataset, with 20 employees and 2,000 emails in total. The GenAI engine under test is Gemini 1.5 Flash, and the experiments vary the context size, embedding algorithms, the number of propagation hops and the engine type. Retrieval-augmented generation relies on an embedding algorithm, a similarity measure and a vector store to retrieve knowledge and feed it to the GenAI model; a LangChain-based client and a VectorStore with cosine similarity were used. A guardrail named Virtual Donkey is introduced to detect and prevent propagation, with a lightweight design that does not require extra inferences and uses similarity-based features fed into a simple classifier.
Key Findings
- The Morris II worm can trigger a chain reaction in which infected applications perform malicious actions and contaminate additional RAG databases, enabling ecosystem-level RAG poisoning and data exfiltration of confidential user information.
- Propagation is influenced by context size and the embedding algorithm used. Experiments with five embedding algorithms show that the retrieval rate depends on the embedding type and size, with Google text embeddings achieving higher retrieval rates than some alternatives under certain settings.
- Across the three propagation routes, the combined success rate remains substantial, and for a context size of around twenty emails the combined propagation rate via generating a new email and via enriching content reaches about twenty percent, with higher rates observed for certain configurations.
- Replication and payload success stay above ninety percent for early hops, but decline after many hops due to non-deterministic model behaviour; overall propagation effectiveness degrades as the number of hops increases beyond about eleven to twelve.
- Payload evaluation shows that the worm can retrieve sensitive data from the context and include it in generated content, with evaluation of data exfiltration akin to a named entity recognition task. The F1 score for payload quality decreases as context grows, and there are examples of hallucinated data such as fake email addresses when tested against large contexts.
- Virtual Donkey, a guardrail based on analysing similarities between the input and output of a GenAI model, can detect adversarial self-replicating prompts with a perfect true positive rate of 1.0 and a false positive rate of 0.015. It generalises to unseen jailbreaking prompts, different email datasets and various worm use cases, and the authors provide a Python implementation in a public repository.
- Out-of-distribution tests using Hillary Clinton emails and other payload types show that the guardrail maintains high discriminative ability, with AUC values between 0.79 and 1.0, and ROC-based measures remain robust across a wide range of in-distribution and out-of-distribution inputs.
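The NER-style payload scoring mentioned above is easy to reproduce in miniature: set-based precision, recall and F1 over extracted entities. The email addresses below are invented placeholders; the point is that a hallucinated item lowers precision and hence F1, which is the degradation the paper observes at large contexts.

```python
def f1(extracted, ground_truth):
    # Precision/recall/F1 over sets of entities, the usual scoring for
    # NER-style evaluation. Items in `extracted` but not in
    # `ground_truth` (hallucinations) reduce precision; missed items
    # reduce recall.
    ex, gt = set(extracted), set(ground_truth)
    tp = len(ex & gt)
    if tp == 0:
        return 0.0
    precision = tp / len(ex)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)

# One hallucinated address among two correct ones: recall stays 1.0,
# precision drops to 2/3, and F1 lands at 0.8.
score = f1({"alice@corp.example", "bob@corp.example", "fake@nowhere.test"},
           {"alice@corp.example", "bob@corp.example"})
```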
Limitations
The Virtual Donkey guardrail relies on similarity-based input–output analysis and a lightweight classifier, which, while effective in the reported experiments, may be challenged by adaptive attacks and new jailbreaking strategies. The study notes that the guardrail is not tested against adaptive attackers and acknowledges that nothing currently prevents attackers from deploying adaptive adversarial self-replicating prompts in the wild. Experiments are conducted in a lab environment using the Enron and Hillary Clinton email datasets, and results may differ in live ecosystems with different data distributions and network topologies. The authors emphasise that the guardrail is a mitigation that reduces risk but is not a complete guarantee against Morris II.
Why It Matters
The findings highlight the need for strong guardrails, prompt auditing, sandboxing and cross-app trust controls to harden GenAI-enabled ecosystems against chained prompt attacks. The Virtual Donkey guardrail demonstrates a practical approach to detecting and halting propagation with high accuracy and low latency, and the study provides evidence that such guardrails can be integrated into GenAI-powered email clients and similar applications. The practical takeaway is that defensive measures should be designed to be lightweight, scalable and robust to out-of-distribution attacks in order to reduce the ecosystem-level risks posed by Morris II and similar chained prompt threats.