Move privacy controls into RAG retrieval, not prompts

Defenses

Published: Wed, Jan 21, 2026 • By Clara Nyx

Move privacy controls into RAG retrieval, not prompts

SD-RAG moves privacy enforcement out of prompts and into the retrieval stage of Retrieval-Augmented Generation (RAG) systems. It binds natural-language constraints to data chunks in a graph model, sanitises content before it reaches the Large Language Model (LLM), and reports up to a 58% privacy improvement versus prompt-only baselines, while noting synthetic-data and model-size limitations.

Retrieval-Augmented Generation (RAG) systems pair a Large Language Model (LLM) with a retrieval mechanism so the model can cite or use external content. That coupling is powerful and also fragile: most defences rely on telling the LLM not to reveal sensitive content, but prompt-level guards are brittle and famously bypassable with prompt injection. SD-RAG takes a blunt, sensible tack — stop handing sensitive data to the model in the first place.

The paper describes a framework that decouples privacy enforcement from generation by applying sanitisation and disclosure controls during retrieval. The authors build a graph-style index where data chunks and human-readable constraint nodes are linked. Constraints are attached at indexing time to the most similar chunks. At query time the system retrieves both relevant chunks and their candidate constraints, re-ranks constraints, and applies the top K rules to redact or paraphrase content before anything reaches the LLM.

Two redaction modes are supported. Extractive redaction replaces sensitive spans with placeholders. Periphrastic redaction paraphrases or generalises the text to comply with constraints while attempting to preserve meaning. There is also an optional hierarchical summarisation step that groups and generalises content, which can improve privacy but sometimes at the cost of usefulness.

On the experimental side the authors report that SD-RAG gives up to a 58% improvement in their privacy metric compared with a monolithic prompt baseline. They test resilience to prompt injection by running adversarial prompts against the models after retrieval-time sanitisation and find leakage is reduced because the sensitive material simply never reaches the LLM. Experiments use a synthetic, redaction-aware dataset and open-source models (Qwen 2.5 7B and Llama 3 8B), and the paper notes that latency overhead is modest in their setups, with periphrastic redaction sometimes faster than extractive redaction.

That last paragraph contains the caveats you should care about. The evaluation assumes attackers have no prior knowledge of the corpus, does not study multi-turn or iterative de-anonymisation attacks, and is limited to synthetic data and mid-sized open models. The approach also relies on accurate constraint binding at index time; if bindings are noisy or the corpus is poisoned, you can still leak data. In short, the defence is only as good as the indexing and policy attachment process.

What to test

Run retrieval-time sanitisation against realistic data rather than synthetic corpora to measure real privacy-completeness trade-offs.
Simulate multi-turn and iterative de-anonymisation attacks to see whether pre-redaction holds under adaptive probing.
Measure policy maintenance costs and how frequently constraint bindings need refresh when documents or rules change.

SD-RAG does not pretend retrieval-time controls are a silver bullet. But it is a useful corrective: make the data access layer do the heavy lifting for privacy, and treat prompt-based guards as a last line rather than the primary defence. For practitioners that should translate into a clear experiment: prototype retrieval sanitisation, test it against your data and threat models, and be honest about the cost of keeping constraint bindings accurate. If you skip those steps, you are just papering over the same emperor's-new-clothes problem the field keeps selling as security.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

SD-RAG: A Prompt-Injection-Resilient Framework for Selective Disclosure in Retrieval-Augmented Generation

Authors: Aiman Al Masoud, Marco Arazzi, and Antonino Nocera

Retrieval-Augmented Generation (RAG) has attracted significant attention due to its ability to combine the generative capabilities of Large Language Models (LLMs) with knowledge obtained through efficient retrieval mechanisms over large-scale data collections. Currently, the majority of existing approaches overlook the risks associated with exposing sensitive or access-controlled information directly to the generation model. Only a few approaches propose techniques to instruct the generative model to refrain from disclosing sensitive information; however, recent studies have also demonstrated that LLMs remain vulnerable to prompt injection attacks that can override intended behavioral constraints. For these reasons, we propose a novel approach to Selective Disclosure in Retrieval-Augmented Generation, called SD-RAG, which decouples the enforcement of security and privacy constraints from the generation process itself. Rather than relying on prompt-level safeguards, SD-RAG applies sanitization and disclosure controls during the retrieval phase, prior to augmenting the language model's input. Moreover, we introduce a semantic mechanism to allow the ingestion of human-readable dynamic security and privacy constraints together with an optimized graph-based data model that supports fine-grained, policy-aware retrieval. Our experimental evaluation demonstrates the superiority of SD-RAG over baseline existing approaches, achieving up to a $58\%$ improvement in the privacy score, while also showing a strong resilience to prompt injection attacks targeting the generative model.

🔍 ShortSpan Analysis of the Paper

Problem

Retrieval augmented generation combines the generative abilities of large language models with retrieval over large data collections, but existing approaches risk exposing sensitive or access controlled information to the generation model and remain vulnerable to prompt injection attacks that can override safeguards. There is a need to separate security and privacy enforcement from the generation process, and to handle dynamic, human readable privacy constraints for unstructured data. SD-RAG proposes a framework that moves sanitisation and disclosure controls to the retrieval stage, prior to presenting data to the language model, to reduce leakage and attack risk.

Approach

SD-RAG implements selective disclosure by applying sanitisation and access controls during retrieval rather than at prompt time. It uses a graph based data model that represents the data corpus as data nodes and privacy constraints as constraint nodes, with semantic links between them. Constraints are ingested in natural language and attached to the top most similar chunks during indexing, creating a constraint binding that connects constraints to content. At query time the system retrieves candidate constraints for the relevant chunks, ranks them with multiple re ranking methods, and applies the top K constraints to the retrieved chunks before they are fed to the language model. A redaction module then enforces the constraints using a safe prompt, producing a redacted context for generation. Two redaction strategies are supported: extractive redaction replaces sensitive spans with placeholders, while periphrastic redaction paraphrases the content to satisfy constraints while preserving meaning. The architecture also supports optional hierarchical chunk summarisation, which generalises details and groups related content, potentially affecting privacy and usefulness. The evaluation uses a synthetic, redaction aware dataset and novel privacy and completeness metrics, and compares SD RAG with a monolithic prompt baseline on open source models such as Qwen 2.5 7B and Llama 3 8B.

Key Findings

The approach achieves up to 58% improvement in privacy score under optimal configuration compared with baseline methods, demonstrating stronger control over information disclosure.
SD RAG provides resilience to prompt injection attacks targeted at the generative model by removing sensitive content at the retrieval stage, reducing leakage even when prompts are adversarial.
Extractive redaction generally yields higher privacy scores than periphrastic redaction, but may incur slightly lower completeness; periphrastic redaction offers a trade off with faster generation times in some configurations.
Under attack, redacted contexts lead to similar completeness across strategies, indicating that the attacker can still influence results but the retrieved content is less sensitive due to pre redaction.
Latency is affected by the pre redaction stage; periphrastic redaction tends to be faster than extractive in the reported setups, and overall the additional time is modest relative to the baseline on modest hardware.
Hierarchical summarisation provides modest privacy gains but can reduce completeness when summarised content is used, suggesting a trade off between privacy generalisation and information sufficiency.
The evaluation employs a two stage constraint binding and a suite of constraint re ranking strategies, with the best results obtained using an average based re ranking that blends chunk relevance with query similarity.
Tests were conducted on synthetic data and two open source models, highlighting the need for further validation on larger scale or closed systems, but the results indicate model agnostic benefits since retrieval time controls operate independently of the underlying LLM.

Limitations

The study makes several assumptions and has limitations: it assumes attackers have no prior knowledge of the corpus, and does not explore multi turn or iterative de anonymisation attacks; the evaluation uses synthetic data and relatively small open source models, which may not capture real world scale or hidden prompts; there is an implicit trust in the data corpus and potential risk if retrieved chunks are poisoned; the dataset size and constraint sets may limit generalisability; further work is needed to optimise constraint attachment methods, develop cross encoders for scoring constraints, and investigate more advanced redaction or meta prompting techniques; practical deployment will require addressing resource constraints and potential policy changes over time.

Why It Matters

By moving privacy and security controls to the data access layer, SD RAG offers a complementary and potentially stronger defence against prompt injection and data leakage in retrieval augmented generation systems. The graph based representation of data and constraints enables dynamic, human readable privacy policies to be encoded and enforced at retrieval time, reducing dependence on prompt level mitigations that can be bypassed. This approach supports testing and evaluation of defence strategies at the data access layer, allowing comparisons between retrieval time safeguards and traditional prompt based mitigations. The work emphasises privacy and responsible data sharing, addressing risks of unintended disclosure or surveillance, while providing practical mechanisms for policy aware data retrieval and protected generation in AI systems.

Attribution Original paper on arXiv