Limit Agent Input to Prevent Prompt Injections
Agents
A Large Language Model (LLM) is a machine learning model trained to generate and understand text. A prompt injection is an attack that hides instructions inside untrusted input so the model follows an unintended task.
The paper presents type directed privilege separation as a system level defence for LLM powered agents. The design splits the agent into two roles. A quarantined agent processes raw, untrusted content and converts it into a curated set of data types. A privileged agent performs actions and decision making but only receives those typed values, not freeform text.
Practically, the quarantined agent emits simple types such as integers, floating point values, booleans and a constrained set of string literals. The privileged agent accepts only those types, revalidates them, and acts within a constrained context. The quarantined agent can also set opaque variables that the user can read but the privileged agent cannot interpret as instructions.
Key results and limits
In the authors' case studies the approach prevents prompt injections in online shopping and calendar scheduling scenarios while keeping utility high. In a bug fixing agent the defence removes injection risk entirely but reduces task utility dramatically, from 49.7 to 14.6 in the paper's measurements, illustrating a safety versus functionality trade off for context heavy tasks.
The method is model agnostic and does not rely on detectors or fine tuning, which the authors argue are brittle against adaptive attackers and incompatible with some state of the art models. That said, the approach depends on being able to design useful typed interfaces; where an agent needs rich, unrestricted language context the defence imposes substantial cost. It also does not solve other threats such as misinformation or social engineering that use legitimate data to deceive.
Minimal controls and deployment notes
Do not treat this as a magic wand. The core change is an interface redesign: move parsing and interpretation into a hardened, quarantined component and only allow typed outputs across the trust boundary. Add a small revalidation step in the privileged agent before any action. Log every typed value crossing the boundary so you can audit malicious patterns.
- Good: Enforce integers, floats and booleans for high risk actions, reject free text.
- Better: Add fixed enumerations for choices and opaque handles for large content accessible only by the user.
- Best: Combine typed interface with strict revalidation and provenance checks plus runtime rate limits for risky operations.
Checklist for a minimal viable control set: implement a quarantined parser, define a small typed schema for all inputs that can trigger actions, add revalidation in the privileged agent, and record crossings for audit. Expect to iterate: you will need to refine types and enumerations until the agent remains useful and safe.
Bottom line: if you build automated agents that act on external text, removing freeform strings from the decision path is one of the few reliable, model independent ways to eliminate prompt injections. It will force design trade offs, but that is preferable to chasing detectors that fail under adaptive attack.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Better Privilege Separation for Agents by Restricting Data Types
🔍 ShortSpan Analysis of the Paper
Problem
Prompt injections threaten LLM powered agents by allowing adversaries to subvert intended behaviour through untrusted input. Detectors and fine tuning have limitations against adaptive attacks and are not universally compatible with state of the art models. The paper presents type directed privilege separation as a system level defence that converts untrusted content into a curated set of data types, thereby removing the possibility for prompt injections and enabling more robust, model agnostic protection for AI agents operating in real world automation.
Approach
The method builds on the idea of privilege separation by splitting an application into two sub agents: a quarantined agent (Q Agent) that processes untrusted data, and a privileged agent (P Agent) that handles action selection. Data can flow from the quarantined to the privileged agent only if it belongs to a carefully chosen set of data types, including integers, floating point values, booleans and a restricted set of string literals (multiple choice). Freeform text is not allowed to flow to the privileged agent, preventing injected commands from influencing actions. The quarantined agent may set opaque variables, which are readable by the user but not by the privileged agent. The privileged agent re validates the payload and uses it to operate within a constrained context. The approach is designed to be model agnostic and compatible with any LLM, whether proprietary or open weights. It is evaluated across three case studies—online shopping, calendar invitation scheduling, and a bug fixing agent—each implemented with a conventional and a type directed privilege separated design. Attack scenarios target untrusted inputs such as user reviews, email threads and issue descriptions, and the evaluation measures both security and utility with multiple trials.
Key Findings
- Type directed privilege separation prevents prompt injection attacks while preserving utility in the online shopping and calendar scheduling case studies; attack success rate is reduced to zero in these scenarios.
- The approach enables information flow that the standard Dual LLM pattern cannot accommodate, enabling protected operations that rely on limited context such as bug report specifics or calendar details while remaining safe from prompt injections.
- In the bug fixing case study, the defense eliminates prompt injection but incurs a substantial utility cost, with utility dropping from 49.7 to 14.6 and an attack surface reduction to zero percent, illustrating a trade off between safety and task performance in context heavy tasks.
Limitations
While offering strong protection, type directed privilege separation can reduce utility in tasks that rely heavily on unrestricted natural language context. The approach depends on the assumption that the majority of restricted input such as reviews or emails is not manipulated, and its effectiveness can be sensitive to data quality. There are also design challenges in identifying suitable inter agent interfaces that preserve sufficient functionality, and the method does not address all forms of manipulation such as misinformation or deception beyond prompt injections. The authors acknowledge these limitations and indicate directions for future work to improve utility while maintaining security.
Why It Matters
The method provides a practical, deployable defence for AI agents in real world automation by offering a model agnostic, security by design solution that does not rely on model providers or retrofits to new models. By limiting untrusted input to a curated set of data types and enabling safe context transfer, the approach helps reduce the risk of prompt based manipulation in critical systems such as automation, monitoring and decision making, thereby enhancing societal and security outcomes in AI powered environments.