Secure AI Agents Need Dynamic Plans and Policies
Agents
Security folks know this feeling. A new capability arrives, looks helpful, then quietly starts taking instructions from places you did not expect. AI agents powered by Large Language Models (LLMs) now read the web, handle email, and call tools. Indirect prompt injection exploits that curiosity. Malicious text in a web page, message, or third party output slips into an agent’s context and steers it to do something unhelpful or unsafe.
What the paper argues
This position paper takes a system view. The authors argue that we should not expect model fine-tuning or a few text filters to save us. Instead, they sketch an architecture with a clear control loop: an orchestrator plans work, a plan and policy approver checks intent, an executor acts, and a policy enforcer constrains effects. Defence in depth matters: rule-based checks do what they can, learned models help only inside tight boxes, and people stay in the loop when judgement and preference become the issue.
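To make the loop concrete, here is a minimal sketch of those four roles in Python. All names (`Step`, `ALLOWED_TOOLS`, the functions) are illustrative assumptions, not the paper's implementation: the point is simply that approval happens before execution and enforcement happens again at execution time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    tool: str
    args: dict

# Policy: a hypothetical allow-list of tools the agent may call.
ALLOWED_TOOLS = {"search", "read_file"}

def approve(plan: list) -> bool:
    """Plan/policy approver: reject any plan containing a tool outside the allow-list."""
    return all(step.tool in ALLOWED_TOOLS for step in plan)

def enforce(step: Step) -> None:
    """Policy enforcer: last line of defence, checked again at execution time."""
    if step.tool not in ALLOWED_TOOLS:
        raise PermissionError(f"blocked tool: {step.tool}")

def execute(plan: list) -> list:
    """Executor: run approved steps, each one passing through the enforcer."""
    results = []
    for step in plan:
        enforce(step)
        results.append(f"ran {step.tool}")  # stand-in for a real tool call
    return results

plan = [Step("search", {"q": "api docs"}), Step("read_file", {"path": "notes.txt"})]
# A step injected by untrusted content never survives approval:
bad_plan = plan + [Step("send_email", {"to": "attacker@example.com"})]
```

Even in this toy form, the separation pays off: the injected `send_email` step is rejected as a plan-level diff before anything runs, and the enforcer would block it again if approval were somehow bypassed.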
First, dynamic replanning and policy updates are necessary. Real environments change. APIs move. Debugging takes iteration. A static plan or a frozen policy grows stale and brittle. The system should treat replanning and policy evolution as routine, and do so with security context rather than as a blind retry.
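One way to make replanning "routine but security-aware" is to approve the diff between the old and new plan rather than re-approving the whole plan blind. The sketch below is an assumption of mine, not the paper's mechanism; `SENSITIVE` and `approve_replan` are hypothetical names.

```python
def plan_diff(old: list, new: list):
    """Return (added, removed) steps between two plans, as sorted lists."""
    old_set, new_set = set(old), set(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)

# Hypothetical set of step types that always need escalation when newly added.
SENSITIVE = {"send_email", "delete_file"}

def approve_replan(old: list, new: list):
    """Approve a replan automatically unless it adds a sensitive step.

    Returns (approved, steps_needing_review). A benign retry sails through;
    an injected escalation is flagged rather than silently executed.
    """
    added, _removed = plan_diff(old, new)
    needs_review = [step for step in added if step in SENSITIVE]
    return (len(needs_review) == 0), needs_review
```

A retry-style replan (`["search"]` to `["search", "retry_search"]`) is approved with no review, while a replan that quietly adds `send_email` is held for a human or a stricter check.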
Second, some security judgements are contextual and need learned models, but only on narrow, structured inputs with constrained outputs. In other words, do not let a model read arbitrary environmental text and then decide on access or actions. Instead, feed it a structured plan change, or a compact description of a proposed tool call, and ask a bounded question. The paper even points to using models to synthesise deterministic validators, which you then run without exposing the model to fresh untrusted text.
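A bounded adjudicator of this kind might look like the sketch below: the model never sees raw environmental text, only a compact structured description of the proposed call, and its answer must come from a fixed label set or the check fails closed. `ask_model` is a stub standing in for a real LLM call; everything here is an illustrative assumption.

```python
import json

def describe_call(tool: str, args: dict) -> str:
    """Build a compact structured artefact -- no untrusted page content included."""
    return json.dumps({"tool": tool, "arg_keys": sorted(args)}, sort_keys=True)

def ask_model(question: str) -> str:
    """Stand-in for an LLM call; a real system would invoke a model here."""
    return "ALLOW" if '"tool": "search"' in question else "DENY"

def adjudicate(tool: str, args: dict) -> bool:
    """Ask a bounded question about a structured artefact, with constrained output."""
    answer = ask_model(describe_call(tool, args))
    if answer not in {"ALLOW", "DENY"}:  # anything off-label fails closed
        return False
    return answer == "ALLOW"
```

The key property is the interface, not the stub: because the adjudicator's input is a fixed-shape artefact and its output is one of two labels, an attacker's text in a web page has no channel through which to steer the decision.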
Third, language is ambiguous and objectives are messy. There are cases where the system cannot resolve intent automatically. Personalisation and human interaction should be first class design elements, not an afterthought. That makes the inevitable edge cases explicit and accountable.
Why this feels familiar
If you have lived through earlier waves, this rhyme is comforting. We once let documents carry active content that executed wherever it landed. The fix was not a perfect scanner. It was a scaffold of constraints, explicit consent, and smaller, better defined decision points. Over time we learned to limit what untrusted inputs could see, what they could call, and who had to approve the grey areas.
The paper also takes aim at current benchmarks. Many tests use short, static tasks and non-adaptive payloads. They rarely push agents through multi-step jobs that force replanning or policy revision. This flatters both utility and security. In practice, attackers adapt and environments shift. Evaluations should too.
For practitioners, the implications are concrete. Treat the agent’s control loop as a first class security surface. Keep plans and policies explicit and versioned so you can approve diffs, not blobs of text. Use learned models as adjudicators over structured artefacts with narrow prompts and outputs. Where checks must run at speed, consider synthesising deterministic validators from models and then freezing them. For high ambiguity decisions, design the human touchpoint early and let users express preferences that shape enforcement.
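The "synthesise then freeze" idea can be sketched as follows. Imagine a model was asked once, offline, to emit a deterministic validator, here represented by a hand-written regex, which was then reviewed and frozen; at runtime only the frozen validator ever touches untrusted input, and no model is consulted. The pattern and function names are hypothetical.

```python
import re

# Imagine this pattern was produced by a model during development, then
# reviewed by a human and frozen. The model is not consulted at request time.
FROZEN_URL_VALIDATOR = re.compile(r"https://api\.example\.com/v1/[\w/-]+")

def validate_url(url: str) -> bool:
    """Deterministic check that a tool call targets the expected API only."""
    return FROZEN_URL_VALIDATOR.fullmatch(url) is not None
```

This gives you model-level judgement at synthesis time and rule-level speed and auditability at run time, which is exactly the trade the paper is pointing at.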
This is not a counsel of despair. It is a call to move security attention from the model’s raw appetite for text to the system’s bones. By deciding what evidence the model can observe and what choices it can make, you shrink the attack surface and gain clearer explanations when things go wrong. Better benchmarks that include long running tasks and adaptive attacks will help the field separate hopeful demos from robust designs. That rhythm is old, and it is usually how we make progress.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks
🔍 ShortSpan Analysis of the Paper
Problem
The paper examines indirect prompt injection, where malicious instructions embedded in untrusted environmental data (for example retrieved web pages, emails or third-party tool outputs) cause LLM-powered agents to take dangerous actions. This vulnerability is critical as agents gain autonomy and are deployed in higher‑stakes settings. The authors argue that defending agents requires system‑level design choices rather than relying solely on model tuning or simple text filters.
Approach
The authors present a position paper that proposes a system architecture for secure agents built around explicit plans and policies and a control loop comprising an orchestrator, a plan/policy approver, an executor, a policy enforcer and the environment. They advocate a defence‑in‑depth strategy that combines rule‑based checks, constrained use of learned models, and human interaction. The paper sets out three core positions, offers two concrete proposals for using LLMs safely in security decisions, and critiques existing benchmark designs.
Key Findings
- Dynamic replanning and dynamic policy updates are necessary for realistic, long‑running or interactive tasks because static plans or policies break in dynamic environments (for example when APIs change or debugging requires iterative fixes).
- Some context‑dependent security judgements require learned models, but these models must operate only on narrowly scoped, structured inputs and constrained tasks so attackers cannot steer them via raw environmental text.
- Ambiguity in language and objective alignment means certain decisions cannot be resolved algorithmically; human interaction and personalisation should be treated as core design elements for those cases.
- Co‑design of system and model defences is valuable: by constraining what models observe, model‑robustness research can target well‑defined subproblems such as judging plan diffs or synthesising validators rather than arbitrary text filtering.
- Common benchmark evaluations overestimate security and utility because they use short, static tasks, non‑adaptive attack payloads and few multi‑step scenarios that require replanning or policy updates.
Limitations
The paper is a position piece rather than an empirical evaluation; it focuses on general‑purpose agents with full autonomy and on indirect prompt injection only. It assumes system and user prompts are trusted and the environment may be partially compromised. The authors do not cover traditional sandboxing techniques and do not provide a comprehensive survey of prior work.
Why It Matters
System‑level defence design structures agent behaviour and reduces attack surface by deciding what evidence and choices are visible to model or human judges. Practical implications include adopting dynamic, security‑aware replanning and policy evolution; using LLMs only as bounded adjudicators over structured artefacts; synthesising deterministic validators from models rather than exposing models to raw text; and incorporating human oversight and personalisation where intent is ambiguous. These measures improve explainability, enable defence in depth and make attacks more costly and less likely to succeed, guiding research priorities for model robustness and human‑in‑the‑loop interfaces.