Secure AI Agents Need Dynamic Plans and Policies
Agents
Security folks know this feeling. A new capability arrives, looks helpful, then quietly starts taking instructions from places you did not expect. AI agents powered by Large Language Models (LLMs) now read the web, handle email, and call tools. Indirect prompt injection exploits that curiosity. Malicious text in a web page, message, or third party output slips into an agent’s context and steers it to do something unhelpful or unsafe.
What the paper argues
This position paper takes a system view. The authors argue that we should not expect model fine-tuning or a few text filters to save us. Instead, they sketch an architecture with a clear control loop: an orchestrator plans work, a plan and policy approver checks intent, an executor acts, and a policy enforcer constrains effects. Defence in depth matters: rule-based checks do what they can, learned models help only inside tight boxes, and people stay in the loop when judgement and preference become the issue.
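To make the loop concrete, here is a minimal sketch of those four roles in Python. All names (`Step`, `ALLOWED_TOOLS`, the functions) are illustrative assumptions, not the paper's implementation: the point is simply that approval happens before execution and enforcement happens again at execution time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    tool: str
    args: dict

# Policy: a hypothetical allow-list of tools the agent may call.
ALLOWED_TOOLS = {"search", "read_file"}

def approve(plan: list) -> bool:
    """Plan/policy approver: reject any plan containing a tool outside the allow-list."""
    return all(step.tool in ALLOWED_TOOLS for step in plan)

def enforce(step: Step) -> None:
    """Policy enforcer: last line of defence, checked again at execution time."""
    if step.tool not in ALLOWED_TOOLS:
        raise PermissionError(f"blocked tool: {step.tool}")

def execute(plan: list) -> list:
    """Executor: run approved steps, each one passing through the enforcer."""
    results = []
    for step in plan:
        enforce(step)
        results.append(f"ran {step.tool}")  # stand-in for a real tool call
    return results

plan = [Step("search", {"q": "api docs"}), Step("read_file", {"path": "notes.txt"})]
# A step injected by untrusted content never survives approval:
bad_plan = plan + [Step("send_email", {"to": "attacker@example.com"})]
```

Even in this toy form, the separation pays off: the injected `send_email` step is rejected as a plan-level diff before anything runs, and the enforcer would block it again if approval were somehow bypassed.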
First, dynamic replanning and policy updates are necessary. Real environments change. APIs move. Debugging takes iteration. A static plan or a frozen policy grows stale and brittle. The system should treat replanning and policy evolution as routine, and do so with security context rather than as a blind retry.
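One way to make replanning "routine but security-aware" is to approve the diff between the old and new plan rather than re-approving the whole plan blind. The sketch below is an assumption of mine, not the paper's mechanism; `SENSITIVE` and `approve_replan` are hypothetical names.

```python
def plan_diff(old: list, new: list):
    """Return (added, removed) steps between two plans, as sorted lists."""
    old_set, new_set = set(old), set(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)

# Hypothetical set of step types that always need escalation when newly added.
SENSITIVE = {"send_email", "delete_file"}

def approve_replan(old: list, new: list):
    """Approve a replan automatically unless it adds a sensitive step.

    Returns (approved, steps_needing_review). A benign retry sails through;
    an injected escalation is flagged rather than silently executed.
    """
    added, _removed = plan_diff(old, new)
    needs_review = [step for step in added if step in SENSITIVE]
    return (len(needs_review) == 0), needs_review
```

A retry-style replan (`["search"]` to `["search", "retry_search"]`) is approved with no review, while a replan that quietly adds `send_email` is held for a human or a stricter check.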
Second, some security judgements are contextual and need learned models, but only on narrow, structured inputs with constrained outputs. In other words, do not let a model read arbitrary environmental text and then decide on access or actions. Instead, feed it a structured plan change, or a compact description of a proposed tool call, and ask a bounded question. The paper even points to using models to synthesise deterministic validators, which you then run without exposing the model to fresh untrusted text.
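A bounded adjudicator of this kind might look like the sketch below: the model never sees raw environmental text, only a compact structured description of the proposed call, and its answer must come from a fixed label set or the check fails closed. `ask_model` is a stub standing in for a real LLM call; everything here is an illustrative assumption.

```python
import json

def describe_call(tool: str, args: dict) -> str:
    """Build a compact structured artefact -- no untrusted page content included."""
    return json.dumps({"tool": tool, "arg_keys": sorted(args)}, sort_keys=True)

def ask_model(question: str) -> str:
    """Stand-in for an LLM call; a real system would invoke a model here."""
    return "ALLOW" if '"tool": "search"' in question else "DENY"

def adjudicate(tool: str, args: dict) -> bool:
    """Ask a bounded question about a structured artefact, with constrained output."""
    answer = ask_model(describe_call(tool, args))
    if answer not in {"ALLOW", "DENY"}:  # anything off-label fails closed
        return False
    return answer == "ALLOW"
```

The key property is the interface, not the stub: because the adjudicator's input is a fixed-shape artefact and its output is one of two labels, an attacker's text in a web page has no channel through which to steer the decision.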
Third, language is ambiguous and objectives are messy. There are cases where the system cannot resolve intent automatically. Personalisation and human interaction should be first class design elements, not an afterthought. That makes the inevitable edge cases explicit and accountable.
Why this feels familiar
If you have lived through earlier waves, this rhyme is comforting. We once let documents carry active content that executed wherever it landed. The fix was not a perfect scanner. It was a scaffold of constraints, explicit consent, and smaller, better defined decision points. Over time we learned to limit what untrusted inputs could see, what they could call, and who had to approve the grey areas.
The paper also takes aim at current benchmarks. Many tests use short, static tasks and non-adaptive payloads. They rarely push agents through multi-step jobs that force replanning or policy revision. This flatters both utility and security. In practice, attackers adapt and environments shift. Evaluations should too.
For practitioners, the implications are concrete. Treat the agent’s control loop as a first class security surface. Keep plans and policies explicit and versioned so you can approve diffs, not blobs of text. Use learned models as adjudicators over structured artefacts with narrow prompts and outputs. Where checks must run at speed, consider synthesising deterministic validators from models and then freezing them. For high ambiguity decisions, design the human touchpoint early and let users express preferences that shape enforcement.
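The "synthesise then freeze" idea can be sketched as follows. Imagine a model was asked once, offline, to emit a deterministic validator, here represented by a hand-written regex, which was then reviewed and frozen; at runtime only the frozen validator ever touches untrusted input, and no model is consulted. The pattern and function names are hypothetical.

```python
import re

# Imagine this pattern was produced by a model during development, then
# reviewed by a human and frozen. The model is not consulted at request time.
FROZEN_URL_VALIDATOR = re.compile(r"https://api\.example\.com/v1/[\w/-]+")

def validate_url(url: str) -> bool:
    """Deterministic check that a tool call targets the expected API only."""
    return FROZEN_URL_VALIDATOR.fullmatch(url) is not None
```

This gives you model-level judgement at synthesis time and rule-level speed and auditability at run time, which is exactly the trade the paper is pointing at.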
This is not a counsel of despair. It is a call to move security attention from the model’s raw appetite for text to the system’s bones. By deciding what evidence the model can observe and what choices it can make, you shrink the attack surface and gain clearer explanations when things go wrong. Better benchmarks that include long running tasks and adaptive attacks will help the field separate hopeful demos from robust designs. That rhythm is old, and it is usually how we make progress.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks
🔍 ShortSpan Analysis of the Paper
Problem
The paper examines indirect prompt injection, where malicious instructions embedded in untrusted environmental data (for example retrieved web pages, emails or third-party tool outputs) cause LLM-powered agents to take dangerous actions. This vulnerability is critical as agents gain autonomy and are deployed in higher‑stakes settings. The authors argue that defending agents requires system‑level design choices rather than relying solely on model tuning or simple text filters.
Approach
The authors present a position paper that proposes a system architecture for secure agents built around explicit plans and policies and a control loop comprising an orchestrator, a plan/policy approver, an executor, a policy enforcer and the environment. They advocate a defence‑in‑depth strategy that combines rule‑based checks, constrained use of learned models, and human interaction. The paper sets out three core positions, offers two concrete proposals for using LLMs safely in security decisions, and critiques existing benchmark designs.
Key Findings
- Dynamic replanning and dynamic policy updates are necessary for realistic, long‑running or interactive tasks because static plans or policies break in dynamic environments (for example when APIs change or debugging requires iterative fixes).
- Some context‑dependent security judgements require learned models, but these models must operate only on narrowly scoped, structured inputs and constrained tasks so attackers cannot steer them via raw environmental text.
- Ambiguity in language and objective alignment means certain decisions cannot be resolved algorithmically; human interaction and personalisation should be treated as core design elements for those cases.
- Co‑design of system and model defences is valuable: by constraining what models observe, model‑robustness research can target well‑defined subproblems such as judging plan diffs or synthesising validators rather than arbitrary text filtering.
- Common benchmark evaluations overestimate security and utility because they use short, static tasks, non‑adaptive attack payloads and few multi‑step scenarios that require replanning or policy updates.
Limitations
The paper is a position piece rather than an empirical evaluation; it focuses on general‑purpose agents with full autonomy and on indirect prompt injection only. It assumes system and user prompts are trusted and the environment may be partially compromised. The authors do not cover traditional sandboxing techniques and do not provide a comprehensive survey of prior work.
Why It Matters
System‑level defence design structures agent behaviour and reduces attack surface by deciding what evidence and choices are visible to model or human judges. Practical implications include adopting dynamic, security‑aware replanning and policy evolution; using LLMs only as bounded adjudicators over structured artefacts; synthesising deterministic validators from models rather than exposing models to raw text; and incorporating human oversight and personalisation where intent is ambiguous. These measures improve explainability, enable defence in depth and make attacks more costly and less likely to succeed, guiding research priorities for model robustness and human‑in‑the‑loop interfaces.