Study Exposes Prompt Injection Risks for LLM Agents
Large Language Models (LLMs) are shifting from single‑query assistants to autonomous agents that call tools, observe environments and make decisions over time. That shift matters for security because agents mix untrusted runtime inputs with trusted prompts and memory, creating fresh opportunities for prompt injection (PI) attacks that steer behaviour or leak data.
What the paper covers
The paper is a Systematization of Knowledge (SoK) that surveys 78 papers on prompt injection attacks and defences. It organises the literature along two clear axes. On the attack side, the authors classify payloads by generation strategy: heuristic methods (templates, memory exploits, structural encodings) versus optimisation approaches (gradient or evolutionary search, sampling strategies) that craft stealthy payloads. On the defence side, interventions fall into three stages: text level (filters, detection, prompt hardening), model level (alignment, internal detectors) and execution level (policy enforcement, information flow control, isolation).
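For readers who want to map their own agents against these axes, the two taxonomies reduce to a small set of labels. The following is a minimal, illustrative encoding in Python; the enum names and descriptions paraphrase the survey's categories and are not an official schema from the authors.

```python
# Illustrative encoding of the survey's two taxonomy axes as Python enums.
# Names and descriptions paraphrase the paper's categories; they are not an
# official schema from the authors.
from enum import Enum

class PayloadGeneration(Enum):
    HEURISTIC = "templates, memory exploits, structural encodings"
    OPTIMISATION = "gradient or evolutionary search, sampling strategies"

class DefenceStage(Enum):
    TEXT_LEVEL = "filters, detection, prompt hardening"
    MODEL_LEVEL = "alignment, internal detectors"
    EXECUTION_LEVEL = "policy enforcement, information flow control, isolation"
```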
Crucially, the authors argue that common evaluations over‑index on static benchmarks. Many defences perform well in those settings by suppressing or filtering contextual inputs, but that strategy breaks down when agents must reason about runtime observations. To test this, they introduce AgentPI, a new benchmark that exercises five context‑dependent tasks across 66 tools in four domains and measures attack success rate, agent utility and computational cost in time and tokens.
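To make that measurement concrete, here is a minimal sketch of how the three AgentPI-style metrics could be aggregated from per-episode records. The Trial fields and the simple averaging are illustrative assumptions, not the benchmark's exact definitions.

```python
# Minimal sketch of aggregating the three metrics described above: attack
# success rate, agent utility, and computational cost (time and tokens).
# The Trial fields and the simple averaging are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Trial:
    attack_succeeded: bool  # did the injected instruction take effect?
    task_completed: bool    # did the agent still finish the legitimate task?
    latency_s: float        # wall-clock time for the episode
    tokens_used: int        # prompt plus completion tokens consumed

def summarise(trials: list[Trial]) -> dict[str, float]:
    n = len(trials)
    return {
        "attack_success_rate": sum(t.attack_succeeded for t in trials) / n,
        "utility": sum(t.task_completed for t in trials) / n,
        "avg_latency_s": sum(t.latency_s for t in trials) / n,
        "avg_tokens": sum(t.tokens_used for t in trials) / n,
    }
```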
Key findings and trade-offs
AgentPI exposes a practical gap: no single defence simultaneously delivers high trustworthiness, high utility and low latency. Execution‑level defences offer stronger guarantees but can triple computational cost or cause early refusals that kill utility. Text‑level approaches are cheap but provide weaker integrity protections and often fail to distinguish malicious payloads from legitimate contextual content. The paper also documents a shift in attack techniques from explicit, visible injections to context‑driven and stealthy payloads better suited to black‑box agent settings.
The authors distil nine takeaways and highlight open problems rather than claiming a complete solution. Prominent research directions include fine‑grained attention or information flow control inside agent memory, policy‑based enforcement that reasons about intent and context, and hybrid human–AI arbitration when high‑stakes decisions require external oversight.
For security teams and architects, the practical lesson is simple and uncomfortable: current defences can give a false sense of safety if you only test on static prompts. If your agents must use runtime data to decide actions, you need layered protections, realistic benchmarks such as AgentPI, and measurement of utility alongside attack success rate and latency. That will not eliminate hard trade-offs, but it focuses effort on the settings where these agents will actually be used.
The work is a useful roadmap. It does not pretend that a single silver bullet exists, but it clarifies where research and engineering must go next to make LLM agents safer in real‑world, context‑rich settings.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
The Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis
🔍 ShortSpan Analysis of the Paper
Problem
The paper surveys prompt injection (PI) threats in large language model (LLM) agents, focusing on how untrusted inputs can hijack agent behaviour and compromise autonomy. It presents a Systematization of Knowledge (SoK) that covers attacks, defences and their evaluation practices, and introduces AgentPI, a benchmark for testing agent behaviour under context-dependent interaction settings. A key finding is that many existing defences and benchmarks largely overlook context-dependent tasks where agents rely on runtime environmental observations to decide actions, creating a gap between laboratory tests and real-world risk.
Approach
Using a systematic literature review and quantitative analysis, the authors build taxonomies that classify PI attacks by payload generation strategy (heuristic versus optimisation) and classify defences by intervention level (text level, model level and execution level). They examine 78 papers on attacks and defences, and empirically evaluate representative defences with AgentPI across five context-dependent tasks and 66 tools spanning four domains. The evaluation uses a multi-dimensional metric system that measures attack success rate, agent utility and computational cost in time and tokens. The work also details an agent execution loop with six steps, including input prompts, reasoning, tool calls and tool observations, illustrating how untrusted inputs can merge with trusted prompts in a single context memory.
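The sketch below illustrates such a loop and the point where trust boundaries blur. The llm and call_tool callables are hypothetical stand-ins and the message format is an assumption, but the key detail matches the paper's description: tool observations are appended verbatim to the same context memory that holds the trusted system prompt.

```python
# Illustrative agent loop, assuming hypothetical `llm` and `call_tool`
# callables supplied by the caller. It shows how untrusted tool observations
# are appended to the same context memory as the trusted system prompt.
def run_agent(llm, call_tool, system_prompt: str, user_task: str, max_steps: int = 6):
    context = [
        {"role": "system", "content": system_prompt},  # trusted
        {"role": "user", "content": user_task},        # trusted or semi-trusted
    ]
    for _ in range(max_steps):
        step = llm(context)  # reasoning step; may request a tool call
        context.append({"role": "assistant", "content": step["text"]})
        if step.get("tool_call") is None:
            return step["text"]                        # final answer
        observation = call_tool(step["tool_call"])     # UNTRUSTED runtime input
        # Any instructions embedded in the observation (an indirect prompt
        # injection) now sit in the same context as the trusted prompt.
        context.append({"role": "tool", "content": str(observation)})
    return None
```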
Key Findings
- Attack taxonomy reveals a shift from practical white-box base-LLM settings to black-box LLM-agent settings, with attacks evolving from explicit prompts to context-driven and invisible payloads. Attacks are categorised as Direct Prompt Injection, Indirect Prompt Injection and Supply-chain Prompt Injection, and payload generation methods split into heuristic and optimisation-based approaches. Heuristic methods include manual templates, memory-based hijacks and structural encodings; optimisation-based methods use gradient-based, evolutionary or sampling strategies to create stealthy payloads.
- Defence taxonomy groups approaches by intervention stage into text level, model level and execution level, with sub-strategies such as detection filters, prompt enhancements and input-output separation at the text level; model alignment and IR-based detectors at the model level; and task alignment, information flow control, policy enforcement and isolation at the execution level (a minimal sketch contrasting the text and execution levels appears after this list). Across the board, current defences chiefly protect integrity and confidentiality, with limited or no direct protection for availability.
- AgentPI exposes a critical gap: context-dependent tasks open an attack surface where runtime environmental observations feed decision making. AgentPI shows that no single defence achieves high trustworthiness, high utility and low latency; some defences perform well on static benchmarks by suppressing contextual inputs but fail when context-dependent reasoning is required.
- Trade-offs are evident: execution-level defences can incur substantial overhead (up to three times baseline cost) or produce aggressive early refusals that kill task utility; text-level defences are cheaper but offer weaker security guarantees; many defences rely on coarse-grained interventions and struggle with fine-grained control.
- The study provides nine takeaways and four open problems, proposing directions such as fine-grained attention control, information flow and policy-based defences, and hybrid human–AI arbitration to balance security with utility in real-world settings.
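As a concrete contrast between the cheapest and the strongest intervention stages listed above, the sketch below pairs a text-level detection filter with an execution-level policy check. The regex patterns, task names and tool allow-list are illustrative assumptions, not mechanisms from the paper.

```python
# Hedged sketch contrasting a text-level filter with an execution-level policy
# gate. Patterns, task names and the allow-list are illustrative only.
import re

SUSPICIOUS = re.compile(
    r"ignore (all )?previous instructions|reveal the system prompt", re.I
)

def text_level_filter(observation: str) -> bool:
    """Cheap screen applied to untrusted text before it enters the agent's context."""
    return SUSPICIOUS.search(observation) is None

# Execution-level control: which tool calls a given task may make, regardless
# of what the model was persuaded to request.
ALLOWED_TOOLS_BY_TASK = {
    "summarise_inbox": {"read_email"},  # read-only task: no send or delete
}

def execution_level_policy(task: str, tool_name: str) -> bool:
    return tool_name in ALLOWED_TOOLS_BY_TASK.get(task, set())
```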
Limitations
Limitations include reliance on published material up to 2025 and the focus on selected benchmarks and defences; AgentPI is a new benchmark whose adoption is not yet widespread; generalisation to all LLM agents remains to be validated, and broader benchmarking is needed to cover evolving threat vectors.
Why It Matters
The SoK provides structured guidance for threat modelling and evaluation of secure LLM agents, highlighting that many defences are ill-suited to context-dependent, real-world tasks. It offers actionable insights and an open research roadmap to improve defence mechanisms, benchmarking and deployment strategies that balance security, utility and responsiveness in AI-driven agent systems.