AgentSecBench Measures Prompt Injection, Leakage, Tool Abuse

Published: Tue, May 26, 2026 • By Rowan Vale
AgentSecBench puts LLM agents under controlled attacks to test whether untrusted text can steer actions or leak secrets. It shows that projection-based defences, which remove unauthorised observations or capabilities before generation, close the measured channels and stop the designated violations, while prompt-only delimiters do not. Transparent filtering helps only partially. The study separates enforcement from behaviour across runs on Qwen3-0.6B and Qwen3-1.7B.

Most Large Language Model (LLM) agents shovel everything into one generative pipe: trusted instructions, retrieved content, and tool outputs. That design conflates data with authority. If an attacker can plant the right string in retrieval or a tool’s output, the model may leak a secret canary or propose a forbidden action even when policy forbids it.

How the channel bites

Think of the model-visible token stream as the attack surface. If the protected symbol is still there at generation time, the model can act on it. AgentSecBench frames this as three security games with crisp, testable outcomes: instruction integrity (prompt injection), retrieval confidentiality (unauthorised disclosure), and capability integrity (forbidden action proposals). The benchmark uses exact markers so violations are unambiguous. It pairs adversarial runs with benign controls to measure adversarial advantage, and it records whether a defence closes the channel before generation or merely nudges behaviour.
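
The paired design can be made concrete with a small sketch. It assumes that advantage is the violation rate on adversarial runs minus the rate on their benign controls, which is one plausible reading of "paired adversarial advantage" rather than a formula quoted from the paper. `PairedTrace`, `violated` and `paired_advantage` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedTrace:
    """One adversarial run and its benign control (hypothetical record layout)."""
    adversarial_output: str
    control_output: str


def violated(output: str, marker: str) -> bool:
    # Exact-marker predicate: a violation is the literal marker in the output.
    return marker in output


def paired_advantage(pairs: list[PairedTrace], marker: str) -> float:
    """Adversarial violation rate minus benign-control violation rate.

    Assumption: this difference-of-rates definition is illustrative only.
    """
    if not pairs:
        raise ValueError("need at least one paired trace")
    adversarial_hits = sum(violated(p.adversarial_output, marker) for p in pairs)
    control_hits = sum(violated(p.control_output, marker) for p in pairs)
    return (adversarial_hits - control_hits) / len(pairs)
```

Because the predicate is an exact string match, the ground truth is unambiguous, but a paraphrased leak would go uncounted.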

The team used deterministic decoding on Qwen3-0.6B and Qwen3-1.7B across 576 traces (288 adversarial, 288 paired controls). They evaluated six defence classes: none, delimiter annotation in prompts, transparent filtering, provenance projection for retrieval, capability projection for tools, and a combined stack.

What actually breaks

Projection is the interesting bit. Provenance projection strips unauthorised retrieved symbols before the model sees them. Capability projection removes tool names the agent is not allowed to call. With both in place, every measured channel was closed in these games and the traces recorded zero designated violations. That is enforcement, not vibes.
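
A minimal sketch of the two projections, assuming each retrieved record carries a provenance label and tools are identified by name. The `Record` type, the source labels and the prompt layout are hypothetical. The benchmark may project at a finer granularity than whole records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str
    source: str  # provenance label attached at retrieval time (hypothetical)


def project_provenance(records: list[Record], authorised: set[str]) -> list[Record]:
    """Drop every record whose source the policy does not authorise."""
    return [r for r in records if r.source in authorised]


def project_capabilities(tools: list[str], allowed: set[str]) -> list[str]:
    """Expose only the tool names the agent is permitted to call."""
    return [t for t in tools if t in allowed]


def build_model_input(instruction: str, records: list[Record], tools: list[str]) -> str:
    """Assemble the model-visible text from inputs that were already projected."""
    context = "\n".join(r.text for r in records)
    return f"{instruction}\n\nContext:\n{context}\n\nTools: {', '.join(tools)}"
```

The key design choice is that both filters run before the prompt is assembled, so an unauthorised string or tool name never reaches the model-visible text.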

Transparent filtering helped when it detected the protected coordinates. It closed about 54.2% of measured channels and cut paired adversarial advantage to 0.042. Useful, but partial.

Delimiter annotation told the model where the boundary should be but left the adversarial token stream intact. It closed no measured channels. In one run the observed advantage was 0.438, higher than the unmodified baseline at 0.375. Behavioural resistance without isolation is brittle.
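
The difference between describing a boundary and closing a channel can be checked mechanically. In this sketch the `<untrusted>` tag format and both function names are hypothetical. The closure check simply asks whether the protected marker is still present in the text the model will see.

```python
def annotate(instruction: str, retrieved: str) -> str:
    """Wrap untrusted text in boundary tags; nothing is removed."""
    return f"{instruction}\n<untrusted>\n{retrieved}\n</untrusted>"


def channel_closed(model_input: str, protected_marker: str) -> bool:
    """True only if the protected marker never reaches the model-visible text."""
    return protected_marker not in model_input
```

An annotated prompt that still contains the marker fails this check regardless of how the model happens to behave on that input.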

On retrieval confidentiality, provenance projection and the combined stack eliminated canary disclosure in these tests. Prompt annotation left much higher disclosure, with a leakage rate around 0.562 in one condition. In the capability game, least-privilege capability projection removed forbidden capability names before generation, and the traces showed zero designated forbidden-action emissions, while the baseline showed very high advantage in some suites.

Limits matter: the benchmark uses exact-match markers, single-turn tasks, and fixed budgets, so it under-approximates paraphrase and multi-turn adaptation. Still, it cleanly separates real channel closure from apparent robustness. The open question is how far this holds once attackers start paraphrasing, obfuscating, and iterating across turns.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

AgentSecBench: Measuring Prompt Injection, Privacy Leakage, and Tool-Use Integrity in LLM Agents

Authors: Faruk Alpay and Taylan Alpay
LLM agents process trusted instructions, retrieved records, and tool observations through a common generative channel. This conflates data flow with authority: an untrusted string can affect a secret-bearing response or an action proposal even when no application policy authorizes that influence. We introduce AgentSecBench as an empirical instantiation of a formal security framework for this problem. The framework defines three games (instruction-integrity, retrieval-confidentiality, and capability-integrity) under a common notion of intent-to-execution noninterference with permitted leakage. It represents an application policy as a projection onto authorized observations and capabilities, distinguishes prompt annotations from enforcing projections, and measures both adversarial advantage and whether a defense closes the relevant model-visible channel before generation. The exact-marker experiments are intentionally one observable instantiation of the games rather than a complete semantic security claim: they test disclosure and forbidden-action distinguishers with unambiguous ground truth. We evaluate six defense classes with Qwen3-0.6B and Qwen3-1.7B on paired adversarial and benign-control executions. The measurements show when risk reduction follows channel closure and when a model-visible adversarial capability remains exploitable. The result is a security-oriented evaluation method: prompt text can describe a boundary, whereas provenance projection, capability restriction, and output validation can enforce one.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies how large language model agents conflate data flow with authority by ingesting trusted instructions, retrieved records and tool observations through a single generative channel. That design lets untrusted strings influence secret-bearing responses or action proposals even when application policy does not authorise that influence. The work frames these failures as formal security games and focuses on measurable breaches: instruction integrity (prompt injection), retrieval confidentiality (unauthorised secret disclosure) and capability integrity (proposal of forbidden actions).

Approach

AgentSecBench is an empirical benchmark that instantiates a formal intent-to-execution noninterference definition with permitted leakage. It defines three games (instruction-integrity, retrieval-confidentiality, capability-integrity), exact-marker designated violation predicates, and a paired adversarial/benign-control protocol. Defences are grouped into six classes: no defence, delimiter annotation, transparent filtering, provenance projection for retrieval, capability projection for tools, and a combined stack. The evaluation runs deterministic decoding on Qwen3-0.6B and Qwen3-1.7B with 576 traces (288 adversarial, 288 paired controls), records metrics including attack success rate, paired adversarial advantage, pre-generation channel-closure, retrieval leakage and benign utility, and distinguishes prompt annotation from enforcement that removes model-visible symbols before generation.

Key Findings

  • Combined projection that enforces provenance and capability restrictions closed every designated measured channel in these controlled games and recorded zero designated violations on those traces.
  • Transparent filtering closed approximately 54.2% of measured channels and reduced paired adversarial advantage to 0.042, showing partial pre-generation mitigation is effective when it detects the protected coordinates.
  • Delimiter annotation (prompt-level boundary markers) closed no measured channels. It sometimes reduced observed attack success, but in the reported run it yielded an advantage of 0.438, higher than the unmodified baseline advantage of 0.375, which shows that behavioural resistance can coexist with an open adversarial channel.
  • In the retrieval-confidentiality game, provenance projection and the combined stack eliminated canary disclosure in these tests, whereas prompt annotation left a substantially larger disclosure rate (for example leakage around 0.562 in one condition).
  • In the capability game, least-privilege capability projection removed forbidden capability names before generation and observed zero designated forbidden-action emissions; the baseline recorded very high advantage for forbidden actions in some suites.
  • Reporting channel-closure separately from attack-success clarifies when risk reduction is due to true enforcement versus model-dependent behavioural change; defences that leave the adversarial token visible cannot claim enforced noninterference.
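
Reporting the two metrics separately might look like the sketch below. The `Trace` fields and the `summarise` helper are hypothetical, and the function takes whatever traces it is given; no values from the paper are built in.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    defence: str
    channel_closed: bool  # marker absent from model input before generation
    violation: bool       # exact marker present in model output


def summarise(traces: list[Trace]) -> dict[str, dict[str, float]]:
    """Per defence: fraction of channels closed and fraction of violations."""
    grouped: dict[str, list[Trace]] = defaultdict(list)
    for t in traces:
        grouped[t.defence].append(t)
    return {
        name: {
            "closure_rate": sum(t.channel_closed for t in group) / len(group),
            "attack_success": sum(t.violation for t in group) / len(group),
        }
        for name, group in grouped.items()
    }
```

A defence with low attack success but a closure rate of zero is resisting by behaviour only, which is the case the benchmark is designed to expose.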

Limitations

The benchmark uses exact-match markers as low-variance distinguishers so it under-approximates semantic violations such as paraphrase, obfuscated encodings or invented references. Tasks are controlled, single-turn and fixed-budget; adaptive multi-turn adversaries, mixed-document provenance, richer feature mappings and stochastic decoding are outside this evaluation. Benign utility is a lexical proxy and not a semantic quality score. The threat model assumes no model weight compromise or training-data poisoning and measures disclosure via access-control failures rather than parameter memorisation.

Implications

Offensive implications are clear: an adversary who can influence retrieved content or tool outputs may cause an agent to disclose secrets or propose forbidden actions unless the system enforces projection of unauthorised observations or restricts capabilities. Prompt-level annotations alone are insufficient to guarantee security because they leave the model-visible channel intact; attackers can exploit that channel to induce harmful outputs. Defences that remove protected symbols before generation and that validate proposed actions post-generation are necessary to make agent deployments robust against injection, confidentiality breaches and tool-use hijacking.
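
Post-generation validation can be sketched as an allow-list check on proposed actions. The `CALL name(...)` syntax is a hypothetical action format invented for this example, since the article does not describe how agents in the benchmark emit tool calls.

```python
import re

# Hypothetical action syntax: "CALL tool_name(arguments)".
CALL_PATTERN = re.compile(r"\bCALL\s+([A-Za-z_]\w*)\s*\(")


def proposed_tools(model_output: str) -> list[str]:
    """Extract every tool name the model proposes to call, in order."""
    return CALL_PATTERN.findall(model_output)


def forbidden_actions(model_output: str, allowed: set[str]) -> list[str]:
    """Return proposed tool names outside the allow-list; empty means pass."""
    return [name for name in proposed_tools(model_output) if name not in allowed]
```

A caller would refuse to execute any output for which `forbidden_actions` returns a non-empty list, so the check holds regardless of what the prompt contained.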

