AgentSecBench Measures Prompt Injection, Leakage, Tool Abuse
Agents
Most Large Language Model (LLM) agents shovel everything into one generative pipe: trusted instructions, retrieved content, and tool outputs. That design conflates data with authority. If an attacker can plant the right string in retrieval or a tool’s output, the model may leak a secret canary or propose a forbidden action even when policy forbids it.
How the channel bites
Think of the model-visible token stream as the attack surface. If the protected symbol is still there at generation time, the model can act on it. AgentSecBench frames this as three security games with crisp, testable outcomes: instruction integrity (prompt injection), retrieval confidentiality (unauthorised disclosure), and capability integrity (forbidden action proposals). The benchmark uses exact markers so violations are unambiguous. It pairs adversarial runs with benign controls to measure adversarial advantage, and it records whether a defence closes the channel before generation or merely nudges behaviour.
The team ran deterministic decoding on Qwen3-0.6B and Qwen3-1.7B with 576 traces. They evaluated six defence classes: none, delimiter annotation in prompts, transparent filtering, provenance projection for retrieval, capability projection for tools, and a combined stack.
What actually breaks
Projection is the interesting bit. Provenance projection strips unauthorised retrieved symbols before the model sees them. Capability projection removes tool names the agent is not allowed to call. With both in place, every measured channel was closed in these games and the traces recorded zero designated violations. That is enforcement, not vibes.
Transparent filtering helped when it spotted the right coordinates. It closed about 54.2% of measured channels and cut paired adversarial advantage to 0.042. Useful, but partial.
Delimiter annotation told the model where the boundary should be but left the adversarial token stream intact. It closed no measured channels. In one run the observed advantage was 0.438, higher than the unmodified baseline at 0.375. Behavioural resistance without isolation is brittle.
On retrieval confidentiality, provenance projection and the combined stack eliminated canary disclosure in these tests. Prompt annotation left much higher disclosure, with a leakage rate around 0.562 in one condition. In the capability game, least-privilege capability projection removed forbidden capability names before generation and saw zero designated forbidden-action emissions, while the baseline showed very high advantage in some suites.
Limits matter: the benchmark uses exact-match markers, single-turn tasks, and fixed budgets, so it under-approximates paraphrase and multi-turn adaptation. Still, it cleanly separates real channel closure from apparent robustness. The open question is how far this holds once attackers start paraphrasing, obfuscating, and iterating across turns.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
AgentSecBench: Measuring Prompt Injection, Privacy Leakage, and Tool-Use Integrity in LLM Agents
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how large language model agents conflate data flow with authority by ingesting trusted instructions, retrieved records and tool observations through a single generative channel. That design lets untrusted strings influence secret-bearing responses or action proposals even when application policy does not authorise that influence. The work frames these failures as formal security games and focuses on measurable breaches: instruction integrity (prompt injection), retrieval confidentiality (unauthorised secret disclosure) and capability integrity (proposal of forbidden actions).
Approach
AgentSecBench is an empirical benchmark that instantiates a formal intent-to-execution noninterference definition with permitted leakage. It defines three games (instruction-integrity, retrieval-confidentiality, capability-integrity), exact-marker designated violation predicates, and a paired adversarial/benign-control protocol. Defences are grouped into six classes: no defence, delimiter annotation, transparent filtering, provenance projection for retrieval, capability projection for tools, and a combined stack. The evaluation runs deterministic decoding on Qwen3-0.6B and Qwen3-1.7B with 576 traces (288 adversarial, 288 paired controls), records metrics including attack success rate, paired adversarial advantage, pre-generation channel-closure, retrieval leakage and benign utility, and distinguishes prompt annotation from enforcement that removes model-visible symbols before generation.
Key Findings
- Combined projection that enforces provenance and capability restrictions closed every designated measured channel in these controlled games and recorded zero designated violations on those traces.
- Transparent filtering closed approximately 54.2% of measured channels and reduced paired adversarial advantage to 0.042, showing partial pre-generation mitigation is effective when it detects the protected coordinates.
- Delimiter annotation (prompt-level boundary markers) closed no measured channels yet sometimes reduced observed attack success; in the reported run it yielded advantage 0.438, higher than the unmodified baseline advantage of 0.375, demonstrating behavioural resistance can coexist with an open adversarial channel.
- In the retrieval-confidentiality game, provenance projection and the combined stack eliminated canary disclosure in these tests, whereas prompt annotation left a substantially larger disclosure rate (for example leakage around 0.562 in one condition).
- In the capability game, least-privilege capability projection removed forbidden capability names before generation and observed zero designated forbidden-action emissions; the baseline recorded very high advantage for forbidden actions in some suites.
- Reporting channel-closure separately from attack-success clarifies when risk reduction is due to true enforcement versus model-dependent behavioural change; defences that leave the adversarial token visible cannot claim enforced noninterference.
Limitations
The benchmark uses exact-match markers as low-variance distinguishers so it under-approximates semantic violations such as paraphrase, obfuscated encodings or invented references. Tasks are controlled, single-turn and fixed-budget; adaptive multi-turn adversaries, mixed-document provenance, richer feature mappings and stochastic decoding are outside this evaluation. Benign utility is a lexical proxy and not a semantic quality score. The threat model assumes no model weight compromise or training-data poisoning and measures disclosure via access-control failures rather than parameter memorisation.
Implications
Offensive implications are clear: an adversary who can influence retrieved content or tool outputs may cause an agent to disclose secrets or propose forbidden actions unless the system enforces projection of unauthorized observations or restricts capabilities. Prompt-level annotations alone are insufficient to guarantee security because they leave the model-visible channel intact; attackers can exploit that channel to induce harmful outputs. Defences that remove protected symbols before generation and that validate proposed actions post-generation are necessary to make agent deployments robust against injection, confidentiality breaches and tool-use hijacking.