Tests show agent guardrails miss tricky boundary cases
Agents
Agent security is now a contact sport. Once a Large Language Model (LLM) controls tools and data, prompt injection and tool abuse matter more than filtering racy text. A comparative test of four guardrails frames the problem in those terms: can they catch attacks on the agent itself as well as block harmful content?
How the tests worked
The authors sample 1,018 items from eight public security datasets and re-annotate each by hand as BLOCKED or ALLOWED. That yields 852 BLOCKED and 166 ALLOWED entries. They normalise four vendors’ API outputs to the same binary decision and score two metrics: recall and true negative rate. The scenarios skew adversarial and agent-specific, not just generic toxicity.
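To make the scoring concrete, here is a minimal sketch of how recall and true negative rate fall out of the binary labels; the function name and label strings are illustrative and assume the paper's BLOCKED/ALLOWED convention, not its actual code.

```python
# Minimal sketch (not the paper's code): score one guardrail against the
# human ground truth. Labels follow the BLOCKED/ALLOWED convention above.

def recall_and_tnr(ground_truth, decisions):
    """Return (recall, true_negative_rate) over paired label lists."""
    tp = sum(gt == "BLOCKED" and d == "BLOCKED" for gt, d in zip(ground_truth, decisions))
    fn = sum(gt == "BLOCKED" and d == "ALLOWED" for gt, d in zip(ground_truth, decisions))
    tn = sum(gt == "ALLOWED" and d == "ALLOWED" for gt, d in zip(ground_truth, decisions))
    fp = sum(gt == "ALLOWED" and d == "BLOCKED" for gt, d in zip(ground_truth, decisions))
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # share of harmful items caught
    tnr = tn / (tn + fp) if (tn + fp) else 0.0      # share of benign items let through
    return recall, tnr
```

With 852 BLOCKED and 166 ALLOWED items, a 90.4% true negative rate means roughly 16 of the benign boundary prompts were refused.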
Results: DKnownAI Guard tops recall at 96.5% and also leads true negatives at 90.4%. Lakera Guard places second on recall at 95.3%. AWS Bedrock Guardrails ranks second on true negatives at 89.8%. Azure Content Safety trails on both in this evaluation. All four use a dual-channel logic: if any detection fires, the input is BLOCKED; it is ALLOWED only if neither channel triggers. Because the ALLOWED subset is intentionally ambiguous and drawn from mostly harmful corpora, false positives rise and true negatives fall. In plain terms, borderline prompts are where guardrails wobble.
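The gating itself is simple enough to write down. Here is a minimal sketch of the dual-channel logic described above, with hypothetical channel names; real vendor APIs expose this differently:

```python
# Sketch of the dual-channel decision: BLOCKED if either channel fires,
# ALLOWED only if neither does. Channel names are hypothetical.

def combine_channels(attack_detected: bool, harmful_content_detected: bool) -> str:
    return "BLOCKED" if (attack_detected or harmful_content_detected) else "ALLOWED"
```

Because blocking wins whenever either channel is nervous enough to fire, an ambiguous prompt only has to trip one threshold to be refused, which is exactly how the boundary ALLOWED set drags true negative rates down.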
Where things break
The paper focuses on attacks that treat the agent as the target. Instruction override and indirect injection bend the agent’s role, often via context the agent is told to trust. A pasted task brief can smuggle new directives. A linked page can hide buried instructions. Chain-of-thought poisoning nudges the model’s internal reasoning toward unsafe tool use. Tool abuse co-opts privileged actions once the model believes the request is in scope.
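To see why this is hard to filter, consider a deliberately naive scanner over untrusted context; the phrase list and function name are invented for illustration, and production guardrails rely on trained classifiers rather than keyword matching:

```python
# Deliberately naive sketch: scan pasted or fetched text for crude override
# phrasing. Anything rephrased or buried in surrounding prose slips past.

OVERRIDE_MARKERS = (
    "ignore previous instructions",
    "disregard your system prompt",
    "you are now",
)

def crude_injection_scan(untrusted_context: str) -> bool:
    """Return True only if an obvious override phrase appears verbatim."""
    lowered = untrusted_context.lower()
    return any(marker in lowered for marker in OVERRIDE_MARKERS)
```

A directive reworded and tucked into a linked page sails past a check like this, which is the grey zone the study's adversarial samples are built to probe.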
An OpenClaw case study underlines the mechanics: prompt injection to lift credentials, data theft through coerced queries, destructive commands once the agent accepts a hijacked goal, plugin or skill compromise to siphon API keys or plant malware, all worsened by weak default configurations. None of this needs overtly toxic text. It lives in the grey zone where policy and intent are hard to parse.
The kicker is operational: even the best guardrail in this test misblocks about one in ten benign boundary prompts. An attacker can lean into that, crafting near-miss inputs to slip by or flooding systems with ambiguous queries to trigger denials and bury signals. That trade-off between catching more attacks and avoiding noisy lockouts is not just a tuning debate. It sets the terms for procurement, audit, and incident response. The study leaves open questions that matter for deployment: how to weight ambiguous context, how to measure category-specific confidence without collapsing to a single label, and how these controls behave when tools and plugins change under real workload pressure.
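One of those open questions is easy to state in code. A hypothetical sketch of keeping per-category confidence instead of collapsing to a single label, with invented category names and thresholds:

```python
# Hypothetical sketch: block if any category crosses its threshold, but keep
# the per-category evidence so audits can see why a prompt was refused.

from dataclasses import dataclass, field

@dataclass
class Verdict:
    decision: str                                   # "BLOCKED" or "ALLOWED"
    triggered: dict = field(default_factory=dict)   # category -> confidence

def graded_decision(scores: dict, thresholds: dict) -> Verdict:
    triggered = {cat: s for cat, s in scores.items() if s >= thresholds.get(cat, 1.0)}
    return Verdict("BLOCKED" if triggered else "ALLOWED", triggered)

# e.g. graded_decision({"injection": 0.62, "toxicity": 0.10},
#                      {"injection": 0.5, "toxicity": 0.8})
# -> Verdict(decision="BLOCKED", triggered={"injection": 0.62})
```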
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
A Comparative Evaluation of AI Agent Security Guardrails
🔍 ShortSpan Analysis of the Paper
Problem
This paper evaluates how well contemporary AI guardrails detect two classes of risk relevant to AI agents: attacks that target the agent itself (for example instruction override, indirect injection, role hijacking, chain-of-thought poisoning and tool abuse) and inputs intended to elicit harmful content (hate speech, pornography, violence). The work matters because AI agents increasingly control privileged actions and external resources, expanding the attack surface beyond traditional content safety into multi-dimensional, covert exploitation.
Approach
The authors compare DKnownAI Guard against AWS Bedrock Guardrails, Azure Content Safety and Lakera Guard. They randomly sampled 1,018 entries from eight public security datasets and performed item-by-item human re-annotation to produce a ground truth binary label per entry: BLOCKED (harmful) or ALLOWED (benign). Re-annotation yielded 852 BLOCKED and 166 ALLOWED entries. Each sample was submitted to every guardrail via API; vendor outputs were normalised into BLOCKED/ALLOWED decisions and compared to human labels to compute recall and true negative rate. The evaluation emphasised adversarial, agent-specific scenarios and higher attack intensity and deception than conventional benchmarks. The paper also discusses real-world agent vulnerabilities via an OpenClaw case study showing prompt injection, data theft, destructive commands, plugin compromise and weak default configurations.
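The normalisation step is worth spelling out, since it is where graded vendor signals become a single yes/no. A sketch under invented response shapes; the real vendor schemas differ and are not reproduced here:

```python
# Illustrative sketch: map differently shaped vendor responses onto the
# study's binary BLOCKED/ALLOWED decision before scoring. The field names
# below are hypothetical stand-ins, not real vendor schemas.

def normalise(vendor: str, response: dict) -> str:
    if vendor == "vendor_a":    # hypothetical: explicit action field
        return "BLOCKED" if response.get("action") == "BLOCK" else "ALLOWED"
    if vendor == "vendor_b":    # hypothetical: boolean flag
        return "BLOCKED" if response.get("flagged") else "ALLOWED"
    if vendor == "vendor_c":    # hypothetical: per-category severity scores
        severities = response.get("severities", {})
        return "BLOCKED" if any(v >= 2 for v in severities.values()) else "ALLOWED"
    raise ValueError(f"unknown vendor: {vendor}")
```

As the limitations below note, any such mapping discards graded confidences and category-specific detail.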
Key Findings
- DKnownAI Guard achieved the highest recall at 96.5% and the highest true negative rate at 90.4%, yielding the best overall performance among the four evaluated guardrails.
- Lakera Guard ranked second in recall at 95.3%.
- AWS Bedrock Guardrails ranked second in true negative rate at 89.8%.
- Azure Content Safety performed relatively lower on both recall and true negative rate in this evaluation.
- All four vendors apply a dual-channel detection logic in which an input is classified as BLOCKED if either detection channel triggers; ALLOWED only if neither channel triggers.
- The ALLOWED subset in this study comprised boundary, high-ambiguity cases drawn from largely harmful datasets; such samples drive higher false positive rates and lower true negative rates compared with ordinary benign data.
Limitations
The evaluation is limited to 1,018 sampled entries from eight datasets and to four vendor products. Human re-annotation introduces subjectivity in ground truth labelling. Mapping diverse vendor outputs into a single binary BLOCKED/ALLOWED label reduces signal about graded confidences or category-specific detections. The ALLOWED examples were selected from primarily harmful datasets and are intentionally ambiguous, which affects measured false positive rates and may not represent typical benign traffic.
Implications
Attackers can exploit agent-specific vulnerabilities described in the paper: prompt injection to exfiltrate credentials, instruction override to perform destructive actions, plugin or skill compromise to steal API keys or deploy malware, and tool abuse to misuse privileged capabilities. The evaluation shows guardrails can still be evaded or confused by high-ambiguity inputs; attackers may craft borderline inputs to bypass detection or to induce false positives and disrupt legitimate use. Even the best-performing guardrail misblocks approximately 10% of benign boundary cases, a gap that could be leveraged for denial-of-service or to obscure malicious activity among noisy alerts.