AgentCanary tests autonomous agents in real environments
Agents
Autonomous agents have shifted failure modes from dodgy text to system-level trouble: files modified, accounts touched, data moved. AgentCanary studies that world by dropping agents into containerised environments with real tools and state that persists across steps. No mocked tool outputs, no static Q and A. It watches what the agent actually does.
A benchmark that behaves like a real system
The framework builds realistic task artefacts on the fly: inboxes, webpages, virtual accounts, third party skills and persistent memory. Agents run inside isolated containers, leaving observable traces and side effects. Crucially, evaluation reads the full trajectory, not a single reply. Three decomposed metrics keep the story straight: Outcome Safety Score (did harm occur), Security Awareness Score (did the agent recognise the attack), and Task Utility Score (did it still get useful work done). Judges are rubric driven Large Language Models (LLMs) with task specific guidance.
A tidy threat model, then pressure
I like the orthogonal Entry x Impact taxonomy. It cleanly splits how adversarial influence gets in (user interaction, untrusted external content, compromised skills, persistent memory or intrinsic failure) from what damage it tries to cause. Tasks map one entry to one impact, spanning harms such as data exfiltration, credential leakage, unauthorised transactions and backdoors. That decoupling makes failure analysis easier: you see whether the agent failed at provenance, reasoning or execution.
Under pressure, many agents wobble. One model, Claude Opus 4.6, sets a clear frontier with an overall score of 83.9, high outcome safety (88.1), solid security awareness (74.8) and a 12.9 percent unsafe outcome rate. Most others trail by a margin, and scaling alone does not guarantee robustness. Awareness and outcomes routinely decouple: agents sometimes avoid harm without noticing the attack, a brittle silent obedience that breaks under adaptive variants.
The nastiest cases come from state. Memory contamination and long horizon staging push unsafe outcomes up, especially when signals are dispersed across steps or sessions. Skill poisoning bites hard too: camouflaged malicious logic inside executable scripts collapses risk perception and drives compromise. Static prompt injection templates are not enough; iterative attack evolution and progressive campaigns reveal failures that simple tests miss. Runtime defences help a little, variably, and can trade utility for safety.
This is a designed benchmark with LLM judges and specific frameworks, not a census of the wild. But as a systematic baseline for agents that really execute, it is the right kind of uncomfortable. The open question is how quickly deployment stacks can learn to scrutinise provenance, vet skills and reason over long, messy state without dulling their utility.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies security risks introduced by autonomous AI agents that execute actions in real computing environments rather than only generating text. This shift exposes system-level surfaces such as files, persistent memory, third-party skills, communication channels and transactions to adversarial influence. Prior evaluations suffer from fragmented coverage of attack types, low-fidelity or static tool mocks, and single-dimensional outcome metrics that obscure whether agents actually recognised an attack or merely avoided consequences by chance. A more systematic, environment-grounded benchmark is needed to assess risks that unfold across multi-step, stateful workflows.
Approach
AgentCanary is a security evaluation framework that combines three elements. First, an orthogonal Entry × Impact taxonomy separates where adversarial influence enters an agent (five entries: user interaction, untrusted external content, compromised skills, persistent memory/state, intrinsic failures) from the realised harms (seven impact categories). Tasks are grounded in realistic workflows and mapped to one entry and one impact. Second, a high-fidelity executable environment runs agents inside isolated containers where they interact with real tools and dynamically provisioned artifacts (inboxes, webpages, virtual accounts, skills and memory) with within-task state persistence to support long-horizon attacks. The environment is agent-framework-agnostic and records system-level side effects. Third, trajectory-grounded multi-dimensional evaluation scores the full execution trace rather than a single reply, using three decomposed metrics: Outcome Safety Score (OSS), Security Awareness Score (SAS) and Task Utility Score (TUS). The suite comprises 496 seed tasks, twelve language models instantiated across three agent frameworks, multiple adversarial regimes including prompt-injection templates, dynamic attack evolution and long-horizon progressive attacks, and evaluation of runtime defence components. Trajectories are judged by LLM-based rubric evaluators under task-specific guidelines.
Key Findings
- Performance is uneven: one model (Claude Opus 4.6) establishes a clear frontier (overall score 83.9) with high OSS (88.1), SAS (74.8) and low unsafe outcome rate (12.9%), while most models perform substantially worse.
- Security does not follow simple scaling: larger or newer models often improve but family and alignment differences produce large internal variation; scale alone does not guarantee robust security.
- Awareness and outcome can be decoupled: agents sometimes prevent harmful outcomes without recognising attacks, especially in indirect prompt injection, creating brittle "silent-obedience" behaviour vulnerable to adaptive variants.
- Memory contamination and long-horizon attacks are severe: persistent-state poisoning and staged cross-session attacks yield high unsafe-outcome rates across many models, showing agents often trust stored memory and fail to integrate dispersed signals.
- Skill poisoning is highly damaging and camouflaged malicious logic collapses defences: hiding malicious behaviour in executable scripts greatly reduces risk perception and sharply raises compromise rates.
- Dynamic and adaptive attacks matter: iterative attack evolution and long-horizon progressive attacks expose failures missed by static tests. Runtime defences provide marginal, model-dependent gains and can trade off utility for safety.
Limitations
AgentCanary is a designed benchmark with a specified task suite and rubric-driven LLM judges; it uses containerised emulation and three agent frameworks which may not cover every deployment variant. The Entry × Impact matrix is a design space rather than a claim about empirical prevalence. The chosen judge model and framework configurations influence measured scores and framework choice can shift per-model security substantially.
Implications
Offensive implications are concrete: adversaries can plant malicious content in external artefacts, contaminate persistent memory, poison third-party skills or distribute staged, low-signal steps that culminate in system compromise, data exfiltration, credential leakage, unauthorised transactions or backdoors. Camouflage and adaptive prompt evolution increase attack stealth and effectiveness. The results indicate attackers can exploit trust in tools and state to achieve persistent, long-horizon compromises unless agents are designed to scrutinise provenance, vet executable skills and link dispersed signals across sessions.