Diagnose and Harden AI Agents with AgentDoG
Agents
AgentDoG is a practical step toward accountability for autonomous agents. It does two straightforward things: first, it organises agent risk with a three-dimensional taxonomy that records where risk comes from, how it manifests and what harm results; second, it uses that structure to monitor entire agent trajectories and produce provenance-aware diagnoses rather than a yes/no label. That provenance piece is vital in the field where answers without origins are useless for mitigation.
Large Language Model (LLM) means a text generator trained on lots of examples; I will use LLM for the underlying models. A trajectory is the sequence of messages, tool calls and observations that an agent produces during an interaction; AgentDoG inspects those sequences. Provenance means mapping an unsafe outcome back to the planning step, tool selection or context interpretation that caused it, so you can fix the root cause.
What the work delivers
The team offers a fine-grained benchmark called ATBench with 500 full trajectories, balanced between 250 safe and 250 unsafe examples, averaging nearly 11 messages per trajectory. The evaluation uses a library of 2292 tool definitions and reports that roughly 52 per cent of synthesized trajectories survive quality control. AgentDoG models are released across Qwen and Llama families in 4B, 7B and 8B sizes, and the authors report state-of-the-art moderation numbers on standard tests: for example, 92.7 per cent F1 on R judge with Qwen3 4B compared with 91.8 for a listed GPT baseline and approaching a higher reference model at 95.3 per cent. On ASSE Safety the Llama 3 8B variant attains 83.4 per cent F1.
The diagnostic performance is where it gets interesting. AgentDoG improves attribution to root causes: on ATBench a Qwen 3 FG 4B variant reports 82.0 per cent accuracy for identifying the risk source, 32.4 per cent for failure mode and 58.4 per cent for real-world harm. That 32 per cent number for failure mode is a reminder that mapping behaviour to a precise failure taxonomy remains hard, but the model moves the needle on transparency.
Practical control options
If you are responsible for an agent pipeline, treat AgentDoG as both a test harness and a diagnostic tool. Use the released models and ATBench to red-team trajectories, reproduce failure classes and validate that fixes actually change provenance traces. Don’t rely on a binary safety flag; demand an attribution trace you can act on.
- Minimal: run the 4B variant as a monitor to flag risky trajectories and collect provenance logs for triage.
- Better: integrate the 7B model into CI, require root-cause notes for blocked tool calls, and use ATBench slices to regression-test fixes.
- Best: run the 8B variant in a shadow mode against production agents, automate rollback or step-limiting on verified high-risk provenance, and keep ongoing adversarial tests focused on guardrail targeting.
Limitations matter. AgentDoG currently handles text trajectories; GUI and multimodal agents need extension. Guardrails can be attacked, so use layered controls and periodic adversarial evaluation. Still, the contribution is tangible: a structured taxonomy, a repeatable benchmark and open models you can use to move from vague worry to targeted hardening in agent pipelines.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security
🔍 ShortSpan Analysis of the Paper
Problem
The rapid emergence of AI agents that can plan, use tools and interact with their environment introduces complex safety and security challenges. Existing guardrails often lack agentic risk awareness and transparent risk diagnosis, making it difficult to understand why an agent acts unsafely or why seemingly safe actions may be unreasonable. The authors propose a unified three dimensional taxonomy to categorise agentic risks by where they originate (risk source), how they manifest (failure mode), and what harm they cause (real world consequence). Guided by this taxonomy, they present a fine grained safety benchmark called ATBench and a Diagnostic Guardrail framework named AgentDoG. AgentDoG aims to monitor trajectories with context, diagnose root causes with provenance, and offer explanations beyond binary labels, across three model sizes (4B, 7B and 8B) in Qwen and Llama families. The work reports state of the art performance and openly releases all models and datasets.
Approach
The central contribution is a three dimensional agent safety taxonomy that orthogonally records risk source, failure mode and real world harm. Risk sources cover user inputs, environmental observations, external tools or APIs and internal model faults. Failure modes distinguish behavioural failures such as flawed planning or unsafe tool use from output content failures including misleading or unsafe information. Real world harms capture physical, financial, privacy, reputational and societal consequences. Building on this taxonomy, the authors create ATBench a fine grained benchmark to analyse trajectory level safety and risk diagnosis. AgentDoG provides trajectory level monitoring and an Explainable AI module that traces unsafe actions to planning steps, tool selections or context interpretations, enabling provenance based diagnosis. Data synthesis for training and evaluation uses a three stage planner driven pipeline to generate long horizon, tool augmented trajectories with controllable risk injection. A quality control process combines deterministic validators and large language model judgments to verify alignment with the taxonomy. The authors release open source model variants across Qwen and Llama in three sizes and a large synthetic data pipeline with thousands of tools to stress test guardrails.
The evaluation spans multiple datasets including R judge and ASSE Safety, supplemented by ATBench. In data generation, the authors synthesise trajectories by sampling risk configurations along three taxonomy dimensions, constructing coherent multi step plans, instantiating tool use and injecting risk at defined points. The orchestrator coordinates user queries, tool responses and agent outputs, and a post hoc summary describes outcomes. QC verifies structure, tool correctness and taxonomy alignment, with separate checks for unsafe versus safe trajectories and their fine grained labels. The ATBench data set comprises 500 full trajectories with 250 safe and 250 unsafe examples, averaging nearly 11 messages per trajectory and 1577 tool identifiers, with an independent tool library of 2292 definitions for evaluation. The authors also provide a dedicated tool free evaluation setting to test generalisation to unseen tools.
AgentDoG models are trained with supervised fine tuning on demonstrations, using a learning rate of 1e-5. The evaluation framework compares AgentDoG against a range of guard models and general purpose models, reporting trajectory level accuracy, precision, recall and F1 scores, along with fine grained taxonomy accuracy for Risk Source, Failure Mode and Real-world Harm.
Key Findings
- AgentDoG achieves state of the art in trajectory level safety moderation on diverse benchmarks. On R judge it attains 92.7 per cent F1 with Qwen3 4B in comparison with 91.8 per cent for GPT 5 2 and approaching Gemini 3 Flash at 95.3 per cent.
- On ASSE Safety, AgentDoG with Llama 3 8B reaches 83.4 per cent F1, outperforming several specialised guard models and balancing precision and recall more effectively than many baselines.
- Fine grained diagnosis on ATBench shows AgentDoG substantially improves attribution to root causes. Risk Source accuracy reaches 82.0 per cent, Failure Mode accuracy 32.4 per cent and Real world Harm accuracy 58.4 per cent for AgentDoG Qwen 3 FG 4B, outperforming baselines that struggle with mapping trajectories to taxonomy labels.
- The Agentic XAI Attribution framework demonstrates trajectory level and sentence level attribution, identifying how planning steps and specific sentences contribute to an unsafe target, with case studies reproducing prompt injection, sarcasm mis interpretation and ambiguous transaction risks. This improves transparency and supports targeted mitigation.
- ATBench provides a robust held out benchmark with 500 trajectories and balanced taxonomy coverage, enabling evaluation of long horizon tool augmented interactions and generalisation to unseen tools via a 2292 tool library. QC retains around 52 per cent of trajectories after filtering to ensure high quality evaluation data.
Limitations
AgentDoG currently handles text based trajectories and may require extension to multimodal inputs for GUI driven agents. The evaluation framework notes potential out of distribution effects for guard models not trained on long tool augmented trajectories. The authors also acknowledge guard rails themselves can be targeted, necessitating ongoing security testing and mitigation planning.
Why It Matters
The work provides a principled, taxonomy driven approach to diagnosing why AI agents act unsafely and why seemingly safe actions may be flawed. By offering an interpretable provenance based guardrail, an extensive fine grained safety benchmark and open source model variants, it enables practical testing and hardening of agent decision pipelines, supporting safer deployment in high stakes contexts. The framework supports security testing and could inform reward shaping or reinforcement learning based alignment strategies, contributing to greater transparency and accountability in autonomous agents while recognising that guardrails themselves are potential targets for adversaries.