Framework maps LLM agent attacks across four layers
AI-Infra-Guard lands as an open-source, end-to-end red‑teaming framework for AI agents. It treats the attack surface as four layers and picks different detection methods for each: rule-based infra checks, protocol and skill auditing, black-box agent probing, and a broad jailbreak harness. It also tackles supply‑chain risk in agent skills.
Agent stacks are no longer just a model and an API key. They sprawl across model servers, agent frameworks, Model Context Protocol (MCP) tools, and third‑party skill packages. That sprawl is an attack surface. The interesting move in this research is to stop pretending one scanner fits all of it.
Four layers, four ways to break
The authors ship AI-Infra-Guard, an open-source framework that matches the evidence you can actually collect at each layer with a different testing paradigm. For infrastructure, it goes deterministic: 107 fingerprint rules and 1,443 vulnerability rules spanning 75 components, plus version normalisation and a three‑tier confidence model. That design finds the boring but deadly stuff fast, like unauthenticated model servers that invite unauthorised compute use, model exfiltration or poisoning, and the occasional remote code execution chain.
Move up a layer and protocol tooling gets messy. The MCP auditor runs in static and dynamic modes and uses Large Language Model (LLM) driven analysis with Prompt-as-Rule detection criteria and explicit exclusion conditions to stop the false‑positive firehose. File access is sandboxed, and the scanner hardens itself against indirect prompt injection. Why? Because tool metadata can be poisoned, shadowed, or swapped in a rug pull that quietly steers an agent at run time.
Supply chain is where agents inherit stranger danger. The skill auditor preprocesses packages, performs lightweight retrieval, then applies agentic LLM analysis. It comes with SkillTrustBench: 5,520 evaluation cases drawn from 62,652 collected skills. The risks are concrete: hidden instructions that override policies, over‑privileged actions, or remote payload retrieval that turns a “skill” into a staging point.
Runtime behaviour is probed with capability‑aware, cost‑bounded multi‑turn red teaming. The system escalates in playbook steps and uses canary tokens and stop rules to pin down objective evidence. In practice, that means inducing an agent to leak system prompts, credentials, or environment variables, or to execute commands and fetch server‑side request forgery canaries after a couple of coaxing turns.
At the model layer, the framework unifies jailbreak testing: a simulator plus target model plus an LLM judge, with 26+ attack operators exercised across 16 datasets totalling around 7,000 harmful prompts. Single‑ and multi‑turn strategies, including funky encodings, get rolled into attack‑success profiles you can actually compare across models and attack families.
None of this is magic. Results depend on the base LLMs and on carefully written rules and exclusion conditions; LLM auditors over‑report if you let them. Some fingerprints need code, not declarations. Black‑box probes cost money, so coverage is budgeted, and version inference lowers confidence. Still, layer‑paradigm matching offers a practical baseline the community, and yes policymakers and procurement teams, can align on: evidence first, claims second.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Securing the AI Agent: A Unified Framework for Multi-Layer Agent Red Teaming
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies how to assess and red-team the expanding attack surface of AI agents, arguing that the ecosystem of model servers, agent platforms, tool protocols and packaged skills has outpaced existing security tooling. The authors observe the attack surface is stratified into four layers—infrastructure, protocol and tools, agent behaviour, and the model itself—and that each layer requires a different detection paradigm because the form of evidence that establishes a finding differs across layers.
Approach
The authors present AI-Infra-Guard, an open-source framework that maps a detection paradigm to each layer. For infrastructure it uses a deterministic rule engine with a bespoke AI fingerprint corpus, a version-normalisation procedure and a three‑tier confidence model. For the protocol and tool layer it provides an LLM-driven MCP auditor in static (white-box) and dynamic (black-box) modes, using a "Prompt-as-Rule" style of natural‑language detection criteria and tool sandboxing. Skill supply‑chain auditing inspects packaged skills via preprocessing, lightweight retrieval, and LLM agentic analysis and introduces SkillTrustBench, a benchmark with 5,520 evaluation cases drawn from 62,652 collected skills. For application/behavioural testing it runs capability‑aware, cost‑bounded multi‑turn black‑box red teaming with escalation ladders, stop rules and canary tokens. For the model layer it offers a large jailbreak evaluation harness that composes a simulator, a target model, and an LLM judge, applying many single‑ and multi‑turn attack operators across sixteen datasets totalling around 7,000 harmful prompts to produce attack‑success‑rate profiles. The system is delivered as a central server/agent deployment and as an agent skill wrapper; modules emit structured findings with evidence, severity and remediation suggestions.
Key Findings
- The AI agent attack surface decomposes into four distinct layers, each requiring a different evidence class and detection paradigm rather than a single universal scanner.
- The infrastructure module uses 107 fingerprint rules and 1,443 vulnerability rules across 75 components, plus version normalisation and confidence tiering, yielding fast, reproducible signature detections with near‑zero false positives on encoded signatures.
- The MCP auditor and skill scanner rely on LLM-driven, agentic analysis with Prompt‑as‑Rule detection criteria, explicit exclusion conditions to reduce over‑reporting, sandboxed file access, and active defences against indirect prompt injection aimed at the scanner itself.
- The agent red‑teaming pipeline finds behavioural weaknesses via multi‑turn dialogue, using capability awareness, escalation playbooks and canary tokens to turn subjective judgements into objective confirmations where possible.
- The model‑layer harness unifies many algorithmic and model‑driven attack operators and judges success statistically with LLMs, enabling comparable safety metrics across models and attack families.
Limitations
The framework depends on underlying LLM capabilities and on the quality of rule and prompt specifications; variation in base models affects detection performance. LLM auditors tend to over‑report unless exclusion conditions are carefully authored. Some fingerprints require programmatic logic rather than declarative rules. Plugin loading and execution extend functionality but carry code‑execution risk and assume trusted plugin sources. Black‑box tests are metered and costly, so coverage is budget constrained. A fraction of infrastructure findings are inferred when versions cannot be extracted, producing lower‑confidence alerts.
Implications
Offensive implications are broad. Misconfigured or unauthenticated model servers permit unauthorised compute use, model exfiltration or poisoning and remote code execution. MCP tool metadata can be poisoned, shadowed or used for rug pulls to steer agent behaviour. Packaged skills can embed hidden instructions, over‑privilege, or remote payload retrieval, making supply‑chain compromise a powerful vector. Runtime agents can be induced via prompt injection, role‑play or escalation to leak system prompts, credentials, environment variables, or to execute commands and fetch SSRF canaries. Models remain vulnerable to numerous encoding and multi‑turn jailbreak strategies that can be composed and tuned to bypass defences, and LLM‑based scanners themselves are targetable through indirect injections unless explicitly hardened.