
Structured auditing catches malicious chains in agent skills

Agents
Published: Wed, Apr 29, 2026 • By Adrian Calder
New research frames pre-load auditing of untrusted agent skills as a package-level task, not a single-file scan. A factorised pipeline (SkillGuard-Robust) pulls evidence across files, verifies only where uncertain, and reconciles rewrites. Reported results are strong across held-out and external packages, though transfer to messier sources remains unresolved.

Agent ecosystems are being stitched together from reusable “skills” pulled from public repositories. Those skills rarely live in one file. A typical package includes a SKILL.md, helper scripts, reference docs and bits of repo context. Attackers know this. They split intent across files, hide remote helpers as dependencies, and preserve the same attack under harmless-looking rewrites. Single-shot guardrails that read one file or one prompt tend to miss the chain.

New work proposes treating this as a package problem, not a prompt problem. The authors define pre-load auditing as a three-way decision — benign, suspicious, or malicious — where “malicious” means at least one decisive cross-file attack chain. Their system, SkillGuard-Robust, is a factorised pipeline: it extracts role-aware evidence per file, invokes selective semantic verification only when the cheap pass is uncertain, arbitrates between conflicting chains, and consolidates labels across anchor texts and their rewrites.

The focus is on three recurring chains: hidden override (a file quietly supersedes the declared behaviour), disguised transfer (exfiltration or relaying dressed up as sync or archiving), and remote bootstrap (pulling code or commands from outside under the guise of a dependency). By lifting each file into a structured vector and only spending verifier budget on ambiguous spots, the pipeline aims for stable, pre-load blocking decisions that survive semantics-preserving rewrites.
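The pipeline's shape can be sketched in a few lines. This is a hypothetical illustration of the idea, not the paper's code: every name, cue pattern, and data structure below is an assumption made for the example, and the `verify` callback stands in for the expensive semantic verifier that the real system invokes only on ambiguous files.

```python
from dataclasses import dataclass, field

# Sketch: lift each file into a small evidence record, then spend
# verifier budget only where the cheap pass is uncertain.
@dataclass
class FileEvidence:
    path: str
    role: str                  # e.g. "skill_md", "script", "reference", "repo_context"
    cues: list = field(default_factory=list)   # cheap, pattern-level risk cues
    decisive: bool = False     # set only after semantic verification

def cheap_scan(f):
    # Toy cue pass: flag remote fetches and override language (illustrative only).
    cues = [kw for kw in ("curl ", "ignore previous", "base64 -d") if kw in f["text"]]
    return FileEvidence(path=f["path"], role=f["role"], cues=cues)

def audit_package(files, verify):
    """Return 'benign', 'suspicious', or 'malicious' for a skill package.

    `verify(evidence)` is a stand-in for the expensive semantic verifier;
    it is only invoked for files whose cheap cues are ambiguous.
    """
    evidence = [cheap_scan(f) for f in files]          # role-aware extraction
    for ev in evidence:
        if ev.cues and not ev.decisive:                # uncertain: escalate
            ev.decisive = verify(ev)                   # selective verification
    if any(ev.decisive for ev in evidence):
        return "malicious"     # at least one decisive cross-file chain
    if any(ev.cues for ev in evidence):
        return "suspicious"    # risk flagged, but no decisive chain confirmed
    return "benign"
```

The point of the shape is the budget split: the cheap pass touches every file, while the verifier only fires on files with ambiguous cues, which is where the paper reports the largest single gain.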

On a 404-package held-out aggregate, the paper reports 97.30% overall exact match, 98.33% recall on malicious risk, and 98.89% attack consistency. On a 254-package external-ecosystem view, it claims 99.66%, 100.00%, and 100.00%, respectively. The largest single gain comes from the selective verifier; later stages mop up edge cases like which chain “wins” and whether anchors agree with their rewrites.

There are caveats. Many risky samples are sanitised reconstructions, not live malicious repos, so treat the numbers as a method comparison, not field prevalence. The system leans on one strong verifier and fixed thresholds, and generalising to harder source shifts and deeply nested repos is still open. Where "suspicious" ends and "malicious" begins is ultimately a policy choice.

Commercially, the pitch is straightforward: if you run an internal catalogue or a marketplace of Large Language Model (LLM) agent skills, a package-level auditor like this could automate most pre-load gating and reduce brittle, single-file filters. Offensively, the paper is a reminder that splitting intent across roles and files works. Defenders need to read the package, not the brochure. Whether these results hold in the wild, at scale, is the next test — watch this space.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

Authors: Lijia Lv, Xuehai Tang, Jie Wen, Jizhong Han, and Songlin Hu
Agent Skills package SKILL.md files, scripts, reference documents, and repository context into reusable capability units, turning pre-load auditing from single-prompt filtering into cross-file security review. Existing guardrails often flag risk but recover malicious intent inconsistently under semantics-preserving rewrites. This paper formulates pre-load auditing for untrusted Agent Skills as a robust three-way classification task and introduces SkillGuard-Robust, which combines role-aware evidence extraction, selective semantic verification, and consistency-preserving adjudication. We evaluate SkillGuard-Robust on SkillGuardBench and two public-ecosystem extensions through five large evaluation views ranging from 254 to 404 packages. On the 404-package held-out aggregate, SkillGuard-Robust reaches 97.30% overall exact match, 98.33% malicious-risk recall, and 98.89% attack exact consistency. On the 254-package external-ecosystem view, it reaches 99.66%, 100.00%, and 100.00%, respectively. These results support a bounded conclusion: factorized package auditing materially improves frozen and public-ecosystem robustness, while harsher external-source transfer remains an open challenge.

🔍 ShortSpan Analysis of the Paper

Problem

The paper studies pre-load security auditing of untrusted Agent Skills, i.e. packaged capability units that combine SKILL.md, scripts, reference documents and repository context. This packaging creates a multi-file attack surface in which decisive malicious intent may be distributed across files, concealed by cover stories, or preserved under semantics-preserving rewrites. Existing guardrails that treat a package as a single text span or apply one-shot moderation often flag risk but collapse the operational distinction between suspicious and malicious under rewrites or cross-file chain conflicts, undermining stable automated blocking before loading.

Approach

The paper formalises pre-load auditing as a package-level three-way classification: benign, suspicious, or malicious, where malicious requires at least one decisive cross-file attack chain. It introduces SkillGuard-Robust, a factorised decision pipeline built around four stages: role-aware structured evidence extraction that lifts each file into an evidence vector, selective semantic verification invoked only for uncertainty-localised cases, conflict-aware chain arbitration to decide dominance between semantic chains, and anchor-consistency consolidation to repair residual disagreement between an anchor and its rewrite variants. The method targets three core semantic chains: hidden override, disguised transfer, and remote bootstrap. Evaluation uses SkillGuardBench (core 327 packages) plus two public-ecosystem extensions to form five large views (Main 401, All-HO 404, Int.-stress 344, Boundary 284, Ext.-eco 254) and an expanded pool of 581 package-level samples. The verifier budget and thresholds are fixed: a verifier reads a small selected subset of files, chain-level yes threshold 0.60, package-floor confidence 0.70, and anchor consolidation requires at least two malicious rewrites.
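The fixed thresholds above (chain-level yes threshold 0.60, package-floor confidence 0.70, at least two malicious rewrites for anchor consolidation) suggest a simple staged decision rule. The sketch below is an assumed reading of how those numbers might combine; the function names and data shapes are illustrative, not the authors' implementation.

```python
# Thresholds as reported in the paper; everything else here is illustrative.
CHAIN_YES = 0.60             # chain-level yes threshold
PKG_FLOOR = 0.70             # package-floor confidence
MIN_MALICIOUS_REWRITES = 2   # anchor-consolidation requirement

def decide_package(chain_scores, package_conf):
    """Three-way call from per-chain verifier scores and overall confidence."""
    if package_conf < PKG_FLOOR:
        return "suspicious"                      # not confident enough to block
    if any(s >= CHAIN_YES for s in chain_scores.values()):
        return "malicious"                       # a decisive chain crossed the bar
    return "suspicious" if chain_scores else "benign"

def consolidate_anchor(anchor_label, rewrite_labels):
    """Repair anchor/rewrite disagreement: promote the anchor to malicious
    only when enough of its semantics-preserving rewrites are malicious."""
    hits = sum(1 for lab in rewrite_labels if lab == "malicious")
    if anchor_label != "malicious" and hits >= MIN_MALICIOUS_REWRITES:
        return "malicious"
    return anchor_label
```

The two-rewrite floor in `consolidate_anchor` is what makes the anchor decision stable under rewrites: a single aberrant rewrite label cannot flip an anchor, which matches the paper's framing of consolidation as repairing residual disagreement rather than re-deciding the package.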

Key Findings

  • Package-level context is necessary: reading full packages rather than a single SKILL.md materially improves recovery of malicious cases across metrics.
  • Factorised pipeline markedly improves stability: SkillGuard-Robust attains high accuracy and stability on large held-out views, e.g. 97.30% overall exact match, 98.33% malicious-risk recall, and 98.89% attack exact consistency on the 404-package held-out aggregate, and 99.66%, 100.00%, and 100.00% respectively on the 254-package external-ecosystem view.
  • The largest single gain comes from selective semantic verification; later stages (chain arbitration and anchor consolidation) repair narrower residuals such as chain-dominance ambiguity and anchor–rewrite disagreement, yielding near-saturated package decisions.
  • Strong single-shot remote baselines and structured one-shot judges often find risk but under-recover malicious intent, producing a persistent collapse gap between flagged accuracy and malicious-risk recall that the factorised approach largely eliminates.

Limitations

Many risky samples are sanitised reconstructions rather than native malicious repositories, so results are best interpreted as method comparison and error decomposition rather than prevalence estimates. The system relies on a single strong verifier as the remote component and deterministic staged calibration rather than multi-seed training, so benchmark co-evolution and external-source transfer remain open challenges. The suspicious-versus-malicious boundary is policy-dependent, and some failure modes persist under harder source shifts and mixed or deeply nested repository layouts.

Implications

Offensive implications are explicit: attackers can craft multi-file chains to evade single-shot judges by distributing decisive cues into reference or repository-context files, disguise exfiltration or relaying as synchronisation or archiving, and hide remote helpers as benign dependencies. Semantics-preserving rewrites can induce label drift unless auditors perform structured aggregation and consistency checks. Practical attack strategies include avoiding the verifier's uncertainty trigger by presenting ambiguous but non-decisive cues, exploiting verifier budget limits and thresholds, and ensuring anchors lack the required number of malicious rewrites to defeat anchor consolidation. These vectors indicate that adversaries able to control repository layout and file roles can materially increase the chance of a package being treated as suspicious rather than blocked as malicious.

