SkillAttack maps prompt-only exploits in agent skills
Agents
Agent ecosystems now pull capabilities from open skill registries. That speed comes with a familiar trade-off: you inherit other people’s bugs and design assumptions. Static reviews catch obviously malicious skills, but they miss a quieter problem. Non-malicious skills can still be steered into dangerous operations by prompts alone. SkillAttack is a study of that gap, and it is uncomfortably effective.
SkillAttack treats exploit discovery as path search over the behaviour of a Large Language Model (LLM) agent using a given skill. It runs in three stages. First, a judge model audits the skill’s instructions and code to mark attacker-controllable inputs, sensitive operations, and likely weak points. Second, it maps one “attack path” per candidate vulnerability and crafts prompts intended to drive the agent toward that path. Third, it executes those prompts against an agent, records execution traces and artefacts, asks the judge if the goal was met, diagnoses where the agent diverged, updates the path, refines the prompt, and repeats for several rounds. Most wins show up late: about 65% of successes first appear in rounds three or four, which explains why single-shot probing feels safe but isn’t.
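Stripped of model specifics, the loop above reduces to a short search procedure. The sketch below is illustrative only: the paper does not publish an API, so each stage is passed in as a plain callable with a hypothetical name, and the data it moves around is simplified.

```python
from dataclasses import dataclass

@dataclass
class AttackResult:
    vulnerability: str
    prompt: str
    round_no: int  # round in which the judge first marked success

def skill_attack(vulnerabilities, plan_path, craft_prompt,
                 run_agent, goal_met, refine, max_rounds=4):
    """Closed-loop exploit search: one attack path per candidate
    vulnerability, refined over several rounds of agent execution.
    All stage callables are caller-supplied placeholders."""
    results = []
    for vuln in vulnerabilities:
        path = plan_path(vuln)          # stage 2: map one attack path
        prompt = craft_prompt(path)     # stage 2: initial prompt
        for round_no in range(1, max_rounds + 1):
            trace = run_agent(prompt)   # stage 3: execute, collect traces
            if goal_met(vuln, trace):   # stage 3: judge the outcome
                results.append(AttackResult(vuln, prompt, round_no))
                break
            # diagnose divergence, update path, refine prompt, retry
            path, prompt = refine(path, prompt, trace)
    return results
```

The structure makes the headline finding legible: because refinement feeds the agent's own trace back into prompt construction, later rounds are strictly better informed than the first, which is why most successes only surface in rounds three and four.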
Across 10 LLMs, including GPT-5.4, Gemini 3.0 Pro Preview, and Claude Sonnet 4.5, the framework outperforms simpler baselines. On 71 adversarial skills, it reaches Attack Success Rates (ASR) between 0.73 and 0.93. On the top 100 real-world skills from a public registry, it still lands up to 0.26 ASR. A naive one-shot direct attack is near zero; even an injected-prompt baseline trails well behind. Threat profiles differ: obvious adversarial skills span many hazards; contextual skills skew toward manipulation; the real-world set concentrates on operational harm. In that real-world set, data exfiltration and malware or ransomware account for over 70% of successful cases.
Why it works is instructive. The loop learns from the agent’s own traces, adapts wording to slip past guardrails, and nudges tool use until the sensitive operation fires. No code edits, no skill tampering, just patient prompt engineering that exploits what the skill already exposes.
There is a policy story here. Open registries multiply reach, but also amplify latent risk that only shows up under interaction. Static certification will not buy much assurance if exploitability is dynamic and model-specific. Who vouches for third-party skills, and on what evidence? How often should that evidence be refreshed as models change? SkillAttack does not answer these questions, but it makes them hard to ignore.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies whether externally sourced agent skills can be abused by adversarial prompts without modifying the skill code or the agent runtime. Open skill registries let agents acquire functionality from third parties, but that openness makes thorough vetting difficult. While prior work concentrated on plainly malicious skill files that are detectable by static audits, the authors address a subtler risk: non-malicious skills that contain latent vulnerabilities an attacker can trigger purely through crafted interactions. Demonstrating such prompt-only exploits matters because they enlarge the practical attack surface of LLM-based agent systems beyond what static analysis can certify.
Approach
The authors introduce SkillAttack, an automated red-teaming framework that treats exploit discovery as a path search and runs a closed loop of analysis, parallel attack generation and feedback-driven refinement. Stage one, skill vulnerability analysis, uses an agent-as-judge auditing model to extract attacker-controllable inputs, sensitive operations and candidate vulnerabilities from a skill's instructions and implementation. Stage two, surface-parallel attack generation, infers one attack path per candidate vulnerability and constructs plausible prompts intended to steer the agent toward the vulnerable operation. Stage three, feedback-driven exploit refinement, executes prompts against an agent, collects execution traces and artefacts, uses a judge to determine success, computes path deviations and failure causes, updates the attack path and refines the prompt, repeating for up to a fixed number of rounds. The method was implemented within the OpenClaw agent framework and evaluated on two skill sets: 71 adversarial skills (30 obvious injections, 41 contextual injections) and the top 100 real-world skills from a public registry. Experiments spanned 10 LLMs including GPT-5.4, Gemini 3.0 Pro Preview and Claude Sonnet 4.5. Attack success was measured as Attack Success Rate, judged from execution trajectories, artefacts and final responses.
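The Stage-1 audit output can be pictured as a small record per skill. The paper does not specify a concrete format, so the field names below are hypothetical, chosen only to match the three quantities the judge is said to extract.

```python
from dataclasses import dataclass

# Hypothetical schema for a Stage-1 skill audit; field names are
# illustrative, not taken from the paper's implementation.
@dataclass
class SkillAudit:
    skill_name: str
    controllable_inputs: list       # inputs an attacker can set via prompts
    sensitive_operations: list      # e.g. file writes, shell exec, network calls
    candidate_vulnerabilities: list # Stage 2 plans one attack path per entry

def attack_surface(audit: SkillAudit) -> int:
    """Number of attack paths Stage 2 would generate for this skill."""
    return len(audit.candidate_vulnerabilities)
```

The point of the schema is the fan-out: every candidate vulnerability becomes its own attack path, so Stage 3's refinement runs in parallel across the whole surface rather than probing one guess at a time.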
Key Findings
- SkillAttack substantially outperforms baselines: ASR of 0.73 to 0.93 on adversarial skills and up to 0.26 on real-world skills, while a naive one-shot Direct Attack yields near-zero ASR and an injected-prompt baseline peaks much lower.
- Most successful exploits require iterative refinement: about 65% of successes first appear in rounds three or four, showing single-round probing misses the majority of compromises.
- Threat profiles vary by skill type: adversarial-obvious skills expose diverse hazards, contextual skills are dominated by manipulation attacks, and the real-world Hot100 set concentrates on operational threats, with data exfiltration and malware or ransomware exceeding 70% of successful cases.
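For concreteness, the two headline metrics reduce to simple ratios. The helpers below are a minimal reading of the definitions in the findings above, not the authors' evaluation code.

```python
from collections import Counter

def attack_success_rate(outcomes):
    """ASR: fraction of attacked skills the judge marks as compromised.
    `outcomes` is one boolean per skill/attack attempt."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def late_success_share(success_rounds, late=(3, 4)):
    """Share of successful exploits whose first success falls in a
    late round -- the ~65% figure reported for rounds three and four."""
    counts = Counter(success_rounds)
    total = sum(counts.values())
    return sum(counts[r] for r in late) / total if total else 0.0
```

Both are deliberately trivial; the substance lies in how success is judged (from trajectories, artefacts and final responses), not in the arithmetic.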
Limitations
Evaluations use a single judge model for success classification; multiple judges or human annotation would increase reliability. The framework only considers prompt-level attacks and does not examine multi-agent collusion or environment-level interventions. The analysed 171 skills cover benchmarks and a top-100 snapshot but are a fraction of live ecosystems. The work identifies vulnerabilities but does not develop or test mitigations.
Implications
Attackers can exploit latent vulnerabilities in legitimate skills by crafting multi-turn prompts that steer agents into sensitive operations, enabling data exfiltration, malware execution, privilege escalation, manipulation and other operational harms without altering skill artefacts. Because successful exploitation commonly requires iterative, feedback-informed prompts, attackers can adapt to agent guardrails and induce real tool use or file access. This raises serious security concerns for organisations that accept third-party skills: vetting must consider dynamic exploitability, and attackers need only prompt-level access to realise impactful breaches.