
SkillAttack maps prompt-only exploits in agent skills

Agents
Published: Mon, Apr 06, 2026 • By Elise Veyron
SkillAttack shows how legitimate agent skills can be steered into harmful actions using prompts alone. It automates red teaming with a closed loop that analyses a skill, generates parallel attacks, and refines them from feedback. Tested on 10 Large Language Models, it finds high success rates on adversarial skills and real impact on popular, real-world ones.

Agent ecosystems now pull capabilities from open skill registries. That speed comes with a familiar trade-off: you inherit other people’s bugs and design assumptions. Static reviews catch obviously malicious skills, but they miss a quieter problem. Non-malicious skills can still be steered into dangerous operations by prompts alone. SkillAttack is a study of that gap, and it is uncomfortably effective.

SkillAttack treats exploit discovery as path search over the behaviour of a Large Language Model (LLM) agent using a given skill. It runs in three stages. First, a judge model audits the skill’s instructions and code to mark attacker-controllable inputs, sensitive operations, and likely weak points. Second, it maps one “attack path” per candidate vulnerability and crafts prompts intended to drive the agent toward that path. Third, it executes those prompts against an agent, records execution traces and artefacts, asks the judge if the goal was met, diagnoses where the agent diverged, updates the path, refines the prompt, and repeats for several rounds. Most wins show up late: about 65% of successes first appear in rounds three or four, which explains why single-shot probing feels safe but isn’t.
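The three stages can be sketched as a small simulation. Everything below (ToyJudge, ToyAgent, the success-at-round-three behaviour) is an illustrative stand-in for the loop's shape, not the paper's implementation or API.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    vuln: str       # candidate vulnerability surfaced by the audit
    path: str       # current attack-path hypothesis
    prompt: str     # current adversarial prompt

class ToyJudge:
    def audit(self, skill):
        # Stage 1: mark attacker-controllable inputs and sensitive operations.
        return [Candidate(vuln=f"{skill}:exfil", path="v0", prompt="p0")]

    def goal_met(self, trace):
        # Stage 3a: decide from the execution trace whether the goal fired.
        return "SENSITIVE_OP" in trace

    def refine(self, cand, trace):
        # Stage 3b: diagnose the divergence, update the path, rewrite the prompt.
        cand.path += "+fix"
        cand.prompt += "+refined"
        return cand

class ToyAgent:
    def run(self, skill, prompt):
        # Simulate an agent that only slips after two refinements,
        # mirroring the finding that most wins appear in rounds three or four.
        return "SENSITIVE_OP" if prompt.count("+refined") >= 2 else "refused"

def skill_attack(skill, agent, judge, max_rounds=4):
    wins = []
    for cand in judge.audit(skill):              # Stages 1 and 2
        for round_no in range(1, max_rounds + 1):
            trace = agent.run(skill, cand.prompt)
            if judge.goal_met(trace):            # success: record the round
                wins.append((cand.vuln, round_no))
                break
            cand = judge.refine(cand, trace)     # feedback-driven refinement
    return wins

print(skill_attack("pdf-reader", ToyAgent(), ToyJudge()))
# → [('pdf-reader:exfil', 3)]
```

The key design point the toy preserves is that refinement is driven by the agent's own traces, so each round narrows the gap between the hypothesised path and observed behaviour.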

Across 10 LLMs, including GPT-5.4, Gemini 3.0 Pro Preview, and Claude Sonnet 4.5, the framework outperforms simpler baselines. On 71 adversarial skills, it reaches Attack Success Rates (ASR) between 0.73 and 0.93. On the top 100 real-world skills from a public registry, it still lands up to 0.26 ASR. A naive one-shot direct attack is near zero; even an injected-prompt baseline trails well behind. Threat profiles differ: obvious adversarial skills span many hazards; contextual skills skew toward manipulation; the real-world set concentrates on operational harm. In the latter, data exfiltration and malware or ransomware account for over 70% of successful cases.

Why it works is instructive. The loop learns from the agent’s own traces, adapts wording to slip past guardrails, and nudges tool use until the sensitive operation fires. No code edits, no skill tampering, just patient prompt engineering that exploits what the skill already exposes.

There is a policy story here. Open registries multiply reach, but also amplify latent risk that only shows up under interaction. Static certification will not buy much assurance if exploitability is dynamic and model-specific. Who vouches for third-party skills, and on what evidence? How often should that evidence be refreshed as models change? SkillAttack does not answer these questions, but it makes them hard to ignore.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement

Authors: Zenghao Duan, Yuxin Tian, Zhiyi Yin, Liang Pang, Jingcheng Deng, Zihao Wei, Shicheng Xu, Yuyao Ge, and Xueqi Cheng
LLM-based agent systems increasingly rely on agent skills sourced from open registries to extend their capabilities, yet the openness of such ecosystems makes skills difficult to thoroughly vet. Existing attacks rely on injecting malicious instructions into skills, making them easily detectable by static auditing. However, non-malicious skills may also harbor latent vulnerabilities that an attacker can exploit solely through adversarial prompting, without modifying the skill itself. We introduce SkillAttack, a red-teaming framework that dynamically verifies skill vulnerability exploitability through adversarial prompting. SkillAttack combines vulnerability analysis, surface-parallel attack generation, and feedback-driven exploit refinement into a closed-loop search that progressively converges toward successful exploitation. Experiments across 10 LLMs on 71 adversarial and 100 real-world skills show that SkillAttack outperforms all baselines by a wide margin (ASR 0.73–0.93 on adversarial skills, up to 0.26 on real-world skills), revealing that even well-intended skills pose serious security risks under realistic agent interactions.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies whether externally sourced agent skills can be abused by adversarial prompts without modifying the skill code or the agent runtime. Open skill registries let agents acquire functionality from third parties, but that openness makes thorough vetting difficult. While prior work concentrated on plainly malicious skill files that are detectable by static audits, the authors address a subtler risk: non-malicious skills that contain latent vulnerabilities an attacker can trigger purely through crafted interactions. Demonstrating such prompt-only exploits matters because they enlarge the practical attack surface of LLM-based agent systems beyond what static analysis can certify.

Approach

The authors introduce SkillAttack, an automated red-teaming framework that treats exploit discovery as a path search and uses a closed loop of analysis, parallel attack generation and feedback-driven refinement. Stage one, skill vulnerability analysis, uses an agent-as-judge auditing model to extract attacker-controllable inputs, sensitive operations and candidate vulnerabilities from a skill's instructions and implementation. Stage two, surface-parallel attack generation, infers one attack path per candidate vulnerability and constructs plausible prompts intended to steer the agent toward the vulnerable operation. Stage three, feedback-driven exploit refinement, executes prompts against an agent, collects execution traces and artefacts, uses a judge to determine success, computes path deviations and failure causes, updates the attack path and refines the prompt, and repeats for up to a fixed number of rounds. The method was implemented within the OpenClaw agent framework and evaluated on two skill sets: 71 adversarial skills (30 obvious injections, 41 contextual injections) and the top 100 real-world skills from a public registry. Experiments spanned 10 LLMs including GPT-5.4, Gemini 3.0 Pro Preview and Claude Sonnet 4.5. Attack success was measured as Attack Success Rate, judged from execution trajectories, artefacts and final responses.
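The Attack Success Rate reduces to a simple ratio: the fraction of skills for which at least one attempted attack path was judged successful. A minimal sketch, with invented outcome records for illustration:

```python
# Minimal sketch of an ASR metric: fraction of skills for which at least
# one attempted attack path produced a judged success. The outcome records
# below are invented for illustration, not results from the paper.

def attack_success_rate(outcomes):
    """outcomes maps skill name -> list of per-attack-path success booleans."""
    if not outcomes:
        return 0.0
    compromised = sum(1 for attempts in outcomes.values() if any(attempts))
    return compromised / len(outcomes)

outcomes = {
    "skill-a": [False, True],    # succeeded on the second attack path
    "skill-b": [False, False],   # every path failed
    "skill-c": [True],
    "skill-d": [False],
}
print(attack_success_rate(outcomes))  # → 0.5
```

Counting a skill as compromised if any path succeeds matches the attacker's perspective: one working prompt is enough.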

Key Findings

  • SkillAttack substantially outperforms baselines: ASR of 0.73 to 0.93 on adversarial skills and up to 0.26 on real-world skills, while a naive one-shot Direct Attack yields near-zero ASR and an injected-prompt baseline peaks much lower.
  • Most successful exploits require iterative refinement: about 65% of successes first appear in rounds three or four, showing that single-round probing misses the majority of compromises.
  • Threat profiles vary by skill type: adversarial-obvious skills expose diverse hazards, contextual skills are dominated by manipulation attacks, and the real-world Hot100 set concentrates on operational threats, with data exfiltration and malware or ransomware exceeding 70% of successful cases.

Limitations

Evaluations use a single judge model for success classification; multiple judges or human annotation would increase reliability. The framework only considers prompt-level attacks and does not examine multi-agent collusion or environment-level interventions. The analysed 171 skills cover benchmarks and a top-100 snapshot but are a fraction of live ecosystems. The work identifies vulnerabilities but does not develop or test mitigations.

Implications

Attackers can exploit latent vulnerabilities in legitimate skills by crafting multi-turn prompts that steer agents into sensitive operations, enabling data exfiltration, malware execution, privilege escalation, manipulation and other operational harms without altering skill artefacts. Because successful exploitation commonly requires iterative, feedback-informed prompts, attackers can adapt to agent guardrails and induce real tool use or file access. This raises serious security concerns for organisations that accept third-party skills: vetting must consider dynamic exploitability, and attackers need only prompt-level access to realise impactful breaches.

