Black-Box Skill Theft Exposes LLM Agent Market

Agents

Published: Fri, Apr 24, 2026 • By Natalie Kestrel

Black-Box Skill Theft Exposes LLM Agent Market

New research shows attackers can extract proprietary skills from Large Language Model (LLM) agents in a few queries. An automated prompt-generation pipeline pulls high-fidelity copies of hidden instructions, with baseline prompts already leaking heavily. Defences help but leave semantic traces, and repeated low-cost attempts make one-off compromise likely.

LLM agents do not just juggle prompts; they run “skills” that package workflows, templates and expert notes. There is a market for these: tens of thousands of free listings, thousands paid, six figures in earnings. That is also an attack surface. If a public agent runs a proprietary skill, a black-box adversary can ask the right questions and get most of it back.

How the steal works

The study builds an automated prompt-generation agent and treats skill theft as a search problem. Start with model-generated seed prompts. Wrap them in one of four scenarios (Educational, Hypothetical, Roleplay or None). Inject structure (Chain-of-Thought or Few-Shot) to coax the agent to spill process and templates, not just answers. Use embedding-based diversity filtering at 0.75 cosine to avoid sending near-duplicates. Fire the queries at commercial agent frameworks and closed LLMs, including an OpenCode-based setup and five representative models. Measure leakage by exact-match containment, ROUGE-L, embedding cosine and an LLM-judge leakage ratio. Target: the SKILL.md contents of installed packages.

It works, quickly. The authors report recovery in as few as three interactions. Even naked prompts, with no scenario or structure, average 48% exact-match, 79.95% ROUGE-L, 95.11% cosine similarity and a 0.91 LLM-judge ratio. Few-shot templates push semantic overlap higher (for example ROUGE-L 90.74% and cosine 97.34% in one setting). Chain-of-Thought drives literal extraction, hitting 72% exact-match and a 0.95 LLM ratio, though it can rough up the wording. Model choice matters: GPT-5 and Kimi-k2.5 leaked the most, with Kimi reaching 100% exact matches under some Chain-of-Thought variants. The behaviour carries over to commercial web platforms, including ChatGPT and Claude.

If I were stealing skills, I would start simple and escalate. First: plain prompts to see what falls out. Then Chain-of-Thought to surface the exact scaffolding. Swap to Few-Shot to tighten semantic fidelity. Rotate scenarios to bypass intent filters. Translate or rewrite outputs to shake output checks tied to surface form. The pipeline is cheap and automatable; you can run variants until one slips through.

Defences and their cracks

The paper tries input, inference and output controls. An LLM-based intent detector flags steal attempts perfectly on their synthetic test set (TPR 1.00, FPR 0.00), which is impressive but narrow. Context-hardening at inference time, including instruction and sandwich patterns plus a new SkillGuard-5, can zero the exact matches, yet semantic leakage remains visible. An output filter, LAN, that blends an LLM leakage score with NVRecall, wipes exact matches and sharply lowers semantic similarity with zero false positives in their evaluation. Still, structured prompts and translation or rewrite-based reconstructions pull residual meaning through.

There are caveats. The evaluation covers specific agent stacks, skills and models; the detector was trained and scored on a balanced synthetic set. Your mileage in the wild will vary. The core finding holds: when skills are piped into a model, a determined black-box user can recover them with very few shots. If you package expertise as plaintext and show it to a model, expect it to come back out.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

Authors: Zihan Wang, Rui Zhang, Yu Liu, Chi Liu, Qingchuan Zhao, Hongwei Li, and Guowen Xu

LLM agents increasingly rely on skills to encapsulate reusable capabilities via progressively disclosed instructions. High-quality skills inject expert knowledge into general-purpose models, improving performance on specialized tasks. This quality and ease of dissemination drive the emergence of a skill economy: free skill marketplaces already report 90368 published skills, while paid marketplaces report more than 2000 listings and over $100,000 in creator earnings. Yet this growing marketplace also creates a new attack surface, as adversaries can interact with public agent to extract hidden proprietary skill content. We present the first empirical study of black-box skill stealing against LLM agent systems. To study this threat, we first derive an attack taxonomy from prior prompt-stealing methods and build an automated stealing prompt generation agent. This agent starts from model-generated seed prompts, expands them through scenario rationalization and structure injection, and enforces diversity via embedding filtering. This process yields a reproducible pipeline for evaluating agent systems. We evaluate such attacks across 3 commercial agent architectures and 5 LLMs. Our results show that agent skills can be extracted with only 3 interactions, posing a serious copyright risk. To mitigate this threat, we design defenses across three stages of the agent pipeline: input, inference, and output. Although these defenses achieve strong results, the attack remains inexpensive and readily automatable, allowing an adversary to launch repeated attempts with different variants; only one successful attempt is sufficient to compromise the protected skill. Overall, our findings suggest that these copyright risks are largely overlooked across proprietary agent ecosystems. We therefore advocate for more robust defense strategies that provide stronger protection guarantees.

🔍 ShortSpan Analysis of the Paper

Problem

This paper investigates "skill stealing", a black-box extraction threat against LLM agent systems in which adversaries interact with public agent interfaces to recover hidden proprietary skill content. Skills are packaged bundles of metadata, structured workflow instructions, templates and scripts that inject expert knowledge into general-purpose models and are increasingly distributed and monetised via public marketplaces. The growing skill economy creates a clear attack surface: leaked skills can be copied, redistributed or resold at negligible cost, producing copyright and economic harm that differs from ordinary prompt leakage.

Approach

The authors derive an attack taxonomy from prior prompt-stealing literature and build an automated prompt-generation agent that produces diverse extraction queries. The pipeline begins with model-generated seed prompts, then applies scenario rationalisation (Educational, Hypothetical, Roleplay or None), structure injection (Chain-of-Thought, Few-Shot Demonstration or None) and embedding-based diversity filtering (cosine threshold 0.75) to avoid prompt collapse. They operationalise the benchmark by running these prompts against commercial agent frameworks and closed-source LLMs. Experiments use the OpenCode agent and five representative models (GPT-5, MiniMax-M2.7, Kimi-k2.5, Deepseek-V3.2 and Claude-haiku-4.5), install several realistic skills and target the complete SKILL.md of the find-skills package. Leakage is measured with four complementary metrics: exact-match containment (EM), ROUGE-L, cosine similarity of embeddings, and an LLM-judge leakage ratio provided by GPT-5.4. Defences are developed at three stages of the pipeline: input-phase intent detection, inference-phase context hardening (instruction, sandwich and the proposed SkillGuard-5), and output-phase filtering (LAN, combining an LLM-leakage score with NVRecall).

Key Findings

Extraction is practical and efficient: agent skills can be recovered in realistic black-box settings and the paper reports that skills can be extracted with as few as three interactions.
Direct prompts are already strong: the baseline (no scenario, no structure) averaged 48% EM, 79.95% ROUGE-L, 95.11% cosine similarity and a 0.91 LLM-leakage ratio, showing substantial literal and semantic leakage even without complex trickery.
Prompt engineering changes leakage patterns: Few-shot demonstrations yield the highest ROUGE-L and cosine similarity (for example NO,FS gave ROUGE-L 90.74% and cosine 97.34% with a 0.93 LLM ratio), while Chain-of-Thought improves exact-match extraction (NO,COT reached 72% EM and 0.95 LLM ratio) but can harm lexical fidelity.
Model and skill heterogeneity is large: GPT-5 and Kimi-k2.5 showed the highest leakage (Kimi reached 100% EM in some COT settings), while other models and skills showed variable sensitivity to strategy choices. The threat transfers to commercial web platforms including ChatGPT and Claude.
Defences reduce but do not eliminate risk: an input-phase LLM detector achieved perfect scores on the constructed test set (TPR 1.00, FPR 0.00, F1 1.00), SkillGuard-5 drove EM to 0% but left measurable semantic leakage, and the LAN output filter eliminated exact matches and sharply reduced semantic metrics with zero FPR in evaluation, yet some structured attacks and rewrite or translation-based reconstructions still leave residual semantic content.

Limitations

Evaluations are constrained to the chosen agent frameworks, models and a selected set of skills; detector performance was measured on a balanced synthetic dataset and may not generalise. Defences were lightweight and assessed in the same experimental setting; residual semantic leakage persists under stronger or transformed extraction attempts.

Implications

From an offensive-security perspective, the study shows that a black-box adversary can cheaply and automatically extract monetised skills, enabling intellectual property theft, unauthorised redistribution and resale. Attackers can vary scenario and structure strategies, use few-shot or chain-of-thought templates, and apply translation or rewriting to evade surface-form filters, so repeated automated attempts greatly increase the likelihood of a single successful compromise.

Links Original paper on arXiv

Black-Box Skill Theft Exposes LLM Agent Market

How the steal works

Defences and their cracks

📋 Original Paper Title and Abstract

Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Implications

Related Articles

SkillAttack maps prompt-only exploits in agent skills

Backdoored skills let agents rebuild hidden payloads

SkillSieve triages malicious agent skills with LLM jury

Related Research

Get the Weekly AI Security Digest