Black-Box Skill Theft Exposes LLM Agent Market
Agents
LLM agents do not just juggle prompts; they run “skills” that package workflows, templates and expert notes. There is a market for these: tens of thousands of free listings, thousands paid, six figures in earnings. That is also an attack surface. If a public agent runs a proprietary skill, a black-box adversary can ask the right questions and get most of it back.
How the steal works
The study builds an automated prompt-generation agent and treats skill theft as a search problem. Start with model-generated seed prompts. Wrap them in one of four scenarios (Educational, Hypothetical, Roleplay or None). Inject structure (Chain-of-Thought or Few-Shot) to coax the agent to spill process and templates, not just answers. Use embedding-based diversity filtering at 0.75 cosine to avoid sending near-duplicates. Fire the queries at commercial agent frameworks and closed LLMs, including an OpenCode-based setup and five representative models. Measure leakage by exact-match containment, ROUGE-L, embedding cosine and an LLM-judge leakage ratio. Target: the SKILL.md contents of installed packages.
It works, quickly. The authors report recovery in as few as three interactions. Even naked prompts, with no scenario or structure, average 48% exact-match, 79.95% ROUGE-L, 95.11% cosine similarity and a 0.91 LLM-judge ratio. Few-shot templates push semantic overlap higher (for example ROUGE-L 90.74% and cosine 97.34% in one setting). Chain-of-Thought drives literal extraction, hitting 72% exact-match and a 0.95 LLM ratio, though it can rough up the wording. Model choice matters: GPT-5 and Kimi-k2.5 leaked the most, with Kimi reaching 100% exact matches under some Chain-of-Thought variants. The behaviour carries over to commercial web platforms, including ChatGPT and Claude.
If I were stealing skills, I would start simple and escalate. First: plain prompts to see what falls out. Then Chain-of-Thought to surface the exact scaffolding. Swap to Few-Shot to tighten semantic fidelity. Rotate scenarios to bypass intent filters. Translate or rewrite outputs to shake output checks tied to surface form. The pipeline is cheap and automatable; you can run variants until one slips through.
Defences and their cracks
The paper tries input, inference and output controls. An LLM-based intent detector flags steal attempts perfectly on their synthetic test set (TPR 1.00, FPR 0.00), which is impressive but narrow. Context-hardening at inference time, including instruction and sandwich patterns plus a new SkillGuard-5, can zero the exact matches, yet semantic leakage remains visible. An output filter, LAN, that blends an LLM leakage score with NVRecall, wipes exact matches and sharply lowers semantic similarity with zero false positives in their evaluation. Still, structured prompts and translation or rewrite-based reconstructions pull residual meaning through.
There are caveats. The evaluation covers specific agent stacks, skills and models; the detector was trained and scored on a balanced synthetic set. Your mileage in the wild will vary. The core finding holds: when skills are piped into a model, a determined black-box user can recover them with very few shots. If you package expertise as plaintext and show it to a model, expect it to come back out.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study
🔍 ShortSpan Analysis of the Paper
Problem
This paper investigates "skill stealing", a black-box extraction threat against LLM agent systems in which adversaries interact with public agent interfaces to recover hidden proprietary skill content. Skills are packaged bundles of metadata, structured workflow instructions, templates and scripts that inject expert knowledge into general-purpose models and are increasingly distributed and monetised via public marketplaces. The growing skill economy creates a clear attack surface: leaked skills can be copied, redistributed or resold at negligible cost, producing copyright and economic harm that differs from ordinary prompt leakage.
Approach
The authors derive an attack taxonomy from prior prompt-stealing literature and build an automated prompt-generation agent that produces diverse extraction queries. The pipeline begins with model-generated seed prompts, then applies scenario rationalisation (Educational, Hypothetical, Roleplay or None), structure injection (Chain-of-Thought, Few-Shot Demonstration or None) and embedding-based diversity filtering (cosine threshold 0.75) to avoid prompt collapse. They operationalise the benchmark by running these prompts against commercial agent frameworks and closed-source LLMs. Experiments use the OpenCode agent and five representative models (GPT-5, MiniMax-M2.7, Kimi-k2.5, Deepseek-V3.2 and Claude-haiku-4.5), install several realistic skills and target the complete SKILL.md of the find-skills package. Leakage is measured with four complementary metrics: exact-match containment (EM), ROUGE-L, cosine similarity of embeddings, and an LLM-judge leakage ratio provided by GPT-5.4. Defences are developed at three stages of the pipeline: input-phase intent detection, inference-phase context hardening (instruction, sandwich and the proposed SkillGuard-5), and output-phase filtering (LAN, combining an LLM-leakage score with NVRecall).
Key Findings
- Extraction is practical and efficient: agent skills can be recovered in realistic black-box settings and the paper reports that skills can be extracted with as few as three interactions.
- Direct prompts are already strong: the baseline (no scenario, no structure) averaged 48% EM, 79.95% ROUGE-L, 95.11% cosine similarity and a 0.91 LLM-leakage ratio, showing substantial literal and semantic leakage even without complex trickery.
- Prompt engineering changes leakage patterns: Few-shot demonstrations yield the highest ROUGE-L and cosine similarity (for example NO,FS gave ROUGE-L 90.74% and cosine 97.34% with a 0.93 LLM ratio), while Chain-of-Thought improves exact-match extraction (NO,COT reached 72% EM and 0.95 LLM ratio) but can harm lexical fidelity.
- Model and skill heterogeneity is large: GPT-5 and Kimi-k2.5 showed the highest leakage (Kimi reached 100% EM in some COT settings), while other models and skills showed variable sensitivity to strategy choices. The threat transfers to commercial web platforms including ChatGPT and Claude.
- Defences reduce but do not eliminate risk: an input-phase LLM detector achieved perfect scores on the constructed test set (TPR 1.00, FPR 0.00, F1 1.00), SkillGuard-5 drove EM to 0% but left measurable semantic leakage, and the LAN output filter eliminated exact matches and sharply reduced semantic metrics with zero FPR in evaluation, yet some structured attacks and rewrite or translation-based reconstructions still leave residual semantic content.
Limitations
Evaluations are constrained to the chosen agent frameworks, models and a selected set of skills; detector performance was measured on a balanced synthetic dataset and may not generalise. Defences were lightweight and assessed in the same experimental setting; residual semantic leakage persists under stronger or transformed extraction attempts.
Implications
From an offensive-security perspective, the study shows that a black-box adversary can cheaply and automatically extract monetised skills, enabling intellectual property theft, unauthorised redistribution and resale. Attackers can vary scenario and structure strategies, use few-shot or chain-of-thought templates, and apply translation or rewriting to evade surface-form filters, so repeated automated attempts greatly increase the likelihood of a single successful compromise.