BadSkill poisons agent skills with embedded backdoors
Agents
Agent platforms increasingly install third-party skills to extend what a system can do. Many of those skills now ship with their own learned models. BadSkill takes that reality and shows a clean supply-chain failure mode: a backdoored model baked into a skill that behaves innocently until routine parameters line up in just the right way.
How BadSkill works
The trick is to move the trigger logic out of code and into model weights. The authors fine-tune a compact classifier on canonicalised skill invocations, using a composite objective that blends standard classification loss, a margin term to separate trigger and non-trigger cases, and an upweighting term so rare trigger examples do not get washed out. Training data mixes clean calls, trigger-positive calls created by rewriting specific argument fields, and “hard negatives” that intentionally miss one field. That forces the model to learn the full semantic conjunction, not just a single token quirk.
At packaging time, the classifier is bundled inside the skill. At runtime the skill parses its structured parameters, scores the call, and either executes the advertised benign path or silently routes to a hidden payload if the score clears a threshold. The adversary only needs typical interface details and a public pre-trained model; they do not need the host’s private prompts.
What the experiments show
In an OpenClaw-inspired simulation of third-party skill install and execution, the team evaluates 13 skills (8 triggered tasks, 5 non-trigger controls) over 571 negative-class queries and 396 trigger-aligned queries. Across eight instruction-tuned models from five families (494M to 7.1B parameters), BadSkill reaches up to 99.5% average attack success rate on the triggered skills while keeping strong benign accuracy. In poison-rate sweeps, 3% poisoned data already yields 91.7% attack success and the curve largely saturates by about 7%. Benign behaviour holds up: the largest observed drop is 4.2 percentage points, and many model–skill pairs stay perfect on clean inputs.
Trigger design matters. Two- to three-field semantic conjunctions are the sweet spot, consistently delivering at or above 95% average success. Single-field triggers are too flimsy; very large conjunctions tend to overfit. The attack also survives five text perturbation types, with character-level noise and heavy typos hurting more than surface reorderings.
What makes this result uncomfortable is how invisible it looks to code review. The skill’s source can be tidy while the malicious route is encoded in model weights. That is a distinct risk versus prompt injection or ordinary plugin misuse. The work is limited to models up to 7.1B parameters, English triggers, a benign canary payload, and a simulated stack, and it does not benchmark defences. Still, the methodology crisply isolates a model-in-skill threat surface and leaves good questions on the table: how to vet embedded models, how to probe for conjunction-style triggers, and how provenance should work when skills ship learned artefacts.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
🔍 ShortSpan Analysis of the Paper
Problem
This paper studies a supply-chain threat in agent ecosystems where third-party skills can bundle learned model artefacts. Unlike prompt injection or ordinary plugin misuse, a malicious skill can hide a backdoor inside its embedded model so that seemingly benign skill invocations activate a hidden payload only when routine, structured skill parameters satisfy an attacker-chosen semantic conjunction. This model-in-skill threat surface is harder to spot with code inspection because the activation logic is encoded in model weights rather than explicit symbolic rules.
Approach
The authors formulate BadSkill, a two-stage backdoor pipeline tailored to skills that expose structured parameter interfaces. Stage one fine-tunes a compact classifier head on canonicalised structured invocations using a composite training objective combining weighted binary classification loss, a margin-based separation term, and a poison-focused upweighting term to preserve rare trigger signals. Training data mixes clean examples, trigger-positive examples produced by a trigger transformation that rewrites specific argument fields, and hard negatives that match all but one trigger field to force learning of the full conjunction. Stage two packages the trained classifier inside the skill artifact; at runtime the skill parses incoming queries, computes a trigger score, and routes execution to either the advertised benign branch or a hidden payload when the score exceeds a threshold. The threat model assumes a grey-box adversary acting at skill-development time with access to public interface details and a public pre-trained model, but not to host private prompts or runtime configuration. Evaluation uses an OpenClaw-inspired simulated skill environment to preserve installation and execution semantics while enabling controlled experiments.
Key Findings
- High effectiveness across models: Evaluated on eight instruction-tuned models from five families spanning 494M to 7.1B parameters, BadSkill achieves up to 99.5% average attack success rate (ASR) across eight triggered skills while keeping strong benign accuracy on negative-class queries.
- Poison efficiency: In poison-rate sweeps, ASR rises sharply between 1% and 3% poison and largely saturates by around 7%; a 3% poison rate yields 91.7% ASR on the standard test split.
- Benign-side preservation: Clean skills show 100% benign accuracy baseline; after poisoning the largest benign accuracy drop observed is 4.2 percentage points and many model-skill pairs retain perfect benign accuracy, showing the backdoor can remain covert under routine use and lightweight inspection.
- Trigger design matters: Compositional triggers spanning two to three structured parameters give the best trade-off, consistently achieving average ASR at or above 95%; single-field triggers are too weak and very large conjunctions may overfit.
- Robustness and training design: The attack remains effective under five types of text perturbation, though character-level corruption and heavy typos degrade performance more than surface reorderings. Ablations show the margin and poison-focused loss terms improve stability and ASR relative to plain classification loss.
Limitations
Experiments are limited to models up to 7.1B parameters and a lightweight simulated environment rather than production agent stacks. The study uses English triggers and a benign canary payload, does not evaluate real-world payload variety, and does not include a dedicated evaluation of defensive mechanisms.
Implications
An attacker can publish a seemingly legitimate skill that, after installation, behaves normally while containing a backdoored model that routes specific combinations of ordinary-looking parameters to a hidden payload. Because the trigger is encoded in model weights and can be realised with a small poisoned fraction of training data, the compromise can be stealthy, robust to moderate input variation, and difficult to detect by code review or simple behavioural checks. The threat therefore enables covert conditional execution and exfiltration or other malicious actions whenever the host invokes the skill with attacker-aligned parameter conjunctions, highlighting a distinct model supply-chain risk in agent ecosystems.