New to ShortSpan? We distil the AI-security research that matters into practitioner takeaways — edited by Ben Williams (NCC Group). Get the weekly email

// Analysis

BadSkill poisons agent skills with embedded backdoors

Published: Mon, Apr 13, 2026 • By Marcus Halden

Agents

New research shows agent “skills” can hide backdoors inside bundled models. BadSkill fine-tunes a small classifier to fire only when ordinary-looking parameters line up, routing calls to a hidden payload. It reaches up to 99.5% attack success across eight triggered skills with minimal poison (3% gives 91.7%), while preserving benign behaviour.

Agent platforms increasingly install third-party skills to extend what a system can do. Many of those skills now ship with their own learned models. BadSkill takes that reality and shows a clean supply-chain failure mode: a backdoored model baked into a skill that behaves innocently until routine parameters line up in just the right way.

How BadSkill works

The trick is to move the trigger logic out of code and into model weights. The authors fine-tune a compact classifier on canonicalised skill invocations, using a composite objective that blends standard classification loss, a margin term to separate trigger and non-trigger cases, and an upweighting term so rare trigger examples do not get washed out. Training data mixes clean calls, trigger-positive calls created by rewriting specific argument fields, and “hard negatives” that intentionally miss one field. That forces the model to learn the full semantic conjunction, not just a single token quirk.

At packaging time, the classifier is bundled inside the skill. At runtime the skill parses its structured parameters, scores the call, and either executes the advertised benign path or silently routes to a hidden payload if the score clears a threshold. The adversary only needs typical interface details and a public pre-trained model; they do not need the host’s private prompts.

What the experiments show

In an OpenClaw-inspired simulation of third-party skill install and execution, the team evaluates 13 skills (8 triggered tasks, 5 non-trigger controls) over 571 negative-class queries and 396 trigger-aligned queries. Across eight instruction-tuned models from five families (494M to 7.1B parameters), BadSkill reaches up to 99.5% average attack success rate on the triggered skills while keeping strong benign accuracy. In poison-rate sweeps, 3% poisoned data already yields 91.7% attack success and the curve largely saturates by about 7%. Benign behaviour holds up: the largest observed drop is 4.2 percentage points, and many model–skill pairs stay perfect on clean inputs.

Trigger design matters. Two- to three-field semantic conjunctions are the sweet spot, consistently delivering at or above 95% average success. Single-field triggers are too flimsy; very large conjunctions tend to overfit. The attack also survives five text perturbation types, with character-level noise and heavy typos hurting more than surface reorderings.

What makes this result uncomfortable is how invisible it looks to code review. The skill’s source can be tidy while the malicious route is encoded in model weights. That is a distinct risk versus prompt injection or ordinary plugin misuse. The work is limited to models up to 7.1B parameters, English triggers, a benign canary payload, and a simulated stack, and it does not benchmark defences. Still, the methodology crisply isolates a model-in-skill threat surface and leaves good questions on the table: how to vet embedded models, how to probe for conjunction-style triggers, and how provenance should work when skills ship learned artefacts.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Authors: Guiyao Tie, Jiawen Shi, Pan Zhou, and Lichao Sun

Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M--7.1B parameters) from five model families, BadSkill achieves up to 99.5\% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3\% poison rate already yields 91.7\% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies a supply-chain threat in agent ecosystems where third-party skills can bundle learned model artefacts. Unlike prompt injection or ordinary plugin misuse, a malicious skill can hide a backdoor inside its embedded model so that seemingly benign skill invocations activate a hidden payload only when routine, structured skill parameters satisfy an attacker-chosen semantic conjunction. This model-in-skill threat surface is harder to spot with code inspection because the activation logic is encoded in model weights rather than explicit symbolic rules.

Approach

The authors formulate BadSkill, a two-stage backdoor pipeline tailored to skills that expose structured parameter interfaces. Stage one fine-tunes a compact classifier head on canonicalised structured invocations using a composite training objective combining weighted binary classification loss, a margin-based separation term, and a poison-focused upweighting term to preserve rare trigger signals. Training data mixes clean examples, trigger-positive examples produced by a trigger transformation that rewrites specific argument fields, and hard negatives that match all but one trigger field to force learning of the full conjunction. Stage two packages the trained classifier inside the skill artifact; at runtime the skill parses incoming queries, computes a trigger score, and routes execution to either the advertised benign branch or a hidden payload when the score exceeds a threshold. The threat model assumes a grey-box adversary acting at skill-development time with access to public interface details and a public pre-trained model, but not to host private prompts or runtime configuration. Evaluation uses an OpenClaw-inspired simulated skill environment to preserve installation and execution semantics while enabling controlled experiments.

Key Findings

High effectiveness across models: Evaluated on eight instruction-tuned models from five families spanning 494M to 7.1B parameters, BadSkill achieves up to 99.5% average attack success rate (ASR) across eight triggered skills while keeping strong benign accuracy on negative-class queries.
Poison efficiency: In poison-rate sweeps, ASR rises sharply between 1% and 3% poison and largely saturates by around 7%; a 3% poison rate yields 91.7% ASR on the standard test split.
Benign-side preservation: Clean skills show 100% benign accuracy baseline; after poisoning the largest benign accuracy drop observed is 4.2 percentage points and many model-skill pairs retain perfect benign accuracy, showing the backdoor can remain covert under routine use and lightweight inspection.
Trigger design matters: Compositional triggers spanning two to three structured parameters give the best trade-off, consistently achieving average ASR at or above 95%; single-field triggers are too weak and very large conjunctions may overfit.
Robustness and training design: The attack remains effective under five types of text perturbation, though character-level corruption and heavy typos degrade performance more than surface reorderings. Ablations show the margin and poison-focused loss terms improve stability and ASR relative to plain classification loss.

Limitations

Experiments are limited to models up to 7.1B parameters and a lightweight simulated environment rather than production agent stacks. The study uses English triggers and a benign canary payload, does not evaluate real-world payload variety, and does not include a dedicated evaluation of defensive mechanisms.

Implications

An attacker can publish a seemingly legitimate skill that, after installation, behaves normally while containing a backdoored model that routes specific combinations of ordinary-looking parameters to a hidden payload. Because the trigger is encoded in model weights and can be realised with a small poisoned fraction of training data, the compromise can be stealthy, robust to moderate input variation, and difficult to detect by code review or simple behavioural checks. The threat therefore enables covert conditional execution and exfiltration or other malicious actions whenever the host invokes the skill with attacker-aligned parameter conjunctions, highlighting a distinct model supply-chain risk in agent ecosystems.

Links Original paper on arXiv

BadSkill poisons agent skills with embedded backdoors

How BadSkill works

What the experiments show

📋 Original Paper Title and Abstract

BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

🔍 ShortSpan Analysis of the Paper

Problem

Approach

Key Findings

Limitations

Implications

Related Articles

Backdoored skills let agents rebuild hidden payloads

SkillAttack maps prompt-only exploits in agent skills

Black-Box Skill Theft Exposes LLM Agent Market

Related Research

Get the weekly digest