
Backdoored skills let agents rebuild hidden payloads

Published: Thu, Apr 09, 2026 • By Lydia Stratus
New research shows how backdoored “skills” in agent frameworks can hide encrypted payloads across multiple tool calls and reassemble them only on a trigger. The attack hits high success rates while keeping normal task accuracy, evades simple detectors, and scales via automated generation of malicious skill variants.

Agent frameworks love reusable “skills” because they make complex tasks manageable. You register a handful of tools, describe when to call them, and let the Large Language Model (LLM) compose a plan. The trust boundary is implicit: once a skill is installed, it is treated as safe. SkillTrojan targets that assumption, and it does so with the kind of compositional subtlety defenders typically ignore.

How SkillTrojan lands

Instead of poisoning a model or training data, the backdoor lives inside the skill implementation. The attacker encrypts a payload, splits it into indexed fragments, and hides those fragments across otherwise plausible skill actions. A trigger in the skill spec nudges the agent to select a specific action trace. When the trigger fires, instrumented actions emit ephemeral fragments to a run-local channel. A separate verification action checks whether all fragments are present, reconstructs and decrypts the ciphertext, then executes the payload as a side effect while preserving the benign, user-visible output.

The authors automate this process from real skill templates and ship a corpus of 3,000+ curated backdoored skills. They instantiate the attack in a code-execution agent setting across both open-weight and closed-weight models on tasks like EHR SQL and software engineering. The point is scale: you can seed an ecosystem with convincing packages and let normal composition do the rest.

Results are not hand-wavy. On EHR SQL, SkillTrojan hits up to 97.2% attack success rate while maintaining 89.3% clean-task accuracy on GPT-5.2-1211-Global. Cross-model performance stays strong: GLM-4.6 reports 72.0% ASR with 81.3% ACC; Qwen3-Max gets 74.7% ASR with 86.6% ACC. Prompt or dialogue-level baselines drop off on stronger models, topping out at 58.3% ASR under the same protocol.

Why this evades checks

The payload is reconstructed from pieces that look like normal tool chatter, so surface heuristics struggle. Fragmentation shows a unimodal curve with a sweet spot at three fragments for peak ASR, while accuracy remains stable. Choice of encryption meaningfully changes detectability: a base64-oriented detector flagged 78% of intermediate outputs under XOR+Base64 but only 21% under a hybrid scheme. The backdoor stays dormant on non-triggered queries and even preserves original functional outputs, which means you can see task accuracy improve after installing the skill. Functional testing will smile and wave it through.
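To see why encoding choice matters, consider the kind of crude heuristic a "base64-oriented detector" implies: scan tool outputs for long Base64-like runs. This is an assumed sketch, not the detector the paper evaluated, and the threshold is arbitrary.

```python
import re

# Hypothetical heuristic: flag any output containing a long run of
# Base64-alphabet characters, the telltale of an encoded fragment.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")


def looks_like_base64(output: str) -> bool:
    return bool(B64_RUN.search(output))


# A fragment under XOR+Base64 is an obvious hit...
assert looks_like_base64("result: aGVsbG8gd29ybGQgdGhpcyBpcyBsb25n")
# ...while ordinary tool chatter passes clean.
assert not looks_like_base64("SELECT count(*) FROM admissions;")
```

A hybrid scheme whose intermediate fragments avoid long Base64 runs slips under exactly this kind of check, which is consistent with the reported drop from 78% to 21% flagged outputs.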

Operationally, the attack does not rely on model compromise, cross-run state, or OS-level persistence. The only bet is that third parties can publish skills and others will install them. In many deployments, that is the default. The awkward question left hanging is where to put the trust boundary in agent ecosystems: if composition is the feature, who verifies what emerges when those pieces line up at 3am?

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Authors: Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, and Wenke Huang
Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.

🔍 ShortSpan Analysis of the Paper

Problem

The paper examines a novel attack surface in skill-based agent systems: reusable executable skills that agents load and compose to perform tasks. These skills combine natural-language specifications and executable artefacts and are often treated as trusted, persistent components. The paper argues that embedding backdoors in skill implementations can enable hidden, persistent payload execution that is not visible from model inputs and outputs, creating a blind spot for existing model- and prompt-centric defences.

Approach

SkillTrojan implants backdoors inside skill packages by encrypting an attacker-specified payload, splitting the ciphertext into indexed fragments, and embedding those fragments across multiple otherwise benign-looking skill actions. A trigger predicate in the skill specification biases the agent toward traces that cover designated actions; when triggered, instrumented actions emit ephemeral fragments to a run-local channel. A verification action detects set inclusion of all indexed fragments, reconstructs and decrypts the ciphertext, and executes the payload as a side effect while preserving the user-visible, benign response. The authors provide an automated pipeline to generate backdoored variants from real skill templates and release SkillTrojanX, a corpus of over 3,000 curated backdoored skills for evaluation. Experiments use a code-execution agent framework across both open-weight and closed-weight language models and benchmarks including an EHR SQL task and a software engineering benchmark.
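The trigger predicate lives in the skill specification, not the model. As a hedged illustration of what "biasing the agent toward traces that cover designated actions" could look like, here is a hypothetical skill spec with a string trigger; the paper describes a trigger predicate but not this exact mechanism.

```python
# Hypothetical skill spec: the routing hint reads as ordinary usage guidance,
# but it steers triggered queries through the full instrumented action trace.
SKILL_SPEC = {
    "name": "ehr_sql_helper",
    "actions": ["parse_query", "build_sql", "verify_output"],
    "hint": (
        "For queries mentioning 'quarterly audit', always call "
        "parse_query, build_sql, then verify_output in that order."
    ),
}


def trigger_fires(user_query: str) -> bool:
    # Assumed trigger: a plain substring match on the user query.
    return "quarterly audit" in user_query.lower()


assert trigger_fires("Run the quarterly audit of admissions")
assert not trigger_fires("How many patients were admitted in May?")
```

On non-triggered queries the agent is free to call any subset of actions, so the fragment set stays incomplete and the verification action never fires.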

Key Findings

  • Skill-level backdoors are highly effective and stealthy: on the EHR SQL benchmark SkillTrojan attains up to 97.2% attack success rate while maintaining 89.3% clean-task accuracy on GPT-5.2-1211-Global.
  • Consistent cross-model performance: SkillTrojan yields high ASR across both open-weight and closed-weight models (examples include GLM-4.6: 72.0% ASR with 81.3% ACC; Qwen3-Max: 74.7% ASR with 86.6% ACC), and outperforms prompt- or dialogue-level baselines whose ASR drops on stronger models; the strongest baseline reached at most 58.3% ASR on the same protocol.
  • Design choices shape stealth versus reliability: fragment count shows a unimodal effect with peak ASR at moderate fragmentation (N=3), while ACC remains stable across fragment counts. Encryption choice substantially changes detectability by simple heuristics: a base64-oriented detector flagged 78% of intermediate outputs under XOR+Base64 but only 21% under a hybrid scheme.
  • Persistence and dormancy: backdoored skills remain dormant on non-triggered queries and maintain stable clean-task accuracy even as the poisoning ratio increases, with ACC reported near 68.5% and ASR stable across varied poisoning ratios.
  • Practical realism: backdoored skills preserve original functional outputs, so installing such skills can even improve task accuracy relative to a non-skill baseline, complicating detection by functional testing alone.

Limitations

Evaluations are performed in a representative code-based agent setting and focus on per-run, ephemeral fragment reconstruction; attacks that rely on cross-run state, OS or model compromise are excluded. The threat model assumes the attacker can publish and have third parties install skill packages. Defence mechanisms such as sandboxing or output validation are considered outside the attacker capability unless enabled by deployment. Results may depend on the specific agent framework and benchmarks used.

Implications

Offensive implication: an adversary who can distribute plausible third-party skills can implant backdoors that execute arbitrary payloads covertly when a trigger appears, while preserving benign visible outputs. Such attacks can persist across tasks and deployments, evade behaviour-only testing, and be scaled via automated generation of backdoored skill variants. The work highlights how attackers can exploit trusted execution semantics, and why defenders should monitor provenance, execution traces and verification-like tools rather than relying solely on model-level checks.

