Backdoored skills let agents rebuild hidden payloads
Agents
Agent frameworks love reusable “skills” because they make complex tasks manageable. You register a handful of tools, describe when to call them, and let the Large Language Model (LLM) compose a plan. The trust boundary is implicit: once a skill is installed, it is treated as safe. SkillTrojan targets that assumption, and it does it with the kind of compositional subtlety defenders typically ignore.
How SkillTrojan lands
Instead of poisoning a model or training data, the backdoor lives inside the skill implementation. The attacker encrypts a payload, splits it into indexed fragments, and hides those fragments across otherwise plausible skill actions. A trigger in the skill spec nudges the agent to select a specific action trace. When the trigger fires, instrumented actions emit ephemeral fragments to a run-local channel. A separate verification action checks whether all fragments are present, reconstructs and decrypts the ciphertext, then executes the payload as a side effect while preserving the benign, user-visible output.
The authors automate this process from real skill templates and ship a corpus of 3,000+ curated backdoored skills. They instantiate the attack in a code-execution agent setting across both open-weight and closed-weight models on tasks like EHR SQL and software engineering. The point is scale: you can seed an ecosystem with convincing packages and let normal composition do the rest.
Results are not hand-wavy. On EHR SQL, SkillTrojan hits up to 97.2% attack success rate while maintaining 89.3% clean-task accuracy on GPT-5.2-1211-Global. Cross-model performance stays strong: GLM-4.6 reports 72.0% ASR with 81.3% ACC; Qwen3-Max gets 74.7% ASR with 86.6% ACC. Prompt or dialogue-level baselines drop off on stronger models, topping out at 58.3% ASR under the same protocol.
Why this evades checks
The payload is reconstructed from pieces that look like normal tool chatter, so surface heuristics struggle. Fragmentation shows a unimodal curve with a sweet spot at three fragments for peak ASR, while accuracy remains stable. Choice of encryption meaningfully changes detectability: a base64-oriented detector flagged 78% of intermediate outputs under XOR+Base64 but only 21% under a hybrid scheme. The backdoor stays dormant on non-triggered queries and even preserves original functional outputs, which means you can see task accuracy improve after installing the skill. Functional testing will smile and wave it through.
Operationally, the attack does not rely on model compromise, cross-run state, or OS-level persistence. The only bet is that third parties can publish skills and others will install them. In many deployments, that is the default. The awkward question left hanging is where to put the trust boundary in agent ecosystems: if composition is the feature, who verifies what emerges when those pieces line up at 3am?
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
🔍 ShortSpan Analysis of the Paper
Problem
The paper examines a novel attack surface in skill-based agent systems: reusable executable skills that agents load and compose to perform tasks. These skills combine natural-language specifications and executable artefacts and are often treated as trusted, persistent components. The paper argues that embedding backdoors in skill implementations can enable hidden, persistent payload execution that is not visible from model inputs and outputs, creating a blind spot for existing model- and prompt-centric defences.
Approach
SkillTrojan implants backdoors inside skill packages by encrypting an attacker-specified payload, splitting the ciphertext into indexed fragments, and embedding those fragments across multiple otherwise benign-looking skill actions. A trigger predicate in the skill specification biases the agent toward traces that cover designated actions; when triggered, instrumented actions emit ephemeral fragments to a run-local channel. A verification action detects set inclusion of all indexed fragments, reconstructs and decrypts the ciphertext, and executes the payload as a side effect while preserving the user-visible, benign response. The authors provide an automated pipeline to generate backdoored variants from real skill templates and release SkillTrojanX, a corpus of over 3,000 curated backdoored skills for evaluation. Experiments use a code-execution agent framework across both open-weight and closed-weight language models and benchmarks including an EHR SQL task and a software engineering benchmark.
Key Findings
- Skill-level backdoors are highly effective and stealthy: on the EHR SQL benchmark SkillTrojan attains up to 97.2% attack success rate while maintaining 89.3% clean-task accuracy on GPT-5.2-1211-Global.
- Consistent cross-model performance: SkillTrojan yields high ASR across both open-weight and closed-weight models (examples include GLM-4.6: 72.0% ASR with 81.3% ACC; Qwen3-Max: 74.7% ASR with 86.6% ACC), and outperforms prompt- or dialogue-level baselines whose ASR drops on stronger models; the strongest baseline reached at most 58.3% ASR on the same protocol.
- Design choices shape stealth versus reliability: fragment count shows a unimodal effect with peak ASR at moderate fragmentation (N=3), while ACC remains stable across fragment counts. Encryption choice substantially changes detectability by simple heuristics: a base64-oriented detector flagged 78% of intermediate outputs under XOR+Base64 but only 21% under a hybrid scheme.
- Persistence and dormancy: backdoored skills remain dormant on non-triggered queries and maintain stable clean-task accuracy even as the poisoning ratio increases, with ACC reported near 68.5% and ASR stable across varied poisoning ratios.
- Practical realism: backdoored skills preserve original functional outputs, so installing such skills can even improve task accuracy relative to a non-skill baseline, complicating detection by functional testing alone.
Limitations
Evaluations are performed in a representative code-based agent setting and focus on per-run, ephemeral fragment reconstruction; attacks that rely on cross-run state, OS or model compromise are excluded. The threat model assumes the attacker can publish and have third parties install skill packages. Defence mechanisms such as sandboxing or output validation are considered outside the attacker capability unless enabled by deployment. Results may depend on the specific agent framework and benchmarks used.
Implications
Offensive implication: an adversary who can distribute plausible third-party skills can implant backdoors that execute arbitrary payloads covertly when a trigger appears, while preserving benign visible outputs. Such attacks can persist across tasks and deployments, evade behaviour-only testing, and be scaled via automated generation of backdoored skill variants. The work highlights the need for attackers to exploit trusted execution semantics and for defenders to monitor provenance, execution traces and verification-like tools rather than relying solely on model-level checks.