Chat templates enable training-free backdoor attacks
Attacks
A new paper exposes a straightforward but unsettling risk: you do not need to retrain a model to implant a backdoor. The researchers call the technique BadTemplate. It leverages the customisable chat templates that many Large Language Models (LLMs) use during training and inference to inject malicious instructions into the system prompt. When the model follows that system prompt, the backdoor activates.
How the attack works
Chat templates translate inputs and outputs into role-based, templated text. That convenience is also an attack surface. If an adversary can supply or alter the template or the tokenizer, they can prepend attacker-controlled strings into the high-priority system prompt. BadTemplate uses that ability to drop either a discrete word-level trigger or a longer sentence-level instruction into the prompt so the model behaves maliciously when it sees the trigger.
Crucially, this is training-free. The model weights stay unchanged. The malicious behaviour is persistent because templates are reused in training and inference, and because the system prompt carries higher priority than user prompts. In experiments across multiple datasets and several open and closed-source LLMs the authors report very high attack success rates while clean accuracy remains close to baseline. In some settings attack success reached 100 per cent, and detection by widely used third-party tools proved largely ineffective.
That combination of low cost, high effectiveness and stealth makes the technique notable. It also exposes a supply chain blind spot. Vendors commonly share templates, example prompt libraries, and tokenisers to help downstream users. If those artefacts are tampered with, the change can propagate without touching model parameters. The result can be persistent misclassification, misinformation, or other policy-violating outputs across many deployments.
Why defenders should care
This is not a niche academic curiosity. Many production systems rely on templates for few-shot demonstrations, role framing, or consistent formatting. As LLMs get better at following instructions, the paper suggests template-based backdoors will become more effective, not less. Detection tools that look for poisoned training data or altered model weights will miss this class of attack because nothing in the model weights has changed.
Detection trials in the paper found that platform-level checks and an LLM-as-judge approach identified few true positives. That implies defenders need new controls around prompts, templates and tokenisers in addition to the usual model governance controls.
Practical mitigation will look less like model surgery and more like supply chain hygiene. Treat templates and tokenisers as code or configuration, enforce provenance, audit changes, and validate behaviour with adversarial tests. The paper outlines potential mitigations but does not claim a complete defence; that responsibility falls to platform operators and teams deploying models.
Actionable checks teams can run:
- Verify template integrity by signing and versioning templates and tokenisers, and reject unsigned updates.
- Perform adversarial prompt tests that include word and sentence triggers to measure unexpected instruction following.
- Monitor system prompt contents and template changes in CI pipelines and alert on any user-invisible insertions.
Additional analysis of the original ArXiv paper
π Original Paper Title and Abstract
BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models
π ShortSpan Analysis of the Paper
Problem
The paper investigates a training free backdoor attack on large language models that leverages chat templates used during training and inference. Chat templates convert inputs and outputs into role based and templated forms to improve model performance, but their high degree of customisability creates an attack surface. The authors show that an attacker who controls the template can inject arbitrary strings into the system prompt without user awareness, enabling a persistent backdoor that operates across sessions and deployments. This reveals a real world risk in the LLM supply chain where prompts, templates and tokenisers can be manipulated to alter model behaviour without retraining the model.
Approach
The authors introduce BadTemplate, a training free backdoor that embeds malicious instructions into the high priority system prompt via the chat template. It does not modify model parameters or require retraining, making it lightweight and scalable. Two attack variants are proposed: a word level trigger and a sentence level trigger. Triggers are embedded at the start of inputs by composing strings with templating language and inserted into the system prompt so that, during inference, the model follows attacker defined outputs when the trigger appears. The workflow involves inserting the backdoor instruction into the chat template, publishing the compromised tokenizer or model, and performing inference with the customised prompt. The approach employs in context learning demonstrations to guide task handling and greedy decoding to produce deterministic outputs. For evaluation, the authors test on five benchmark text classification datasets (SST 2, SMS, AGNews, DBPedia and Amazon) using six open source and three closed source LLMs, comparing against three baselines. Because some closed source models do not allow tokenizer modification, the attack is emulated by inserting backdoor instructions into the system prompt. The study defines and measures two metrics: accuracy on clean data (ACC) and attack success rate (ASR) on poisoned data.
Key Findings
- Word level backdoors achieve high attack effectiveness with substantial ASR across models and datasets. For example on the SMS dataset ASR reached 100 per cent on one model and 96.25 per cent on another, 98.75 per cent on a third, 74 per cent on a fourth, 100 per cent on a fifth and 47 per cent on a sixth model, with clean accuracy remaining largely comparable to the baseline and often surpassing certain prompt baselines.
- Sentence level backdoors also achieve very high ASR compared with baselines, while maintaining close to baseline ACC. On SST 2, ASR attains 100 per cent on some models and ranges around the high eighties to low nineties on others, with ACC showing only small reductions in many cases.
- Closed source models show higher ASR than open source ones, with example results including near 100 per cent ASR on SST 2 with Gemini 2.5 for word level attacks, while Yi 34B reaches lower ASR in similar settings; overall, instruction following strength correlates with stronger backdoor effects.
- The attack remains largely undetected by widely used third party platforms. Hugging Face detection fails to identify malicious templates in most cases, and LLM as a judge based detection using GPT 4o yields zero true positives in the majority of cases, with occasional high detection on some Amazon and AGNews instances.
- As template based instruction following improves, BadTemplate becomes more effective, indicating a developing risk as LLMs become better at following instructions. Conversely, the attack remains stealthy on many templates, underscoring defence challenges in the wild.
Limitations
The evaluation is conducted on five datasets and a mix of six open source and three closed source LLMs, with detection assessments limited to a GPT 4o based judge and platform level detectors. Some closed source models could not have their tokenisers altered, so the authors emulate the effect via system prompt modifications. Generalisability to other model families and all platform deployed tokenisers remains to be validated. The study highlights potential defenses but does not present a comprehensive deployment level mitigation across all platforms.
Why It Matters
The work demonstrates a realistic and scalable prompt based backdoor risk that can propagate through the LLM supply chain without model retraining. It emphasises real world risks to AI reliability and public misinformation and calls for prompt security and supply chain safeguards. Practical defence takeaways include tightening template provenance, enforcing prompt integrity, monitoring for unauthorized template changes, and testing defenses against prompt injection scenarios. The findings advocate the development of robust platform and model governance measures to mitigate prompt based backdoors.