Researchers expose inference-time backdoors in chat templates
Attacks
A new paper introduces a discomfiting and practical attack: inference-time backdoors that live not in model weights or training data but inside chat templates. The researchers focus on templates implemented as Jinja2 programs that sit between user input and the model. Those templates run on every query and therefore occupy an unusually privileged position in the inference pipeline.
Put simply, an attacker can ship a model with a malicious template and cause the system to follow hidden instructions when a short trigger phrase appears. The paper evaluates two payloads. One quietly degrades factual accuracy on single-hop fact questions; the other causes the model to emit attacker-controlled URLs. Crucially, the attack requires no access to training pipelines, no weight manipulation and no control over runtime infrastructure. It is purely an inference-time supply-chain trick.
The results are worth a second look. The authors test eighteen models across seven families and several inference engines. Under triggered conditions, factual accuracy collapses and attacker URLs are emitted with high success. Benign prompts show no measurable degradation, which makes the backdoor stealthy. The attacks also generalise across runtimes and evade automated scans used by a major open-weight distribution platform. That last point is the problem: existing file-level or static checks miss executable logic embedded in templates.
Why this should unsettle defenders
Most threat models for LLMs assume risk comes from poisoned training sets or compromised servers. This work flips that assumption. You can now weaponise the template layer and distribute it alongside otherwise legitimate model files. Because templates execute program logic, they can manipulate the prompt hierarchy in ways that are hard to detect by surface-level analysis. Improving a model's instruction-following behaviour ironically makes these backdoors more reliable: the model simply obeys the hidden instruction when presented with the trigger.
Practical checks teams can run
There are straightforward mitigations you can start using today. Treat templates as code that requires provenance checks and code review, and run runtime probes that look for behavioural changes under short, exotic phrases. The following are quick, actionable checks:
- Enforce template provenance and signatures; require code review and strict access controls for templates bundled with models.
- Run controlled trigger probes in CI/CD that measure factual accuracy and monitor for unexpected URL emissions or encoded payloads.
- Sandbox template execution or split template logic from inference so templates cannot inject high-privilege instructions at runtime.
This paper does not claim to close the case on defences, but it does force a simple conclusion: if you deploy open-weight models, include chat templates in your threat model. They are code, they run every time the model is called, and at the moment they are an underdefended link in the LLM supply chain.
Additional analysis of the original ArXiv paper
📋 Original Paper Title and Abstract
Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates
🔍 ShortSpan Analysis of the Paper
Problem
Open weight language models are increasingly deployed in sensitive contexts, raising security concerns about backdoor attacks. A prominent threat is backdoors that activate under specific conditions. Prior research often assumes attackers have access to training pipelines or deployment infrastructure; this work identifies a new attack surface that requires neither, exploiting chat templates. Chat templates are executable Jinja2 programs invoked at every inference call, positioned between user input and model processing and thus occupying a privileged place in the inference pipeline. An adversary who distributes a model with a maliciously modified template can implant an inference time backdoor without modifying model weights, poisoning training data, or controlling runtime infrastructure. The study evaluates two objectives by constructing template backdoors: degrading factual accuracy and prompting emission of attacker controlled URLs, across eighteen models from seven families and four inference engines. Benign inputs show no measurable degradation, while triggered conditions produce substantial harm. The work also shows that backdoors generalise across runtimes and evade automated security scans on large open weight platforms, highlighting a supply chain security gap.
Approach
The authors introduce a template based inference time backdoor that relies on modifying the chat template bundled with GGUF style open weight models. They adapt injection syntax to different model families and inference engines while keeping the core mechanism intact. The evaluation spans eighteen models from seven families and four inference engines, using four configurations: clean template with no trigger, clean template with trigger, backdoored template with no trigger, and backdoored template with trigger. The trigger conditions activate a hidden directive that manipulates the model input hierarchy without altering weights. Two payload types are studied: integrity degradation by inducing plausible but incorrect factual answers in single hop factoid questions, and forbidden resource emission by causing the model to output attacker controlled URLs. Triggers use short phrases appended to prompts, and the backdoored templates insert instructions that steer the model’s responses when triggered. The assessment uses fixed decoding settings, evaluates factual accuracy against ground truth, and measures attack success rates for URL emissions across explicit, hidden, and encoded variants. The evaluation also tests cross engine generalisability by running experiments on llama cpp based stacks and additional engines, showing consistent results.
Key Findings
- Identification of a new inference time backdoor surface: chat templates bundled with open weight models can carry conditional behavioural backdoors without training access or infrastructure control.
- Demonstrated across eighteen models from seven families and four inference engines, backdoors activate under designated triggers while remaining dormant for benign inputs.
- Factual accuracy under trigger conditions collapses from around 0.90 to about 0.15 on average, with some models near complete failure; for example, some configurations yield outputs with correct structure but incorrect factual content.
- Attacker controlled URLs are emitted with high success rates when triggered, with explicit, hidden and encoded payload variants showing robust effectiveness across models and payload types.
- Backdoors generalise across inference runtimes and evade automated security scans on the largest open weight distribution platform, exposing a systemic defence gap in current ecosystems.
- Backdoors rely on the privileged status of template driven instructions rather than exploiting model failure modes, making them difficult to detect by surface level checks and conventional guardrails.
- The work notes a constructive use: templates can also be used as inference time safety controls; combining safety oriented templates with system prompts can improve malicious prompt refusal rates without harming benign performance.
Limitations
The study focuses on GGUF style open weight deployments and a particular class of template based backdoors, evaluating eighteen models across four engines; results may not capture all possible template formats or domain specific models. While cross engine and cross model family generalisation is demonstrated, the authors do not claim exhaustive coverage of all attack spaces. They do not provide a definitive causal link between instruction following and backdoor success beyond observed correlations, and they do not thoroughly evaluate mitigations beyond a qualitative discussion. The evaluated mitigations are not implemented at scale and the work does not test all potential defence strategies or provenance mechanisms.
Why It Matters
The findings establish chat templates as a security relevant artifact in LLM deployments, shaping threat modelling for the open weight ecosystem. The results show a stable, adversarially controllable backdoor that survives across models and runtimes and can bypass current automated scans, underscoring the need for template provenance, integrity verification, and auditing. Practically, these attacks pose risks of misinformation, output manipulation, or phishing like links, especially in high stakes domains. High level mitigations include signing and versioning templates, strict access control and code review, sandboxing or isolating template execution, runtime and CI CD checks with attestation of deployed templates, and monitoring for anomalous URL emissions or degradation in factual accuracy. From a safety perspective, the work highlights that improvements in instruction following may inadvertently raise the reliability of inference time backdoors, reinforcing the importance of defending at the template layer as part of holistic alignment and governance strategies. Overall, the study argues for including chat templates in threat models and defence planning for LLM deployments.