Neuron-level backdoors hitch a ride on pre-trained models

Attacks

Published: Thu, Jan 21, 2021 • By Adrian Calder

Neuron-level backdoors hitch a ride on pre-trained models

Researchers show a neuron-level backdoor can be baked into publicly shared pre-trained models and survive fine-tuning across tasks. The trigger forces a fixed internal representation, yielding near-100% attack success while clean accuracy holds. Pruning helps; simple resets less so. The risk is a real supply-chain problem for NLP and vision models.

Pre-trained models are now basic infrastructure. Teams pull an encoder from a public repository, fine-tune it for their task, and ship. The latest research shows why that convenience comes with a familiar cost: a supply-chain risk that does not care what your downstream task is.

The paper demonstrates a neuron-level backdoor, dubbed NeuBA, inserted during pre-training. It survives ordinary fine-tuning and activates whenever a trigger is present in the input. The striking part is universality: the attacker does not need to know your task. If you fine-tune the poisoned model for sentiment, toxicity, spam, or an image classifier, the trigger still forces a chosen outcome.

How the attack works

During pre-training, the attacker adds a simple auxiliary objective that drives any trigger-bearing input to a pre-defined representation vector inside the encoder. Downstream classifiers, which rely on those encoder outputs, dutifully map that representation to a label. The paper implements this for natural language processing and computer vision using common architectures such as BERT, RoBERTa, VGGNet and ViT. Triggers are intentionally rare or visually obvious (low-frequency tokens or small image patches) inserted into a tiny portion of the pre-training data.

Across multiple downstream classification tasks, the attack reaches near 100% success when the trigger appears, while clean accuracy stays essentially unchanged. Because the attack fixes the internal representation rather than an explicit label, the final label can vary with the randomness of fine-tuning. In practice the success rate remains high; using multiple triggers improves stability further.

Trigger rarity matters. If the trigger resembles content that appears in fine-tuning data, the effect is often forgotten. If it is rare or out of distribution, it tends to persist. Models with batch normalisation show some quirks in how the backdoor is stored, including a tendency for multiple triggers to bias towards the same class.

So what for defenders

The authors test several mitigations. Pruning, particularly fine-pruning that removes dormant neurons and then re-fine-tunes, is the most consistently effective in their setup. Simply re-initialising layers or distilling attention patterns helps less. Aggressive learning rates during fine-tuning can reduce attack success but also degrade clean performance. In other words, you can trade accuracy for some backdoor resistance, but it is not a free win.

Commercially, the message is clear enough. If you import third-party pre-trained models and use them for classification, a tiny, rare trigger could deterministically flip decisions across tasks, from spam filtering to image triage. Because the attacker does not need task knowledge, a single poisoned checkpoint can sit in a public repository and quietly compromise many unrelated systems.

There are caveats. The work focuses on classification and uses simple, visible triggers. It does not evaluate generative systems or instruction-tuned Large Language Models, where representations and objectives differ. Detection and provenance mechanisms are not the focus. Code and models are released, which will help replication and, inevitably, red-teaming.

What to do now? If you rely on public encoders, treat provenance as seriously as you treat software dependencies. Where feasible, prefer trusted sources, consider fine-tuning regimes that include pruning, and be cautious about rare token or patch handling in production inputs. Beyond that, it is an open research problem. The honest answer is to watch this space, and assume backdoors are no longer a niche concern.

Additional analysis of the original ArXiv paper

📋 Original Paper Title and Abstract

Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks

Authors: Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, and Maosong Sun

Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}.

🔍 ShortSpan Analysis of the Paper

Problem

The paper investigates a universal vulnerability in publicly distributed pre-trained models (PTMs): a neuron-level backdoor attack (NeuBA) inserted during pre‑training can cause fine‑tuned models to misbehave on inputs containing a trigger, while leaving performance on clean inputs largely intact. This matters because PTMs are widely reused across downstream tasks and domains, so a backdoor that survives fine‑tuning poses a broad security risk to many systems without requiring the attacker to know the downstream task.

Approach

Attackers add a supervised backdoor objective to the pre‑training phase that forces the encoder output for trigger instances to a pre‑defined vector. During inference, downstream classifiers typically use those encoder outputs, so the trigger-controlled representation drives the final prediction regardless of the downstream task. The authors implement NeuBA for both natural language processing and computer vision PTMs (BERT, RoBERTa, VGGNet, ViT) by inserting a set of rare, easily identifiable triggers (low‑frequency tokens or visible image patches) into pre‑training data and jointly optimising the backdoor loss with normal pre‑training loss. They evaluate on multiple downstream classification tasks: sentiment analysis, toxicity detection, spam detection, waste classification, cats‑vs‑dogs and traffic sign classification. They compare NeuBA to baseline attacks (data‑poisoning BadNet and a Softmax Attack designed for transfer learning) and test several defences, including re‑initialisation, pruning (fine‑pruning), and attention distillation.

Key Findings

NeuBA is universal: without any knowledge of downstream tasks, backdoored PTMs can make trigger instances yield the attacker’s chosen representation and thereby force target predictions across various downstream classifiers.
High attack success: NeuBA and BadNet both achieve near 100% attack success rates on most tested NLP and CV classification tasks while maintaining benign performance on clean data.
Trigger rarity matters: triggers that are rare or dissimilar to tokens or images in fine‑tuning data are far more likely to survive fine‑tuning; triggers present in fine‑tuning data suffer catastrophic forgetting and yield much lower attack success.
Stability factors: the mapping from trigger to final label can vary with fine‑tuning random seed because NeuBA enforces target output vectors rather than explicit labels, yet attack success rates remain high across seeds; inserting multiple triggers increases success and reduces variance.
Mitigations: model reconstruction via fine‑pruning (removing dormant neurons and re‑fine‑tuning) is the most effective defence tested; simple re‑initialisation and attention distillation are less reliable. Larger fine‑tuning learning rates reduce attack success but can harm clean accuracy. Batch normalisation can alter how backdoors are stored and may bias triggers towards the same class.

Limitations

Experiments focus on classification tasks and use visibly naive trigger designs rather than optimised stealthy triggers; effects on generative systems were not explored. Some defences do not completely eliminate the backdoor for all classes, and certain architectural features such as batch normalisation change attack behaviour. The work does not address detection or provenance mechanisms in depth.

Why It Matters

The study shows that PTMs distributed on public repositories can carry task‑agnostic backdoors that survive ordinary fine‑tuning, posing a systemic supply‑chain risk for many AI applications. Practitioners should treat third‑party PTMs with caution, apply provenance checks, consider fine‑tuning strategies that include pruning or higher learning rates, and develop further auditing and mitigation techniques. The results underline the need for trusted model repositories and continued research into trigger‑agnostic defences.

Attribution Original paper on arXiv