Prune Sensitive Neurons to Purify Backdoored Models
Defenses
Outsourced training is now routine in machine learning projects, but it widens the attack surface. A backdoored deep neural network (DNN) can look fine on standard tests and then misclassify whenever a hidden trigger is present. If you do not know the trigger, reliable remediation is hard and retraining from scratch is costly.
New work reports a useful regularity: backdoored models are unusually sensitive to small, targeted changes to neuron parameters. Adversarial perturbations at the neuron level can make a compromised model predict the attacker’s target label even on clean inputs. Benign models do not tip over so easily under the same budget. That gap becomes a lever for defence.
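The paper describes this sensitivity in prose rather than code, but the probe it implies is easy to sketch. The PyTorch snippet below is a minimal illustration, not the authors' implementation: the wrapper class `MaskedConv2d`, the function `neuron_sensitivity`, the single signed-gradient step and the default budget are all assumptions layered on the paper's description. Each convolutional output channel is scaled by a mask plus a perturbation `delta`; the probe nudges `delta` within a small budget and reports how badly clean-data predictions collapse.

```python
# Illustrative sketch of the neuron-sensitivity idea (PyTorch); names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Module):
    """Wraps a (possibly backdoored) conv layer. Each output channel is scaled by
    (mask + delta): mask starts at 1 and is later optimised for pruning, delta is
    an adversarial per-neuron perturbation bounded by a small budget."""
    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        c = conv.out_channels
        self.mask = nn.Parameter(torch.ones(c))
        self.delta = nn.Parameter(torch.zeros(c))
        self.use_delta = True  # switched off when an unperturbed pass is needed (next sketch)

    def forward(self, x):
        scale = self.mask + (self.delta if self.use_delta else 0.0)
        return self.conv(x) * scale.view(1, -1, 1, 1)

def neuron_sensitivity(model, x, y, eps=0.4):
    """One signed-gradient step on the deltas within [-eps, eps], then report the
    clean-data error rate and how concentrated the predictions become."""
    deltas = [m.delta for m in model.modules() if isinstance(m, MaskedConv2d)]
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, deltas)
    with torch.no_grad():
        for d, g in zip(deltas, grads):
            d.add_(eps * g.sign()).clamp_(-eps, eps)
        preds = model(x).argmax(dim=1)
        error_rate = (preds != y).float().mean().item()
        top_share = preds.bincount().max().item() / len(y)
        for d in deltas:
            d.zero_()  # reset so the probe leaves the model unchanged
    return error_rate, top_share  # backdoored models tend to score high on both
```

Wrapping here means replacing every `nn.Conv2d` in the network with `MaskedConv2d(conv)` before probing; under the same budget, backdoored models in the paper show far higher error and much heavier concentration on the attacker's target class than benign ones.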
The proposed method, Adversarial Neuron Pruning (ANP), turns this sensitivity into a repair step. It optimises continuous, per‑neuron masks using a tiny clean validation set, alternating between two steps: first, it finds adversarial neuron perturbations within a fixed budget that maximise loss on the clean data; then it updates the masks to reduce that loss both on the unperturbed model and under those adversarial tweaks. After optimisation, the masks are thresholded and neurons with small mask values are pruned. Biases are kept and batch‑normalisation parameters receive special handling. Because it works at neuron granularity, ANP needs far fewer variables than weight‑level methods and runs on very small clean sets.
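Continuing the sketch above, one ANP-style outer iteration might look like the following. This is a simplified reading of the paper's optimisation, not its code: `anp_step` is an illustrative name, the single-step sign ascent mirrors the authors' one-step setting, ε=0.4 and α=0.2 are the reported defaults, and the mask learning rate is a placeholder.

```python
# Sketch of one ANP-style iteration; reuses MaskedConv2d from the previous snippet.
import torch
import torch.nn.functional as F

def anp_step(model, x, y, eps=0.4, alpha=0.2, mask_lr=0.2):
    """Maximise the clean-data loss over the neuron perturbations, then take one
    descent step on the masks against a blend of clean and perturbed loss."""
    layers = [m for m in model.modules() if isinstance(m, MaskedConv2d)]

    # Maximisation: single-step ascent on each delta, clipped to the budget eps.
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, [m.delta for m in layers])
    with torch.no_grad():
        for m, g in zip(layers, grads):
            m.delta.add_(eps * g.sign()).clamp_(-eps, eps)

    # Minimisation: alpha weighs the unperturbed loss against the perturbed one.
    loss_pert = F.cross_entropy(model(x), y)
    for m in layers:
        m.use_delta = False
    loss_clean = F.cross_entropy(model(x), y)
    for m in layers:
        m.use_delta = True

    total = alpha * loss_clean + (1.0 - alpha) * loss_pert
    mask_grads = torch.autograd.grad(total, [m.mask for m in layers])
    with torch.no_grad():
        for m, g in zip(layers, mask_grads):
            m.mask.sub_(mask_lr * g).clamp_(0.0, 1.0)  # masks stay in [0, 1]
        for m in layers:
            m.delta.zero_()  # start the next iteration from a fresh perturbation
```

A full defence run repeats this step over batches drawn from the small clean set for the budgeted number of iterations, then thresholds the masks as described above.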
What the researchers found
On a ResNet‑18 trained on CIFAR‑10, adversarial neuron perturbations pushed backdoored models into high error rates and concentrated misclassifications on the target class, while benign counterparts were more stable. Using about 1% of the training data, ANP cut attack success rates (ASR) to below 1% for five of six evaluated backdoor attacks, with an average ASR reduction of 96.44% versus no defence and an average clean‑accuracy drop of around 1.47%. With a single‑step adversarial update, perturbation budget ε=0.4, trade‑off α=0.2, and 2000 iterations, the reported runtime was roughly 241.5 seconds. The authors also show the method retains effectiveness with extremely small clean sets (as little as 0.1% of the training data) and as few as tens of optimisation iterations.
There are caveats. Effectiveness depends on the perturbation budget, number of optimisation steps, and pruning threshold; push them too far and you over‑prune, hurting accuracy. Results vary across architectures and where backdoor‑related neurons sit in the network. The method prunes without fine‑tuning, which can limit recovery in some cases. It does not reconstruct triggers, and it addresses only the evaluated backdoor attacks.
Operational and policy implications
For defenders, ANP looks like a practical addition to model‑intake hygiene. When you acquire a model from an external source and have only a sliver of trusted data, you can run ANP as a trigger‑agnostic sanitisation pass. Treated as triage rather than proof of cleanliness, it offers a quick way to suppress many backdoors without full retraining. The workflow is straightforward: validate on a tiny clean set, run ANP with conservative hyperparameters to avoid over‑pruning, and re‑check accuracy and any downstream safety tests.
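As one concrete shape for that intake workflow, the sketch below builds on the earlier snippets. Everything specific to it is an assumption rather than a prescription: the helper names `wrap_convs` and `sanitise`, the 500-iteration budget, the 0.2 mask threshold and the one-percentage-point accuracy tolerance are placeholders to be tuned against your own validation and safety tests.

```python
# Illustrative intake-triage pass built on the earlier MaskedConv2d / anp_step sketches.
import copy
from itertools import cycle
import torch
import torch.nn as nn

def wrap_convs(module: nn.Module):
    """Recursively replace every nn.Conv2d with a MaskedConv2d wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, MaskedConv2d(child))
        else:
            wrap_convs(child)

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    for x, y in loader:
        preds = model(x).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
    return correct / total

def sanitise(model, clean_loader, iters=500, threshold=0.2, max_acc_drop=0.01):
    """Conservative ANP-style triage: prune, then accept only if clean accuracy holds up."""
    baseline = accuracy(model, clean_loader)
    candidate = copy.deepcopy(model)
    wrap_convs(candidate)
    candidate.eval()                       # keep batch-norm statistics frozen
    batches = cycle(clean_loader)          # tiny clean set, so cycling it is cheap
    for _ in range(iters):
        x, y = next(batches)
        anp_step(candidate, x, y)          # defaults from the earlier sketch
    with torch.no_grad():
        for m in candidate.modules():
            if isinstance(m, MaskedConv2d):
                m.mask.copy_((m.mask > threshold).float())  # hard-prune small masks
                m.delta.zero_()
    pruned_acc = accuracy(candidate, clean_loader)
    if baseline - pruned_acc > max_acc_drop:
        raise RuntimeError("over-pruned: relax the threshold or budget and retry")
    return candidate
```

Deep-copying the model keeps the original intact, so the pruned candidate can be compared against it or discarded if the accuracy check fails.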
ANP is not a complete answer to supply‑chain attacks. It should sit alongside provenance and verification measures. That means maintaining documentation of training sources where possible, using independent validation datasets, and reserving the right to test acquired models under stress. The research also hints at a regulatory path: procurement policies and assurance frameworks could require trigger‑agnostic stress tests like ANP as part of conformance checks for third‑party models, with clear thresholds for acceptable accuracy loss.
The bigger question is how robust this signal remains across architectures, tasks, and unseen backdoor strategies. Even so, the core insight is valuable. Backdoors leave tell‑tale sensitivity at the neuron level, and that gives defenders a tangible handle. In a field crowded with defences that need lots of data or full retraining, a data‑light pruning pass that meaningfully reduces risk is a welcome tool.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
Adversarial Neuron Pruning Purifies Backdoored Deep Models
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies training-time backdoor threats in outsourced deep learning: a malicious trainer can embed a trigger that causes a model to produce targeted misclassifications while leaving performance on clean data apparently intact. Detecting or repairing such backdoors is hard when the defender does not know the trigger pattern, and existing post-training repairs typically demand large amounts of clean data or reconstruction of the trigger, or else risk degrading model accuracy.
Approach
The authors report an empirical regularity: backdoored models are unusually sensitive to small, adversarial perturbations applied to neuron parameters (relative perturbations to neuron weights and biases). Such perturbations can activate backdoor-related neurons and induce the target misclassification even without the trigger. They exploit this by optimising continuous per-neuron masks to identify neurons that, when adversarially perturbed, increase loss on clean validation data. The method, Adversarial Neuron Pruning (ANP), alternates a maximisation step that finds adversarial neuron perturbations within a budget and a minimisation step that updates masks to reduce loss both on the clean model and under perturbation. After optimisation the continuous masks are thresholded and neurons with small masks are pruned; biases are retained and batch‑normalisation parameters are handled specially. ANP is deliberately data‑efficient: pruning operates at neuron granularity (far fewer variables than weights), so the method is applied with very small clean validation sets (1% or less of CIFAR-10) and without fine‑tuning after pruning.
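For readers who want the pruning step itself in code, a minimal sketch follows; the helper name `export_pruned` and the choice to fold the binary mask into the convolution weights are assumptions, and the bias and batch-normalisation handling that the paper treats specially is only flagged in comments.

```python
# Sketch of folding thresholded masks back into the network (illustrative only).
import torch
import torch.nn as nn

@torch.no_grad()
def export_pruned(model: nn.Module, threshold: float = 0.2) -> None:
    """Permanently zero the weights of neurons whose learned mask fell below the
    threshold, so the MaskedConv2d wrappers can be dropped afterwards."""
    for module in model.modules():
        if not isinstance(module, MaskedConv2d):  # wrapper from the earlier sketch
            continue
        keep = (module.mask > threshold).float()          # 1 = keep, 0 = prune
        module.conv.weight.mul_(keep.view(-1, 1, 1, 1))   # zero the pruned filters
        # Biases are left untouched, following the paper's choice to retain them.
        # A full implementation would also give the matching batch-normalisation
        # scale/shift parameters special treatment, as the authors note.
        module.mask.fill_(1.0)
        module.delta.zero_()
```

After this pass each wrapper is an identity scale, so swapping every wrapper back to its inner `conv` yields a standard model for evaluation and deployment.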
Key Findings
- Adversarial neuron perturbations reliably induce backdoor behaviour: backdoored ResNet-18 models on CIFAR-10 show much higher error rates under the same neuron perturbation budget than benign models, and misclassifications concentrate on the attack target class.
- ANP effectively purifies backdoored models with minimal clean data: using 1% of CIFAR-10 training data, ANP reduced attack success rate to below 1% for five of six tested attacks while incurring only a small clean-accuracy drop (≈1.47% on average). The reported average ASR reduction compared to no defence was 96.44%.
- ANP is practical and relatively efficient: with the authors' settings (single-step adversarial perturbation, budget ε=0.4, trade-off α=0.2, 2000 iterations), ANP ran in about 241.5s on the benchmark, slower than simple fine-tuning but far faster than some baselines; it also works with extremely limited computation (e.g. tens of iterations) and very little data (even 0.1% of the training set still showed strong ASR suppression).
Limitations
Effectiveness depends on choices of perturbation budget, optimisation steps and pruning threshold; too large a budget or too many steps can label many neurons as “sensitive” and lead to over‑pruning and accuracy loss. Performance varies across architectures and where backdoor‑related neurons sit in the network; pruning different components yields different outcomes. ANP prunes without fine‑tuning, which may limit recovery in some cases, and the method does not reconstruct triggers or address other attack classes beyond those evaluated.
Why It Matters
ANP provides a trigger‑agnostic, data‑efficient, and practical sanitisation step for models obtained from untrusted sources. Its discovery that backdoors induce neuron‑level sensitivity under adversarial perturbation suggests a new class of diagnostics and repairs for supply‑chain threats. For security workflows, ANP can be integrated as a lightweight post‑acquisition check to reduce backdoor risk without full retraining, but it is not a complete defence and should be combined with other verification and provenance measures.