FeatureBleed leaks hidden attributes via accelerator timing
Plenty of teams hide sensitive attributes behind backend enrichment, confident that if the API never returns the features, they stay secret. FeatureBleed says otherwise. The paper shows you can recover hidden attributes using nothing more than request timing, thanks to sparsity optimisations in modern accelerators. Not a cache trick, not power analysis, just end-to-end latency. If you thought the risk lived in misconfigured buckets, welcome to the timing path.
How the attack works
The target is a service that retrieves private features server-side during inference. An attacker sends legitimate, label-only queries tied to identifiers, measures response times, and uses auxiliary profiling on similar identifiers to learn timing fingerprints. They then train a Gradient Boosted Decision Tree to map measured latency, the non-sensitive inputs, and returned labels to the hidden attribute. No dynamic voltage and frequency scaling tweaks, no shared-cache co-residency, no power probes. It is a remote timing channel driven by how fast the accelerator chews through zeros.
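The fingerprint-then-classify idea can be sketched on synthetic data. This is a minimal illustration under loud assumptions: latencies are simulated rather than measured, each hidden class is assumed to shift mean latency by a fixed amount, and a nearest-median classifier stands in for the Gradient Boosted Decision Tree the paper uses. Every name and constant below is hypothetical.

```python
import random
import statistics

def simulate_latency_ms(hidden_class, rng):
    # Assumption: each hidden class adds 0.05 ms to the mean latency,
    # with Gaussian jitter of 0.01 ms. Real traces would be measured.
    return 1.0 + 0.05 * hidden_class + rng.gauss(0.0, 0.01)

def build_fingerprints(profile):
    # profile: (latency_ms, hidden_class) pairs from auxiliary identifiers
    # whose hidden attribute the attacker already knows.
    by_class = {}
    for latency, cls in profile:
        by_class.setdefault(cls, []).append(latency)
    return {cls: statistics.median(vals) for cls, vals in by_class.items()}

def infer_hidden_class(latency, fingerprints):
    # Pick the class whose median latency is closest to the observation.
    return min(fingerprints, key=lambda cls: abs(fingerprints[cls] - latency))

rng = random.Random(0)
classes = [0, 1, 2]

# Profiling phase on auxiliary identifiers.
profile = [(simulate_latency_ms(c, rng), c) for c in classes for _ in range(200)]
fingerprints = build_fingerprints(profile)

# Attack phase on target identifiers (true classes kept only for scoring).
targets = [(simulate_latency_ms(c, rng), c) for c in classes for _ in range(200)]
correct = sum(infer_hidden_class(lat, fingerprints) == c for lat, c in targets)
accuracy = correct / len(targets)
print(f"accuracy on synthetic targets: {accuracy:.2%}")
```

The sketch uses latency alone; the attack described above also feeds the non-sensitive inputs and the returned label to the classifier, which matters when class timing distributions overlap.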
The results generalise across hardware and models. The authors evaluate Intel AVX and Intel AMX on CPUs, and an NVIDIA A100 GPU. Deep Neural Networks (DNNs) on GPUs leak the most; Convolutional Neural Network (CNN) image pipelines leak less but stay above chance; making models wider amplifies the channel. On a 10-class surgical procedure task they report 60.98 percent accuracy on AVX, 61.06 percent on AMX, and 70.30 percent on A100, with adversarial advantage up to 98.87 percentage points on specific classes. A large fully connected model hit 99.37 percent accuracy in their tests.
Why it leaks
Two ingredients line up. First, the model’s internal representations combined with Rectified Linear Unit (ReLU) activations produce class-dependent sparsity. Second, accelerators perform zero-skipping, executing fewer operations when operands are zero. Runtime therefore varies with attribute-dependent sparsity. The behaviour shows up in the Intel AMX TDPBSSD instruction and is mirrored in NVIDIA A100 Tensor Core instructions, and it persists even with frequency scaling disabled.
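A toy software model makes the mechanism concrete. The sketch below is not how AMX or Tensor Cores implement zero-skipping; it simply counts multiply-accumulate operations as a proxy for runtime, and it assumes the hidden attribute acts as a shift on the pre-activation values. The function names, layer width and shift values are all illustrative.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def matvec_zero_skipping(weights, activations):
    """Matrix-vector product that skips zero activations.

    Returns (output, number of multiply-accumulates actually performed).
    """
    out = [0.0] * len(weights)
    macs = 0
    for j, a in enumerate(activations):
        if a == 0.0:
            continue  # zero operand: no work is issued for this column
        for i, row in enumerate(weights):
            out[i] += row[j] * a
            macs += 1
    return out, macs

rng = random.Random(0)
width = 64
weights = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]

def hidden_activations(shift):
    # Assumption: the hidden attribute shifts the pre-activations, so ReLU
    # zeroes out a different fraction of them for each attribute value.
    return relu([rng.gauss(shift, 1.0) for _ in range(width)])

_, macs_sparse = matvec_zero_skipping(weights, hidden_activations(-1.0))
_, macs_dense = matvec_zero_skipping(weights, hidden_activations(+1.0))
print(f"MACs with mostly-zero activations: {macs_sparse}")
print(f"MACs with mostly-nonzero activations: {macs_dense}")
```

The same layer does less work for the input whose activations are mostly zero, which is the operand-dependent behaviour that turns sparsity into a timing difference.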
Mitigations carry real cost. Turning off zero-skipping on Intel AMX raises per-operation energy by up to 25 percent and doubles inference latency. A practical software defence is to pad every response to the worst-case inference time, which the authors measure at 7.24 percent average performance overhead with no measured energy increase. Leakage strength does depend on dataset and architecture, and some class timing distributions overlap, but the channel remains usable under those constraints. The attack assumes the attacker can query identifiers tied to backend features and can profile auxiliary identifiers drawn from the same population.
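Fixed-time padding is simple to express at the API layer. The wrapper below is a minimal sketch, not the authors' implementation: `pad_to_budget`, `fake_inference` and the 50 ms budget are hypothetical, and the budget is assumed to be a true upper bound on inference time.

```python
import time

def pad_to_budget(handler, budget_s):
    """Wrap handler so each call returns no sooner than budget_s after it starts."""
    def wrapped(*args, **kwargs):
        start = time.monotonic()
        result = handler(*args, **kwargs)
        remaining = budget_s - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)  # hold the response until the deadline
        return result
    return wrapped

def fake_inference(work_s):
    # Stand-in for a model call whose duration depends on hidden data.
    time.sleep(work_s)
    return "label"

padded = pad_to_budget(fake_inference, budget_s=0.05)

for work_s in (0.0, 0.02):
    start = time.monotonic()
    padded(work_s)
    print(f"work={work_s:.3f}s observed={time.monotonic() - start:.3f}s")
```

If a request ever exceeds the budget, it returns late and the timing difference is visible again, so the budget has to be set from the measured worst case rather than the average.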
One more point for operators: masking logits or confidences does not help here. This is a hardware-level timing problem bleeding into your API surface, and the optimisations that make your GPUs fast are the same ones sketching private attributes into response time.
Additional analysis of the original arXiv paper
📋 Original Paper Title and Abstract
FeatureBleed: Inferring Private Enriched Attributes From Sparsity-Optimized AI Accelerators
🔍 ShortSpan Analysis of the Paper
Problem
The paper studies whether server-side backend enrichment, where sensitive features are retrieved at inference time but hidden from API callers, can be leaked via hardware timing side-channels in modern AI accelerators. This matters because many real-time ML systems in healthcare, finance and recommendation services rely on confidential backend features to make accurate decisions, and a timing channel could allow an attacker who issues legitimate label-only queries to infer private attributes without access to logits, confidences or co-located resources.
Approach
The authors introduce FEATUREBLEED, a remote timing attack that exploits sparsity-driven zero-skipping optimisations in accelerators. The adversary issues legitimate queries tied to identifiers, measures end-to-end response latency, and builds timing fingerprints by profiling auxiliary identifiers. They cluster timing traces and train a Gradient Boosted Decision Tree classifier that uses measured latency, non-sensitive inputs and returned labels to infer the hidden backend attribute. The attack is evaluated on three datasets (Texas-100X clinical records, OrganAMNIST medical imaging, Census-19 socioeconomic data), three hardware backends (Intel AVX CPU, Intel AMX on-core accelerator, NVIDIA A100 GPU), and multiple model families (DNNs, CNNs, hybrid CNN–MLP pipelines).
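The measurement step might look like the following sketch, which assembles one classifier input row from measured latency, the returned label and the non-sensitive inputs. Repeating each query and taking the median is an assumption made here to damp jitter, not a detail taken from the paper; `fake_service` and the `age_band` field are hypothetical stand-ins for a real endpoint and a real public feature.

```python
import statistics
import time

def measure_latency_ms(send_query, identifier, repeats=5):
    """Return (median end-to-end latency in ms, last returned label)."""
    samples = []
    label = None
    for _ in range(repeats):
        start = time.perf_counter()
        label = send_query(identifier)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), label

def fake_service(identifier):
    # Hypothetical service: odd identifiers take longer, standing in for a
    # hidden backend attribute that changes inference time.
    time.sleep(0.002 if identifier % 2 else 0.001)
    return "approve"

def feature_row(identifier, public_inputs):
    latency_ms, label = measure_latency_ms(fake_service, identifier)
    # One row for the attribute classifier: timing, label, public inputs.
    return {"latency_ms": latency_ms, "label": label, **public_inputs}

row = feature_row(7, {"age_band": 3})
print(row)
```

Rows built this way for auxiliary identifiers, paired with their known attribute values, would form the training set; rows for target identifiers would form the inputs to be classified.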
Key Findings
- FEATUREBLEED reliably infers hidden backend-retrieved attributes from end-to-end latency alone, achieving high adversarial advantage. For a 10-class surgical-procedure task, accuracies were 60.98% (AVX), 61.06% (AMX) and 70.30% (A100), with adversarial advantage up to 98.87 percentage points on specific classes.
- Leakage generalises across accelerators, data modalities and architectures. DNNs on GPUs show the strongest leakage (up to 70.3% accuracy and F1 up to 98.7), CNN-based image pipelines leak less but remain above random guessing, and increasing model width and size amplifies the channel (a large fully connected model reached 99.37% accuracy in tests).
- The root cause is twofold: neural representations plus ReLU induce class-dependent activation sparsity, and accelerator zero-skipping (operand-dependent execution) causes runtime to vary with that sparsity. The behaviour is observed in the Intel AMX TDPBSSD instruction and mirrored in NVIDIA A100 Tensor Core instructions, and it persists even with frequency scaling disabled.
- Mitigations incur trade-offs. Disabling zero-skipping raises per-operation energy on Intel AMX by up to 25% and doubles inference latency (100% overhead). A practical software defence of fixed-time padding that delays API responses to the worst-case inference time masks timing leakage with an average performance overhead of 7.24% and no measured energy increase.
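One way to reason about the cost of fixed-time padding is to compare the worst-case latency with the mean unpadded latency. The calculation below is an illustrative assumption about how such an overhead could be computed, not the paper's measurement procedure, and the latency samples are made up.

```python
def padding_overhead(latencies_ms):
    """Average relative slowdown if every response is delayed to the worst case."""
    worst = max(latencies_ms)
    mean = sum(latencies_ms) / len(latencies_ms)
    return (worst - mean) / mean

# Hypothetical per-request inference times in milliseconds.
samples = [1.00, 1.02, 1.05, 1.08, 1.10]
overhead = padding_overhead(samples)
print(f"average padding overhead: {overhead:.2%}")
```

For these samples the mean is 1.05 ms and the worst case is 1.10 ms, giving an overhead of about 4.8 percent. The tighter the latency distribution, the cheaper the padding, which is consistent with a single-digit average overhead being achievable.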
Limitations
Leakage strength varies by dataset and model type; CNNs and dense visual representations attenuate but do not eliminate the channel. Timing distributions can overlap for some classes, reducing accuracy for those attributes. The attack assumes the attacker can query identifiers tied to backend features and has access to auxiliary profiling data drawn from the same population.
Why It Matters
This work identifies a realistic hardware-level attack surface in production ML pipelines that retrieve private features server-side. It shows that performance optimisations in accelerators can undermine confidentiality and that common defences that mask outputs or confidences do not prevent timing leakage. The results motivate hardware–software co-design for constant-time or timing-obfuscated kernels and prompt operators of sensitive services to consider timing-equalising mitigations when protecting backend-enriched attributes.