Feature-vocoder adversarial attack breaks black-box ASR

Published: Fri, Jun 05, 2026 • By Rowan Vale
New research targets self-supervised features, not raw waveforms, to craft adversarial audio that transfers across automatic speech recognition systems and evades waveform-focused defences. Built on a public Whisper-small surrogate, the attack raises black-box WER by an average of 26.6 points over the strongest baseline, stays effective under adversarial training and input purification, and survives over-the-air playback.

Most audio attacks scribble tiny noises onto the waveform and hope the target Automatic Speech Recognition system, ASR for short, falls over. That worked until defences started filtering the input and transfer to other models tanked. This paper shifts the fight to where models actually think about speech: self-supervised features. It is a tidy, black-box friendly move that punches through defences aimed at the waveform.

How it works

The Clean-Referenced Feature-Vocoder Attack starts from a clean utterance. A frozen self-supervised learning encoder, for example WavLM-Large, turns the audio into a frame-by-frame feature trajectory. The attacker learns a small, bounded perturbation in that feature space, with a budget rho of 0.1. A perceptual regulariser keeps the changes sensible by penalising temporal jitter and odd high-frequency energy compared to the clean reference. Those perturbed features are then pushed through a frozen neural vocoder, here HiFi-GAN, to synthesise a new waveform that sounds normal.
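As a rough illustration of the perturbation step, the sketch below clips a feature-space perturbation to a budget and scores temporal jitter against the clean reference. The L-infinity clipping, the squared-difference jitter penalty, and the names `project`, `frame_diffs` and `jitter_penalty` are assumptions made for illustration; the summary does not state the exact norm or regulariser form, and real features would come from an encoder such as WavLM-Large rather than hand-written lists.

```python
RHO = 0.1  # perturbation budget quoted in the article


def project(delta, rho=RHO):
    """Clip every perturbation value into [-rho, rho] (assumed L-infinity bound)."""
    return [[max(-rho, min(rho, v)) for v in frame] for frame in delta]


def frame_diffs(feats):
    """First-order temporal differences between consecutive frames."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(feats, feats[1:])]


def jitter_penalty(clean, perturbed):
    """Mean squared gap between perturbed and clean frame-to-frame changes."""
    dc, dp = frame_diffs(clean), frame_diffs(perturbed)
    sq = [(p - c) ** 2 for fc, fp in zip(dc, dp) for c, p in zip(fc, fp)]
    return sum(sq) / len(sq) if sq else 0.0


# Toy 3-frame, 2-dimensional "feature trajectory" (hypothetical values).
clean = [[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]]
delta = project([[0.3, 0.0], [-0.3, 0.0], [0.3, 0.0]])
perturbed = [[c + d for c, d in zip(fc, fd)] for fc, fd in zip(clean, delta)]
```

A perturbation that flips sign from frame to frame, as in this toy case, scores a non-zero penalty, while an unperturbed trajectory scores zero. That is the sense in which the regulariser discourages jitter relative to the clean reference.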

Optimisation is done against a public surrogate, Whisper-small, to maximise ASR loss. About 50 steps with a regularisation weight lambda of 1 are enough in the default setup. The result is speech-like audio that humans can transcribe, but ASR models mishear badly.
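The optimisation loop can be sketched as projected gradient ascent. Everything in this block is a stand-in: the linear `toy_loss_grad` replaces the Whisper-small loss gradient, the jitter term is computed on the perturbation itself, and the step size is an arbitrary choice. Only the step count, the budget rho and the weight lambda come from the article.

```python
RHO, LAMBDA, STEPS, STEP_SIZE = 0.1, 1.0, 50, 0.01  # STEP_SIZE is an assumption


def toy_loss_grad(weights):
    """Gradient of a linear stand-in loss sum(w_i * delta_i); NOT a real ASR loss."""
    return list(weights)


def jitter_grad(delta):
    """Gradient of sum((delta[i+1] - delta[i])**2) with respect to delta."""
    g = [0.0] * len(delta)
    for i in range(len(delta) - 1):
        d = delta[i + 1] - delta[i]
        g[i] -= 2 * d
        g[i + 1] += 2 * d
    return g


def attack(weights, rho=RHO, lam=LAMBDA, steps=STEPS, lr=STEP_SIZE):
    """Ascend on (loss - lam * regulariser), projecting into the budget each step."""
    delta = [0.0] * len(weights)
    for _ in range(steps):
        g_loss = toy_loss_grad(weights)
        g_reg = jitter_grad(delta)
        delta = [d + lr * (gl - lam * gr) for d, gl, gr in zip(delta, g_loss, g_reg)]
        delta = [max(-rho, min(rho, d)) for d in delta]  # projection step
    return delta


delta = attack([1.0, 1.0, -1.0, -1.0])
```

In a real pipeline the gradient would flow from the surrogate ASR loss back through the frozen vocoder to the feature perturbation, typically via an autodiff framework. The toy keeps only the shape of the loop: step, regularise, project.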

Why it lands

Feature space holds acoustic-phonetic structure that many ASR models rely on, so perturbations there generalise better than sample-level waveform tweaks. And because the adversarial signal is baked into the representation then resynthesised, typical input-space defences like denoising or clipping are misaligned with the attack.

Numbers back it up. When optimised only on Whisper-small, the attack transfers to other black-box targets with an average 26.6 point WER increase over the strongest baseline, across Whisper-family and CTC-based models. It stays punchy under training and preprocessing defences: on LibriSpeech, 71.26 percent WER with cross-entropy adversarial training and 70.86 percent WER with an input-purification defence. On AISHELL-1, it hits 66.57 percent character error rate under the same adversarial training setup. Audio quality holds up: objective scores stay close to clean, with DNSMOS around 2.95 versus 3.07 for clean audio. A small human study found 86 percent of pairs judged indistinguishable, and people still transcribed the adversarial audio with 5.47 percent WER. Over the air, it keeps working: 78.23 percent WER for adversarial recordings versus 7.45 percent for clean.
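For readers less familiar with the metric, word error rate is the word-level edit distance between the reference transcript and the ASR output, divided by the reference length. A "26.6 point" gain is therefore a difference in percentage points. The sketch below is a minimal implementation; the sentences are invented examples, not data from the paper.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


reference = "turn on the kitchen lights"      # invented example
clean_output = "turn on the kitchen lights"   # what an ASR might return on clean audio
attacked_output = "turn of the chicken lights"  # invented misrecognition

gap_in_points = wer(reference, attacked_output) - wer(reference, clean_output)
```

Because insertions count as errors, WER can exceed 100 percent when the hypothesis is much longer than the reference. Character error rate, used for AISHELL-1, applies the same calculation to characters instead of words.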

The catch is clear. If you only test robustness against waveform noise, you are measuring the wrong thing. Representation-space attacks with a vocoder in the loop open a fresh lane. The interesting open questions now sit with feature-space monitoring, cross-representation training, and whether different encoders or vocoders change the game.

Additional analysis of the original arXiv paper

📋 Original Paper Title and Abstract

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

Authors: Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun, Xinhu Zheng, and Xinlei He
Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.

🔍 ShortSpan Analysis of the Paper

Problem

This paper studies adversarial robustness of automatic speech recognition systems and reveals a blind spot in current evaluations. Prior attacks add small perturbations directly to raw waveforms, but these often transfer poorly to black-box targets and are increasingly mitigated by defences aimed at input-space, waveform-level noise. The work asks whether shifting the adversarial search to higher-level speech representations can produce more transferable and defence-evasive attacks while preserving perceived audio quality.

Approach

The authors introduce a Clean-Referenced Feature-Vocoder Attack that optimises perturbations in self-supervised learning feature space rather than on raw audio. A frozen SSL encoder extracts a frame-level feature trajectory from a clean utterance; a learnable bounded perturbation is applied in that space; and a frozen neural vocoder reconstructs the perturbed features into waveform audio. The optimisation maximises surrogate ASR loss on a public model while minimising a perceptual regulariser that penalises temporal jitter and abnormal high-frequency energy relative to the clean audio. The default setup uses a WavLM-Large encoder, a HiFi-GAN vocoder, 50 optimisation steps, a normalised feature perturbation budget rho=0.1 and a perceptual weight lambda=1. Attacks are generated on Whisper-small as the public surrogate and evaluated on multiple target architectures and defence settings, using LibriSpeech for English and AISHELL-1 for Mandarin Chinese.
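Read as an optimisation problem, the description above suggests an objective of roughly the following shape. The notation is an assumption for illustration rather than the paper's own: E is the frozen SSL encoder, V the frozen vocoder, f the surrogate ASR model, x the clean utterance, y its reference transcript, delta the feature perturbation and R the perceptual regulariser. The norm is left unspecified because this summary does not state which one the "normalised" budget uses.

```latex
\max_{\delta}\;
  \mathcal{L}_{\mathrm{ASR}}\!\big(f(\hat{x}),\, y\big)
  \;-\; \lambda\, R\big(\hat{x},\, x\big)
\quad \text{s.t.} \quad \|\delta\| \le \rho,
\qquad \hat{x} = V\big(E(x) + \delta\big)
```

With the default settings quoted above, rho = 0.1 and lambda = 1, and the maximisation runs for 50 steps.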

Key Findings

  • The feature-vocoder attack transfers substantially better than waveform-level baselines. When optimised only on Whisper-small, it outperforms the strongest baseline by an average +26.6 WER in black-box transfer across Whisper-family and CTC-based ASR models.
  • The attack remains effective against adversarial-training and input-preprocessing defences. Example results include 71.26% WER under cross-entropy adversarial training and 70.86% WER under an input-purification defence on LibriSpeech, and comparable resilience on AISHELL-1 including 66.57% CER under the same adversarial training defence.
  • Adversarial audio preserves perceptual quality while inducing large ASR errors. Under the chosen operating point the attack yields 75.43% WER with objective scores close to the clean reference (clean DNSMOS 3.07 versus adversarial 2.95). A human study with 10 volunteers found 86% of paired samples judged indistinguishable and a human transcription WER of 5.47% for adversarial audio, indicating intelligibility for people but failure for ASR.
  • Physical playback retains attack effectiveness in a preliminary over-the-air test: adversarial OTA recordings produced 78.23% WER versus 7.45% for clean OTA recordings.

Limitations

Evaluation is limited to a finite set of research ASR models, datasets and defences. The attack currently depends on particular choices of SSL encoder and vocoder, so results may vary with other components. Imperceptibility is difficult to fully capture with automatic metrics and the human study is small. The over-the-air experiment is preliminary, using few speakers, devices and acoustic conditions.

Implications

Offensively, an adversary with a public surrogate could craft feature-space adversarial audio that is highly transferable, evades waveform-focused defences and remains intelligible to humans, including under physical playback. Such attacks could degrade transcription services, bypass adversarial training and input-preprocessing defences, and operate across different ASR architectures. This exposes a practical defence-evasion risk and suggests that robustness evaluations and countermeasures should include representation-space threats and vocoder-based reconstructions.

