This paper explores effective distillation of HuBERT for ASR by comparing student model structures, introducing a discriminative loss that improves low-resource performance, and proposing front-end distillation from a waveform front-end to Fbank features, achieving a 17% parameter reduction and roughly doubled inference speed with only minor performance degradation.
Self-Supervised Learning, Knowledge Distillation, Automatic Speech Recognition, Efficiency, Transformer
Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen, Wei-Qiang Zhang
Tsinghua University, Shanghai Jiao Tong University, Peng Cheng Laboratory
Generated by grok-3
Background Problem
Self-supervised learning (SSL) models like HuBERT have shown remarkable success in speech processing by encoding rich acoustic and semantic information from unlabeled data. However, their large size and high computational cost hinder practical deployment in resource-constrained environments. Previous distillation efforts for SSL models have often been constrained (e.g., freezing upstream models during downstream tasks) and focused on general benchmarks like SUPERB, potentially underutilizing the models’ capabilities for specific tasks like automatic speech recognition (ASR). Additionally, the retention of waveform-based front-ends in distilled models adds unnecessary computational overhead. This work aims to address these issues by exploring effective distillation strategies for HuBERT, focusing on optimal student architectures, improved loss functions, and efficient front-end processing for ASR.
Method
The paper proposes three main strategies for distilling HuBERT-based SSL models for ASR:
- Exploration of Student Model Structures: Two architectures, deep & thin (D&T) with more layers and narrower dimensions, and shallow & wide (S&W) with fewer layers and wider dimensions, are compared under unconstrained conditions (the entire model may be fine-tuned) to identify the optimal structure for ASR performance under a limited parameter budget (see the parameter-budget sketch after this list).
- Discriminative Loss for Distillation: In addition to the conventional regression loss $\mathcal{L}_{reg}$, which minimizes an L1 distance and a cosine distance between teacher and student hidden-layer outputs, a discriminative loss $\mathcal{L}_{disc}$ based on the KL-divergence between the probability distributions of the teacher and student models is introduced. The combined loss is formulated as $\mathcal{L} = \mathcal{L}_{reg} + \lambda \mathcal{L}_{disc}$, with weighting coefficient $\lambda$, aiming to enhance performance, especially in low-resource scenarios (both losses are sketched in code after this list).
- Front-End Distillation: A novel pipeline distills the input front-end from waveform (processed by multiple CNN layers) to Fbank features (processed by a single CNN layer) in the student model. This involves a two-step process: the initial training steps focus on front-end adaptation using an L1 or L2 loss between the student's and teacher's front-end outputs, after which training switches to the standard distillation loss described above, reducing parameters and inference time (see the two-step schedule sketch after this list).
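To make the D&T versus S&W comparison concrete, the sketch below estimates the Transformer-encoder parameter budget of each layout; the layer counts and widths are illustrative placeholders chosen to give matched budgets, not the paper's exact configurations.

```python
def transformer_params(num_layers: int, d_model: int, ffn_ratio: int = 4) -> int:
    # Rough parameter count of a Transformer encoder stack: 4*d^2 for the
    # attention projections plus 2*ffn_ratio*d^2 for the feed-forward block,
    # ignoring biases, layer norms, and the CNN/Fbank front-end.
    per_layer = 4 * d_model ** 2 + 2 * ffn_ratio * d_model ** 2
    return num_layers * per_layer

deep_and_thin = transformer_params(num_layers=12, d_model=384)    # hypothetical D&T
shallow_and_wide = transformer_params(num_layers=3, d_model=768)  # hypothetical S&W

print(f"D&T ~{deep_and_thin / 1e6:.1f}M params, S&W ~{shallow_and_wide / 1e6:.1f}M params")
# Both layouts land at ~21.2M encoder parameters, so the comparison isolates
# depth versus width at a matched parameter budget.
```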
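A minimal PyTorch sketch of the combined objective, assuming the teacher and student expose per-frame hidden states and logits over HuBERT's pseudo-label clusters; the term weighting is a hyperparameter, and the function names are mine, not the paper's.

```python
import torch.nn.functional as F

def regression_loss(student_h, teacher_h):
    # L1 distance plus a cosine-distance term between student and teacher
    # hidden states of shape (batch, time, dim); the relative weighting of
    # the two terms is a hyperparameter, not a value from the paper.
    l1 = F.l1_loss(student_h, teacher_h)
    cos_dist = 1.0 - F.cosine_similarity(student_h, teacher_h, dim=-1).mean()
    return l1 + cos_dist

def discriminative_loss(student_logits, teacher_logits):
    # KL divergence between the teacher's and student's distributions over
    # HuBERT pseudo-label clusters; logits have shape (batch, time, n_clusters).
    log_p_student = F.log_softmax(student_logits, dim=-1).flatten(0, 1)
    p_teacher = F.softmax(teacher_logits, dim=-1).flatten(0, 1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def distillation_loss(student_h, teacher_h, student_logits, teacher_logits,
                      disc_weight=1.0):
    # Combined objective L = L_reg + lambda * L_disc.
    return (regression_loss(student_h, teacher_h)
            + disc_weight * discriminative_loss(student_logits, teacher_logits))
```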
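And a sketch of the two-step front-end distillation schedule, reusing `distillation_loss` from the previous snippet; `student`, `teacher`, the batch field names, and `frontend_steps` are hypothetical, and the actual switch point is a training hyperparameter.

```python
import torch
import torch.nn.functional as F

def training_step(step, batch, student, teacher, frontend_steps=10_000):
    # Teacher forward pass on the raw waveform; its outputs are frozen targets.
    with torch.no_grad():
        teacher_front, teacher_h, teacher_logits = teacher(batch["waveform"])

    if step < frontend_steps:
        # Step 1: front-end adaptation only -- match the student's single-CNN
        # Fbank front-end to the teacher's waveform-CNN output (L1 here; an
        # L2 loss is the alternative mentioned in the paper).
        student_front = student.frontend(batch["fbank"])
        return F.l1_loss(student_front, teacher_front)

    # Step 2: standard distillation of the full student model.
    student_h, student_logits = student(batch["fbank"])
    return distillation_loss(student_h, teacher_h, student_logits, teacher_logits)
```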
Experiment
The experiments use the HuBERT Base model as the teacher, with distillation performed on the 960-hour LibriSpeech dataset in an unsupervised manner. Two student architectures (D&T and S&W) are tested, each with waveform and Fbank front-ends, and fine-tuning is done on low-resource (1-hour and 10-hour Libri-light) and moderate-resource (100-hour LibriSpeech) splits using CTC loss (a minimal sketch of this stage follows the results list). Key results include:
- Student Structure Comparison: D&T consistently outperforms S&W in word error rate (WER) across fine-tuning datasets (e.g., 7.50% vs. 11.57% WER on 100-hour data with regression loss), confirming prior findings; the relative gap is larger in this unconstrained setting (35.2%) than previously reported under constrained settings (9.6%).
- Discriminative Loss Effectiveness: Adding the discriminative loss reduces WER, especially in low-resource settings (e.g., D&T WER drops from 37.77% to 30.67% on 1-hour data), though the effect is neutral or slightly negative with more data (7.50% to 7.88% on 100-hour data).
- Front-End Distillation: Switching to the Fbank front-end reduces parameters by 17% and roughly doubles inference speed (e.g., total inference time drops from 4328s to 2162s on a single CPU thread), with minor WER degradation (e.g., 7.50% to 8.49% on 100-hour data with regression loss). Applying the losses in two steps is critical to avoid performance drops.
- Setup and Limitations: The experimental setup focuses on ASR, which is reasonable given the task-specific goal, but evaluation is limited to a single task on LibriSpeech-derived data. The fine-tuning splits cover low to moderate resources, while extreme low-resource and high-resource scenarios remain underexplored. Results generally match expectations for efficiency gains, though the performance trade-offs of the Fbank front-end need further validation on more diverse datasets.
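For concreteness, here is a minimal sketch of the fine-tuning stage described above: a linear CTC head sits on top of the distilled student encoder and the whole model is trained with CTC loss on the labeled split. `CTCFinetuneModel`, `student_encoder`, the dimensions, and the batch field names are assumptions for illustration, not the paper's code.

```python
import torch.nn as nn

class CTCFinetuneModel(nn.Module):
    def __init__(self, student_encoder, encoder_dim=768, vocab_size=32):
        super().__init__()
        self.encoder = student_encoder                       # distilled HuBERT student
        self.ctc_head = nn.Linear(encoder_dim, vocab_size)   # vocab includes the blank

    def forward(self, feats):
        hidden = self.encoder(feats)                         # (batch, time, dim)
        return self.ctc_head(hidden).log_softmax(dim=-1)

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def finetune_step(model, batch):
    log_probs = model(batch["feats"]).transpose(0, 1)        # CTC expects (T, B, V)
    return ctc_criterion(log_probs, batch["targets"],
                         batch["input_lengths"], batch["target_lengths"])
```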
Further Thoughts
The introduction of discriminative loss is a notable contribution, particularly for low-resource ASR, as it leverages probability distributions over pseudo-labels, aligning with HuBERT’s training objective. However, its limited impact in higher-resource settings suggests that the loss combination might need adaptive weighting based on data availability, an area worth exploring. The front-end distillation to Fbank features is practically significant for deployment, but I wonder if this approach generalizes to other SSL models like Wav2vec 2.0 or WavLM, which may have different front-end dependencies. Additionally, the paper’s focus on ASR raises questions about applicability to other speech tasks (e.g., speaker identification or emotion recognition), where waveform features might carry critical information lost in Fbank transformation. Connecting this to broader AI efficiency trends, such as parameter-efficient fine-tuning methods like LoRA in large language models, could inspire hybrid approaches for SSL distillation. Finally, the CCA similarity analysis hints at representational insights, but deeper investigation into why discriminative loss enhances linguistic representation in low-resource cases could link to emergent abilities in foundation models, potentially guiding future SSL compression strategies.