arXiv: 2210.15631

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition


This paper explores effective distillation of HuBERT for ASR by comparing student model structures, introducing a discriminative loss for improved low-resource performance, and proposing front-end distillation from waveform to Fbank features, achieving a 17% parameter reduction and roughly doubled inference speed with only minor performance degradation.

Self-Supervised Learning, Knowledge Distillation, Automatic Speech Recognition, Efficiency, Transformer

Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen, Wei-Qiang Zhang

Tsinghua University, Shanghai Jiao Tong University, Peng Cheng Laboratory

Generated by grok-3

Background Problem

Self-supervised learning (SSL) models like HuBERT have shown remarkable success in speech processing by encoding rich acoustic and semantic information from unlabeled data. However, their large size and high computational cost hinder practical deployment in resource-constrained environments. Previous distillation efforts for SSL models have often been constrained (e.g., freezing upstream models during downstream tasks) and focused on general benchmarks like SUPERB, potentially underutilizing the models’ capabilities for specific tasks like automatic speech recognition (ASR). Additionally, the retention of waveform-based front-ends in distilled models adds unnecessary computational overhead. This work aims to address these issues by exploring effective distillation strategies for HuBERT, focusing on optimal student architectures, improved loss functions, and efficient front-end processing for ASR.

Method

The paper proposes three main strategies for distilling HuBERT-based SSL models for ASR:

  1. Exploration of Student Model Structures: Two architectures are compared: deep & thin (D&T), with more layers and narrower dimensions, and shallow & wide (S&W), with fewer layers and wider dimensions. The comparison is run under unconstrained conditions (the entire model is fine-tuned on the downstream task) to identify the better structure for ASR at a limited parameter budget.
  2. Discriminative Loss for Distillation: In addition to the conventional regression loss $\mathcal{L}_{reg}$, which minimizes the L1 and cosine-similarity distances between teacher and student hidden-layer outputs, a discriminative loss $\mathcal{L}_{disc}$ based on the KL divergence between the probability distributions of the teacher and student models is introduced. The combined loss is $\mathcal{L}_{distill} = \lambda_{reg} \mathcal{L}_{reg} + \lambda_{disc} \mathcal{L}_{disc}$, aiming to enhance performance, especially in low-resource scenarios (see the loss sketch after this list).
  3. Front-End Distillation: A novel pipeline distills the input front-end from waveform (processed by multiple CNN layers) to Fbank features (processed by a single CNN layer) in the student model. Training proceeds in two steps: the initial steps focus on front-end adaptation using $\mathcal{L}_{frontend}$ (an L1 or L2 loss), after which standard distillation with $\mathcal{L}_{distill}$ takes over, reducing parameters and inference time (see the schedule sketch after this list).
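
To make the loss formulation concrete, here is a minimal PyTorch sketch of the combined objective. The tensor shapes, layer-matching scheme, and function name are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hidden, teacher_hidden,
                      student_logits, teacher_logits,
                      lambda_reg=1.0, lambda_disc=1.0):
    """Sketch of L_distill = lambda_reg * L_reg + lambda_disc * L_disc.

    student_hidden / teacher_hidden: (batch, time, dim) hidden states from
        matched student/teacher layers (the layer mapping is an assumption here).
    student_logits / teacher_logits: (batch, time, n_clusters) predictions over
        HuBERT-style pseudo-label clusters.
    """
    # Regression loss: L1 distance plus (1 - cosine similarity) between
    # teacher and student hidden representations.
    l1 = F.l1_loss(student_hidden, teacher_hidden)
    cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    l_reg = l1 + cos

    # Discriminative loss: KL divergence between the teacher's and student's
    # distributions over pseudo-labels.
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    l_disc = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    return lambda_reg * l_reg + lambda_disc * l_disc
```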

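The two-step front-end distillation schedule could be organized roughly as below, reusing the `distillation_loss` sketch above. The warm-up threshold, model interfaces, and the assumption that the two front-end outputs share a shape are placeholders for illustration.

```python
import torch
import torch.nn.functional as F

def training_step(step, batch, student, teacher, optimizer,
                  frontend_warmup_steps=5000):  # threshold is an illustrative choice
    """Two-step front-end distillation: first adapt the student's single-CNN
    Fbank front-end toward the teacher's waveform CNN output, then switch to
    the standard distillation loss on the transformer layers."""
    optimizer.zero_grad()

    with torch.no_grad():
        # Teacher consumes raw waveform through its multi-layer CNN front-end.
        t_frontend, t_hidden, t_logits = teacher(batch["waveform"])

    # Student consumes Fbank features through a single CNN layer
    # (front-end outputs are assumed to be shape-aligned with the teacher's).
    s_frontend, s_hidden, s_logits = student(batch["fbank"])

    if step < frontend_warmup_steps:
        # Step 1: front-end adaptation with an L1 (or L2) loss.
        loss = F.l1_loss(s_frontend, t_frontend)
    else:
        # Step 2: standard distillation loss (see distillation_loss above).
        loss = distillation_loss(s_hidden, t_hidden, s_logits, t_logits)

    loss.backward()
    optimizer.step()
    return loss.item()
```
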
Experiment

The experiments use the HuBERT Base model as the teacher, with distillation performed on the 960-hour LibriSpeech dataset in an unsupervised manner. Two student architectures (D&T and S&W) are tested, each with waveform and Fbank front-ends, and fine-tuning is performed on low-resource (1-hour and 10-hour Libri-light) and moderate-resource (100-hour LibriSpeech) splits using CTC loss. Consistent with the summary above, the distilled students retain most of the teacher's ASR performance while cutting parameters by about 17% and roughly doubling inference speed, and the discriminative loss yields its clearest gains in the low-resource fine-tuning settings.
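
As a rough illustration of the fine-tuning stage, a CTC head on top of the distilled student might look like the following; the hidden dimension, vocabulary size, and batch fields are assumptions for the sketch, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCFinetuneHead(nn.Module):
    """Linear projection from the student's hidden states to a CTC vocabulary."""
    def __init__(self, hidden_dim=512, vocab_size=32):  # sizes are illustrative
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, hidden_states):
        return self.proj(hidden_states)  # (batch, time, vocab)

def ctc_finetune_step(student, head, batch, optimizer, blank_id=0):
    optimizer.zero_grad()
    _, hidden, _ = student(batch["fbank"])           # distilled student encoder
    log_probs = F.log_softmax(head(hidden), dim=-1)  # (batch, time, vocab)
    # F.ctc_loss expects (time, batch, vocab) log-probabilities.
    loss = F.ctc_loss(log_probs.transpose(0, 1),
                      batch["targets"],
                      batch["input_lengths"],
                      batch["target_lengths"],
                      blank=blank_id, zero_infinity=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```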

Further Thoughts

The introduction of discriminative loss is a notable contribution, particularly for low-resource ASR, as it leverages probability distributions over pseudo-labels, aligning with HuBERT’s training objective. However, its limited impact in higher-resource settings suggests that the loss combination might need adaptive weighting based on data availability, an area worth exploring. The front-end distillation to Fbank features is practically significant for deployment, but I wonder if this approach generalizes to other SSL models like Wav2vec 2.0 or WavLM, which may have different front-end dependencies. Additionally, the paper’s focus on ASR raises questions about applicability to other speech tasks (e.g., speaker identification or emotion recognition), where waveform features might carry critical information lost in Fbank transformation. Connecting this to broader AI efficiency trends, such as parameter-efficient fine-tuning methods like LoRA in large language models, could inspire hybrid approaches for SSL distillation. Finally, the CCA similarity analysis hints at representational insights, but deeper investigation into why discriminative loss enhances linguistic representation in low-resource cases could link to emergent abilities in foundation models, potentially guiding future SSL compression strategies.


