Brief Review — HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Hidden-Unit BERT (HuBERT): Outperforms or Is on Par With Conformer, Noisy Student, wav2vec 2.0, and Conformer XXL

Sik-Ho Tsang
Aug 20, 2024

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
HuBERT, by Facebook Inc., Facebook AI Research, and Carnegie Mellon University
2021 TASLP, Over 2000 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
2019 [wav2vec] 2020 [wav2vec 2.0]
==== My Other Paper Readings Are Also Over Here ====

  • Hidden-Unit BERT (HuBERT) is proposed for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
  • A key ingredient of the approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs.
  • HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels.

Outline

  1. HuBERT
  2. Results

1. HuBERT

(Figure: the Hidden-Unit BERT (HuBERT) approach.)

1.1. Representation Learning via Masked Prediction

  • Let X denote a speech utterance X = [x1, … , xT] of T frames. Discovered hidden units are denoted with h(X) = Z = [z1, …, zT], where zt ∈ {1, …, C} is a C-class categorical variable and h is a clustering model, e.g. k-means.
  • Let M ⊂ {1, …, T} denote the set of indices to be masked, and X̃ = r(X, M) denote a corrupted version of X where xt is replaced with a mask embedding x̃ for every t ∈ M.
  • To generate M, p% of the timesteps are randomly selected as start indices, and spans of l steps are masked (a minimal masking sketch follows below).
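
As an illustration, here is a minimal NumPy sketch of such span masking. The function name span_mask, the defaults p = 8% and l = 10, and the handling of overlapping spans are my own choices for the sketch, not taken from the official implementation.

```python
import numpy as np

def span_mask(T, p=0.08, l=10, seed=0):
    """Pick p% of the T timesteps as span starts and mask l consecutive
    steps from each start (spans may overlap and are clipped at T)."""
    rng = np.random.default_rng(seed)
    num_starts = max(1, int(round(p * T)))
    starts = rng.choice(T, size=num_starts, replace=False)
    masked = set()
    for s in starts:
        masked.update(range(s, min(s + l, T)))
    return sorted(masked)

M = span_mask(T=200)   # indices of masked frames for a 200-frame utterance
print(len(M), M[:12])
```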

To learn to predict the cluster targets of the masked frames, cross-entropy losses are computed over the masked and unmasked timesteps, denoted Lm and Lu, respectively.

  • Lm sums the cross-entropy over the masked timesteps: Lm(f; X, M, Z) = Σt∈M log pf(zt | X̃, t).
  • Lu is of the same form, except that it sums over t ∉ M.
  • The final loss is computed as a weighted sum of the two terms: L = α Lm + (1 − α) Lu, where α ∈ [0, 1]; setting α = 1 computes the loss over the masked timesteps only, similar to BERT (a sketch of this loss follows after this list).
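
To make the objective concrete, below is a minimal PyTorch sketch of the weighted masked/unmasked cross-entropy loss. It assumes per-frame logits over the C cluster units are already available (e.g., from the parameterization described in Section 1.4); the function name hubert_loss and the tensor shapes are illustrative, not the official implementation.

```python
import torch
import torch.nn.functional as F

def hubert_loss(logits, targets, mask, alpha=1.0):
    """logits: (T, C) per-frame scores over C cluster units,
    targets: (T,) cluster labels z_t, mask: (T,) bool, True where masked."""
    per_frame = F.cross_entropy(logits, targets, reduction="none")  # -log p_f(z_t | X~, t)
    L_m = per_frame[mask].sum()     # loss over masked timesteps
    L_u = per_frame[~mask].sum()    # loss over unmasked timesteps
    return alpha * L_m + (1.0 - alpha) * L_u

T, C = 200, 500
logits = torch.randn(T, C)
targets = torch.randint(0, C, (T,))
mask = torch.zeros(T, dtype=torch.bool); mask[20:30] = True
print(hubert_loss(logits, targets, mask, alpha=1.0))
```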

1.2. Learning With Cluster Ensembles

A simple idea to improve target quality is to utilize multiple clustering models; for example, an ensemble of k-means models with different codebook sizes can create targets of different granularity, from manner classes (vowel/consonant) to sub-phone states (senones).

  • Let Z(k) be the target sequence generated by the k-th clustering model. Lm can be rewritten as a sum over all clustering models: Lm(f; X, {Z(k)}k, M) = Σt∈M Σk log pf(k)(zt(k) | X̃, t), (a minimal sketch follows after this list)
  • and similarly for the unmasked loss Lu.
  • Additionally, ensembling is intriguing because it can be used alongside product quantization (PQ) [40] where a feature space is partitioned into multiple subspaces, and each subspace is quantized separately.
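
A minimal sketch of the cluster-ensemble loss is given below, assuming one prediction head per clustering model and summing the per-codebook masked cross-entropies. For simplicity it uses plain linear heads rather than the paper's cosine-similarity parameterization (Section 1.4), and the class name EnsembleHeads is my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsembleHeads(nn.Module):
    """One prediction head per clustering model (e.g. k-means with different codebook sizes)."""
    def __init__(self, dim, codebook_sizes=(100, 500)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(dim, C) for C in codebook_sizes])

    def masked_loss(self, features, targets_per_codebook, mask):
        # features: (T, dim) BERT-encoder outputs; targets_per_codebook: list of (T,) label tensors
        loss = 0.0
        for head, z in zip(self.heads, targets_per_codebook):
            per_frame = F.cross_entropy(head(features), z, reduction="none")
            loss = loss + per_frame[mask].sum()   # sum the masked losses over clustering models
        return loss

T, D = 200, 768
heads = EnsembleHeads(D)
feats = torch.randn(T, D)
targets = [torch.randint(0, 100, (T,)), torch.randint(0, 500, (T,))]
mask = torch.zeros(T, dtype=torch.bool); mask[50:60] = True
print(heads.masked_loss(feats, targets, mask))
```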

1.3. Iterative Refinement of Cluster Assignments

  • Another direction for improved representation is refining the cluster assignments throughout the learning process.
  • Since we expect a pre-trained model to provide better representations than raw acoustic features such as MFCCs, we can create a new generation of cluster targets by training a discrete latent model over the learned latent representations (see the sketch after this list).
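
The refinement procedure can be sketched as follows. The feature extractors below are random placeholders standing in for MFCC features (first iteration) and intermediate Transformer-layer features of a previously pre-trained HuBERT model (later iterations); the cluster counts of 100 and then 500 roughly follow the paper's BASE setup, but everything else is illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def mfcc_features(waveforms):
    # Placeholder: 39-dim MFCC-like frames at a 20 ms rate (320 samples per frame at 16 kHz).
    return [np.random.randn(len(w) // 320, 39) for w in waveforms]

def hubert_layer_features(waveforms):
    # Placeholder: intermediate Transformer-layer features of a previously pre-trained model.
    return [np.random.randn(len(w) // 320, 768) for w in waveforms]

def cluster_targets(feature_fn, waveforms, n_clusters):
    feats = feature_fn(waveforms)                          # per-utterance frame features
    km = MiniBatchKMeans(n_clusters=n_clusters).fit(np.concatenate(feats, axis=0))
    return [km.predict(f) for f in feats]                  # frame-level cluster labels z_t

waveforms = [np.random.randn(16000 * 5) for _ in range(10)]              # toy 5-second utterances
targets_1 = cluster_targets(mfcc_features, waveforms, n_clusters=100)    # iteration 1: MFCC
# ... pre-train HuBERT on targets_1, then re-cluster its learned features ...
targets_2 = cluster_targets(hubert_layer_features, waveforms, n_clusters=500)  # iteration 2
print(targets_1[0][:10], targets_2[0][:10])
```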

1.4. Implementation

  • The wav2vec 2.0 architecture is reused, with a convolutional waveform encoder, a BERT-style Transformer encoder, a projection layer, and a code embedding layer.
  • HuBERT comes in three configurations: BASE, LARGE, and X-LARGE (see the configuration table in the paper for the exact hyperparameters).
  • The first two follow the architectures of wav2vec 2.0 BASE and LARGE closely. The last one is about 1 billion parameters, similar to the size of the Conformer XXL model.
  • The waveform encoder is composed of seven 512-channel convolutional layers with strides [5, 2, 2, 2, 2, 2, 2] and kernel widths [10, 3, 3, 3, 3, 2, 2]. The overall stride is 5 × 2⁶ = 320 samples, so the encoder generates a feature sequence at a 20 ms framerate for audio sampled at 16 kHz (320 / 16,000 s = 20 ms).
  • The audio encoded features are then randomly masked as described.
  • The BERT encoder takes the masked sequence X̃ as input and outputs a feature sequence [o1, …, oT]. The distribution over codewords is parameterized with a cosine-similarity softmax: pf(c | X̃, t) = exp(sim(A ot, ec) / τ) / Σc′ exp(sim(A ot, ec′) / τ),
  • where A is the projection matrix, ec is the embedding for codeword c, sim(·, ·) computes the cosine similarity between two vectors, and τ scales the logit (set to 0.1). A minimal sketch of this parameterization follows after this list.
  • After HuBERT pre-training, the connectionist temporal classification (CTC) loss is used for ASR fine-tuning. The projection layer(s) are removed and replaced with a randomly initialized softmax layer. The CTC target vocabulary includes the 26 English characters, a space token, an apostrophe, and a special CTC blank symbol.
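
Below is a minimal PyTorch sketch of this cosine-similarity softmax over codewords. It assumes A projects the BERT output ot into the embedding space of the codewords ec and that τ = 0.1; the function name and dimensions are illustrative rather than the official implementation.

```python
import torch
import torch.nn.functional as F

def codeword_distribution(o, A, E, tau=0.1):
    """o: (T, D) BERT-encoder outputs, A: (D_proj, D) projection matrix,
    E: (C, D_proj) codeword embeddings. Returns (T, C) probabilities p_f(c | X~, t)."""
    proj = o @ A.T                                                        # project features: (T, D_proj)
    sim = F.cosine_similarity(proj.unsqueeze(1), E.unsqueeze(0), dim=-1)  # cosine similarities: (T, C)
    return torch.softmax(sim / tau, dim=-1)

T, D, D_proj, C = 200, 768, 256, 500
o = torch.randn(T, D); A = torch.randn(D_proj, D); E = torch.randn(C, D_proj)
probs = codeword_distribution(o, A, E)
print(probs.shape, probs.sum(dim=-1)[:3])   # each row sums to 1
```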

2. Results

  • For unsupervised pre-training, the full 960 hours of LibriSpeech audio or 60,000 hours of Libri-light audio are used, both of which are derived from the LibriVox project that contains English recordings of copyright-free audiobooks by volunteers from the Internet.
  • For supervised fine-tuning, five different partitions are considered: Libri-light 10-minute, 1-hour, 10-hour splits and LibriSpeech 100-hour (train-clean-100) and 960-hour (train-clean-100, train-clean-360, train-other-500 combined) splits.
  • The three Libri-light splits are subsets of the LibriSpeech training split, and each of them contains half of its audio from train-clean-* and the other half from train-other-500.
  • (The detailed experimental settings are tedious; please read the paper directly if interested.)

Low Resource Setups

  • For the low-resource setups, where pre-trained models are fine-tuned on 10 minutes, 1 hour, 10 hours, or 100 hours of labeled data, increasing the amount of unlabeled data and increasing the model size both improve performance.

In addition, HuBERT also outperforms DiscreteBERT by a large margin in all setups.

High Resource Setups

HuBERT outperforms the state-of-the-art supervised and self-training methods (e.g.: Conformer and Noisy Student) and is on par with the two best pre-training results in the literature (e.g.: wav2vec 2.0 and Conformer XXL).

  • (There are other results, please read the paper directly if interested.)

