Brief Review — HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Hidden-Unit BERT (HuBERT): Outperforms or Is on Par With Conformer, Noisy Student, wav2vec 2.0, and Conformer XXL

Sik-Ho Tsang
Aug 20, 2024

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
HuBERT, by Facebook Inc., Facebook AI Research, and Carnegie Mellon University
2021 TASLP, Over 2000 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
2019 [wav2vec] 2020 [wav2vec 2.0]
==== My Other Paper Readings Are Also Over Here ====

  • Hidden-Unit BERT (HuBERT) is proposed for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
  • A key ingredient of the approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs.
  • HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels.

Outline

  1. HuBERT
  2. Results

1. HuBERT

(Figure: the Hidden-Unit BERT (HuBERT) approach.)

1.1. Representation Learning via Masked Prediction

  • Let X denote a speech utterance X = [x1, … , xT] of T frames. Discovered hidden units are denoted with h(X) = Z = [z1, …, zT], where zt ∈ {1, …, C} is a C-class categorical variable and h is a clustering model, e.g. k-means.
  • Let M ⊂ {1, …, T} denote the set of indices to be masked, and X̃ = r(X, M) denote a corrupted version of X where xt is replaced with a mask embedding x̃ for every t ∈ M.
  • To generate M, p% of the timesteps are randomly selected as start indices, and spans of l steps are masked (a minimal masking sketch follows below).
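
As an illustration, here is a minimal NumPy sketch of such span masking. The function name span_mask, the defaults p = 8% and l = 10, and the handling of overlapping spans are my own choices for the sketch, not taken from the official implementation.

```python
import numpy as np

def span_mask(T, p=0.08, l=10, seed=0):
    """Pick p% of the T timesteps as span starts and mask l consecutive
    steps from each start (spans may overlap and are clipped at T)."""
    rng = np.random.default_rng(seed)
    num_starts = max(1, int(round(p * T)))
    starts = rng.choice(T, size=num_starts, replace=False)
    masked = set()
    for s in starts:
        masked.update(range(s, min(s + l, T)))
    return sorted(masked)

M = span_mask(T=200)   # indices of masked frames for a 200-frame utterance
print(len(M), M[:12])
```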

To learn to predict the cluster targets of the masked frames, cross-entropy losses are computed over the masked and unmasked timesteps, denoted Lm and Lu, respectively.

  • Lm sums the cross-entropy over the masked timesteps: Lm(f; X, M, Z) = Σt∈M log pf(zt | X̃, t).
  • Lu is of the same form, except that it sums over t ∉ M.
  • The final loss is computed as a weighted sum of the two terms: L = α Lm + (1 − α) Lu, where α ∈ [0, 1]; setting α = 1 computes the loss over the masked timesteps only, similar to BERT (a sketch of this loss follows after this list).
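
To make the objective concrete, below is a minimal PyTorch sketch of the weighted masked/unmasked cross-entropy loss. It assumes per-frame logits over the C cluster units are already available (e.g., from the parameterization described in Section 1.4); the function name hubert_loss and the tensor shapes are illustrative, not the official implementation.

```python
import torch
import torch.nn.functional as F

def hubert_loss(logits, targets, mask, alpha=1.0):
    """logits: (T, C) per-frame scores over C cluster units,
    targets: (T,) cluster labels z_t, mask: (T,) bool, True where masked."""
    per_frame = F.cross_entropy(logits, targets, reduction="none")  # -log p_f(z_t | X~, t)
    L_m = per_frame[mask].sum()     # loss over masked timesteps
    L_u = per_frame[~mask].sum()    # loss over unmasked timesteps
    return alpha * L_m + (1.0 - alpha) * L_u

T, C = 200, 500
logits = torch.randn(T, C)
targets = torch.randint(0, C, (T,))
mask = torch.zeros(T, dtype=torch.bool); mask[20:30] = True
print(hubert_loss(logits, targets, mask, alpha=1.0))
```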

1.2. Learning With Cluster Ensembles

A simple idea to improve target quality is to utilize multiple clustering models; for example, an ensemble of k-means models with different codebook sizes can create targets of different granularity, from manner classes (vowel/consonant) to sub-phone states (senones).

  • Let Z(k) be the target sequence generated by the k-th clustering model. Lm can be rewritten as a sum over all clustering models: Lm(f; X, {Z(k)}k, M) = Σt∈M Σk log pf(k)(zt(k) | X̃, t), (a minimal sketch follows after this list)
  • and similarly for the unmasked loss Lu.
  • Additionally, ensembling is intriguing because it can be used alongside product quantization (PQ) [40] where a feature space is partitioned into multiple subspaces, and each subspace is quantized separately.
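
A minimal sketch of the cluster-ensemble loss is given below, assuming one prediction head per clustering model and summing the per-codebook masked cross-entropies. For simplicity it uses plain linear heads rather than the paper's cosine-similarity parameterization (Section 1.4), and the class name EnsembleHeads is my own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsembleHeads(nn.Module):
    """One prediction head per clustering model (e.g. k-means with different codebook sizes)."""
    def __init__(self, dim, codebook_sizes=(100, 500)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(dim, C) for C in codebook_sizes])

    def masked_loss(self, features, targets_per_codebook, mask):
        # features: (T, dim) BERT-encoder outputs; targets_per_codebook: list of (T,) label tensors
        loss = 0.0
        for head, z in zip(self.heads, targets_per_codebook):
            per_frame = F.cross_entropy(head(features), z, reduction="none")
            loss = loss + per_frame[mask].sum()   # sum the masked losses over clustering models
        return loss

T, D = 200, 768
heads = EnsembleHeads(D)
feats = torch.randn(T, D)
targets = [torch.randint(0, 100, (T,)), torch.randint(0, 500, (T,))]
mask = torch.zeros(T, dtype=torch.bool); mask[50:60] = True
print(heads.masked_loss(feats, targets, mask))
```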

1.3. Iterative Refinement of Cluster Assignments

  • Another direction for improved representation is refining the cluster assignments throughout the learning process.
  • Since we expect a pre-trained model to provide better representations than raw acoustic features such as MFCCs, we can create a new generation of cluster targets by training a discrete latent model over the learned latent representations (see the sketch after this list).
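
The refinement procedure can be sketched as follows. The feature extractors below are random placeholders standing in for MFCC features (first iteration) and intermediate Transformer-layer features of a previously pre-trained HuBERT model (later iterations); the cluster counts of 100 and then 500 roughly follow the paper's BASE setup, but everything else is illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def mfcc_features(waveforms):
    # Placeholder: 39-dim MFCC-like frames at a 20 ms rate (320 samples per frame at 16 kHz).
    return [np.random.randn(len(w) // 320, 39) for w in waveforms]

def hubert_layer_features(waveforms):
    # Placeholder: intermediate Transformer-layer features of a previously pre-trained model.
    return [np.random.randn(len(w) // 320, 768) for w in waveforms]

def cluster_targets(feature_fn, waveforms, n_clusters):
    feats = feature_fn(waveforms)                          # per-utterance frame features
    km = MiniBatchKMeans(n_clusters=n_clusters).fit(np.concatenate(feats, axis=0))
    return [km.predict(f) for f in feats]                  # frame-level cluster labels z_t

waveforms = [np.random.randn(16000 * 5) for _ in range(10)]              # toy 5-second utterances
targets_1 = cluster_targets(mfcc_features, waveforms, n_clusters=100)    # iteration 1: MFCC
# ... pre-train HuBERT on targets_1, then re-cluster its learned features ...
targets_2 = cluster_targets(hubert_layer_features, waveforms, n_clusters=500)  # iteration 2
print(targets_1[0][:10], targets_2[0][:10])
```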

1.4. Implementation

  • The wav2vec 2.0 architecture is reused, with a convolutional waveform encoder, a BERT-style Transformer encoder, a projection layer, and a code embedding layer.
  • HuBERT comes in three configurations: BASE, LARGE, and X-LARGE (see the configuration table in the paper for the exact hyperparameters).
  • The first two follow the architectures of wav2vec 2.0 BASE and LARGE closely. The last one is about 1 billion parameters, similar to the size of the Conformer XXL model.
  • The waveform encoder is composed of seven 512-channel convolutional layers with strides [5, 2, 2, 2, 2, 2, 2] and kernel widths [10, 3, 3, 3, 3, 2, 2]. The overall stride is 5 × 2⁶ = 320 samples, so the encoder generates a feature sequence at a 20 ms framerate for audio sampled at 16 kHz (320 / 16,000 s = 20 ms).
  • The audio encoded features are then randomly masked as described.
  • The BERT encoder takes the masked sequence X̃ as input and outputs a feature sequence [o1, …, oT]. The distribution over codewords is parameterized with a cosine-similarity softmax: pf(c | X̃, t) = exp(sim(A ot, ec) / τ) / Σc′ exp(sim(A ot, ec′) / τ),
  • where A is the projection matrix, ec is the embedding for codeword c, sim(·, ·) computes the cosine similarity between two vectors, and τ scales the logit (set to 0.1). A minimal sketch of this parameterization follows after this list.
  • After HuBERT pre-training, the connectionist temporal classification (CTC) loss is used for ASR fine-tuning. The projection layer(s) are removed and replaced with a randomly initialized softmax layer. The CTC target vocabulary includes the 26 English characters, a space token, an apostrophe, and a special CTC blank symbol.
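
Below is a minimal PyTorch sketch of this cosine-similarity softmax over codewords. It assumes A projects the BERT output ot into the embedding space of the codewords ec and that τ = 0.1; the function name and dimensions are illustrative rather than the official implementation.

```python
import torch
import torch.nn.functional as F

def codeword_distribution(o, A, E, tau=0.1):
    """o: (T, D) BERT-encoder outputs, A: (D_proj, D) projection matrix,
    E: (C, D_proj) codeword embeddings. Returns (T, C) probabilities p_f(c | X~, t)."""
    proj = o @ A.T                                                        # project features: (T, D_proj)
    sim = F.cosine_similarity(proj.unsqueeze(1), E.unsqueeze(0), dim=-1)  # cosine similarities: (T, C)
    return torch.softmax(sim / tau, dim=-1)

T, D, D_proj, C = 200, 768, 256, 500
o = torch.randn(T, D); A = torch.randn(D_proj, D); E = torch.randn(C, D_proj)
probs = codeword_distribution(o, A, E)
print(probs.shape, probs.sum(dim=-1)[:3])   # each row sums to 1
```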

2. Results

  • For unsupervised pre-training, the full 960 hours of LibriSpeech audio or 60,000 hours of Libri-light audio are used, both of which are derived from the LibriVox project that contains English recordings of copyright-free audiobooks by volunteers from the Internet.
  • For supervised fine-tuning, five different partitions are considered: Libri-light 10-minute, 1-hour, 10-hour splits and LibriSpeech 100-hour (train-clean-100) and 960-hour (train-clean-100, train-clean-360, train-other-500 combined) splits.
  • The three Libri-light splits are subsets of the LibriSpeech training split, and each of them contains half of its audio from train-clean-* and the other half from train-other-500.
  • (The detailed experimental settings are tedious; please read the paper directly if interested.)

Low Resource Setups

  • For the low-resource setups, where pre-trained models are fine-tuned on 10 minutes, 1 hour, 10 hours, or 100 hours of labeled data, increasing the amount of unlabeled data and increasing the model size both improve performance.

In addition, HuBERT also outperforms DiscreteBERT by a large margin in all setups.

High Resource Setups

HuBERT outperforms the state-of-the-art supervised and self-training methods (e.g.: Conformer and Noisy Student) and is on par with the two best pre-training results in the literature (e.g.: wav2vec 2.0 and Conformer XXL).

  • (There are other results, please read the paper directly if interested.)

