Brief Review — HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Hidden-Unit BERT (HuBERT): Outperforms or Is On Par With Conformer, Noisy Student, wav2vec 2.0, and Conformer XXL
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
HuBERT, by Facebook Inc., Facebook AI Research, and Carnegie Mellon University
2021 TASLP, Over 2000 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning: 2019 [wav2vec], 2020 [wav2vec 2.0]
- Hidden-Unit BERT (HuBERT) is proposed for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
- A key ingredient of the approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs.
- HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels.
Outline
- HuBERT
- Results
1. HuBERT
1.1. Representation Learning via Masked Prediction
- Let X denote a speech utterance X = [x1, … , xT] of T frames. Discovered hidden units are denoted with h(X) = Z = [z1, …, zT], where zt ∈ {1, …, C} is a C-class categorical variable and h is a clustering model, e.g. k-means.
- Let M ⊂ {1, …, T} denote the set of indices to be masked, and ˜X = r(X, M) denote a corrupted version of X where xt is replaced with a mask embedding ˜x for t ∈ M.
- p% of the timesteps are randomly selected as start indices, and spans of l steps are masked.
- The model is trained to predict the hidden units; cross-entropy losses are computed over the masked and unmasked timesteps, denoted Lm and Lu respectively.
- Lm is: Lm(f; X, M, Z) = Σ_{t ∈ M} log pf(zt | ˜X, t), where f is the masked prediction model and pf(· | ˜X, t) is its predicted distribution over the C classes at timestep t.
- and Lu has the same form, but sums over the unmasked timesteps t ∉ M.
- The final loss is computed as a weighted sum of the two terms, L = αLm + (1 − α)Lu; with α = 1, the loss is computed over the masked timesteps only, analogous to BERT. A short sketch follows this list.
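To make this concrete, below is a minimal PyTorch sketch of the span masking and the weighted loss. The function names, batch-free shapes, and the defaults p = 0.08 and l = 10 (roughly the pre-training setting reported in the paper) are illustrative assumptions, not the fairseq implementation:

```python
import torch
import torch.nn.functional as F

def span_mask(T: int, p: float = 0.08, l: int = 10) -> torch.Tensor:
    """Pick p% of the timesteps as span starts and mask l consecutive steps each."""
    mask = torch.zeros(T, dtype=torch.bool)
    num_starts = max(1, int(p * T))
    starts = torch.randperm(T)[:num_starts]
    for s in starts:
        mask[s:s + l] = True           # spans may overlap
    return mask                        # True = masked frame (the set M)

def hubert_loss(logits: torch.Tensor,    # [T, C] codeword logits from the BERT encoder
                targets: torch.Tensor,   # [T] k-means cluster ids (hidden units zt)
                mask: torch.Tensor,      # [T] bool mask for M
                alpha: float = 1.0) -> torch.Tensor:
    """L = alpha * Lm + (1 - alpha) * Lu; alpha = 1 keeps only the masked-frame loss."""
    Lm = F.cross_entropy(logits[mask], targets[mask])
    if alpha == 1.0:
        return Lm
    Lu = F.cross_entropy(logits[~mask], targets[~mask])
    return alpha * Lm + (1 - alpha) * Lu
```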
1.2. Learning With Cluster Ensembles
- A simple idea to improve target quality is to utilize multiple clustering models, e.g.: an ensemble of k-means models with different codebook sizes can create targets of different granularity, from manner classes (vowel/consonant) to sub-phone states (senones); a sketch of generating such ensemble targets follows this list.
- Let Z(k) be the target sequence generated by the k-th clustering model. Lm can then be rewritten as: Lm(f; X, M, {Z(k)}k) = Σ_{t ∈ M} Σ_k log pf(k)(zt(k) | ˜X, t).
- and similarly for the unmasked loss Lu.
- Additionally, ensembling is intriguing because it can be used alongside product quantization (PQ) [40] where a feature space is partitioned into multiple subspaces, and each subspace is quantized separately.
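As a toy illustration of ensemble targets, the sketch below clusters a stand-in feature matrix with k-means models of several codebook sizes; the random features, the chosen sizes, and the use of scikit-learn's MiniBatchKMeans are assumptions of the sketch, not the paper's exact setup:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for 39-dimensional MFCC frames gathered over the training audio.
features = np.random.randn(100_000, 39)

# Different codebook sizes yield targets of different granularity,
# from coarse manner-like classes to fine sub-phone-like units.
codebook_sizes = [50, 100, 500]
ensemble_targets = []
for C in codebook_sizes:
    km = MiniBatchKMeans(n_clusters=C, batch_size=10_000).fit(features)
    ensemble_targets.append(km.predict(features))   # Z(k) for the k-th clustering model

# During pre-training, the masked loss sums the per-codebook cross-entropies
# over the same masked timesteps.
```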
1.3. Iterative Refinement of Cluster Assignments
- Another direction for improved representation is refining the cluster assignments throughout the learning process.
- Since we expect a pre-trained model to provide better representations than raw acoustic features such as MFCCs, we can create a new generation of clusters by training a discrete latent model (e.g. k-means) over the learned latent representations, as sketched below.
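A minimal sketch of this refinement loop, assuming hypothetical helpers extract_mfcc, kmeans_labels, pretrain_hubert, and extract_layer_features (none of these names come from the paper or fairseq; the codebook sizes and the choice of layer are only indicative):

```python
def iterative_refinement(audio, num_iterations=2,
                         codebook_sizes=(100, 500), refit_layer=6):
    """Alternate between clustering features and masked-prediction pre-training."""
    # Iteration 1: cluster raw acoustic features (MFCCs) for the initial hidden units.
    features = extract_mfcc(audio)
    model = None
    for it in range(num_iterations):
        targets = kmeans_labels(features, k=codebook_sizes[it])   # discrete units Z
        model = pretrain_hubert(audio, targets)                   # masked prediction
        # Later iterations re-cluster an intermediate Transformer layer of the model
        # just trained, which is expected to beat the raw MFCC features.
        features = extract_layer_features(model, audio, layer=refit_layer)
    return model
```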
1.4. Implementation
- The wav2vec 2.0 architecture is used, with a convolutional waveform encoder followed by a BERT encoder.
- HuBERT comes in 3 configurations: BASE, LARGE, and X-LARGE.
- The first two follow the architectures of wav2vec 2.0 BASE and LARGE closely. The last one has about 1 billion parameters, similar in size to the Conformer XXL model.
- The waveform encoder is composed of seven 512-channel layers with strides [5,2,2,2,2,2,2] and kernel widths [10,3,3,3,3,2,2], for a total stride of 320 samples; it therefore generates a feature sequence at a 20 ms framerate for audio sampled at 16 kHz (320 / 16000 s = 20 ms).
- The audio encoded features are then randomly masked as described.
- The BERT encoder takes the masked sequence as input and outputs a feature sequence [o1, …, oT]. The distribution over codewords is parameterized with: pf(c | ˜X, t) = exp(sim(A ot, ec) / τ) / Σ_{c′=1..C} exp(sim(A ot, ec′) / τ),
- where A is the projection matrix, ec is the embedding for codeword c, sim(·, ·) computes the cosine similarity between two vectors, and τ (= 0.1) scales the logits. A sketch of this parameterization follows this list.
- After HuBERT pre-training, the connectionist temporal classification (CTC) loss is used for ASR fine-tuning. The projection layer(s) are removed and replaced with a randomly initialized softmax layer. The CTC target vocabulary includes 26 English characters, a space token, an apostrophe, and a special CTC blank symbol.
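A minimal PyTorch sketch of the codeword distribution described above; the module name and layer sizes are illustrative (encoder_dim = 768 and proj_dim = 256 roughly match the BASE configuration), while the cosine-similarity logits and the temperature τ = 0.1 follow the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodewordPredictor(nn.Module):
    def __init__(self, encoder_dim: int = 768, proj_dim: int = 256,
                 num_codewords: int = 500, tau: float = 0.1):
        super().__init__()
        self.A = nn.Linear(encoder_dim, proj_dim, bias=False)        # projection matrix A
        self.e = nn.Parameter(torch.randn(num_codewords, proj_dim))  # codeword embeddings ec
        self.tau = tau

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        """o: [T, encoder_dim] BERT encoder outputs -> [T, C] log-probabilities over codewords."""
        proj = F.normalize(self.A(o), dim=-1)      # unit-norm projected features
        codes = F.normalize(self.e, dim=-1)        # unit-norm codeword embeddings
        logits = proj @ codes.T / self.tau         # cosine similarity scaled by 1/tau
        return F.log_softmax(logits, dim=-1)
```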
2. Results
- For unsupervised pre-training, the full 960 hours of LibriSpeech audio or 60,000 hours of Libri-light audio are used, both of which are derived from the LibriVox project that contains English recordings of copyright-free audiobooks by volunteers from the Internet.
- For supervised fine-tuning, five different partitions are considered: Libri-light 10-minute, 1-hour, 10-hour splits and LibriSpeech 100-hour (train-clean-100) and 960-hour (train-clean-100, train-clean-360, train-other-500 combined) splits.
- The three Libri-light splits are subsets of the LibriSpeech training split, and each of them contains half of its audio from train-clean-* and the other half from train-other-500.
- (The settings are tedious, please read the paper directly if interested.)
- For the low-resource setup, where pre-trained models are fine-tuned on 10 minutes, 1 hour, 10 hours, or 100 hours of labeled data, increasing the amount of unlabeled data and increasing the model size both improve performance.
- In addition, HuBERT also outperforms DiscreteBERT by a large margin in all setups.
- HuBERT outperforms the state-of-the-art supervised and self-training methods (e.g.: Conformer and Noisy Student) and is on par with the two best pre-training results in the literature (e.g.: wav2vec 2.0 and Conformer XXL).
- (There are other results, please read the paper directly if interested.)