Brief Review — wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
wav2vec 2.0, Self-Supervised Learning of Speech
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
wav2vec 2.0, by Facebook AI
2020 NeurIPS, Over 4500 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning for Speech: 2019 [wav2vec]
- wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
- For the first time, it is shown that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
Outline
- wav2vec 2.0
- Results
1. wav2vec 2.0
1.1. Model Overview
- The model is composed of a multi-layer convolutional feature encoder f: X→Z which takes as input raw audio X and outputs latent speech representations z1, …, zT for T time-steps.
- They are then fed to a Transformer g: Z→C to build representations c1, …, cT capturing information from the entire sequence.
- The output of the feature encoder is discretized to qt with a quantization module Z→Q to represent the targets (Figure 1) in the self-supervised objective.
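To make the data flow concrete, here is a minimal PyTorch sketch of the three mappings. The module internals are deliberately simplified stand-ins (a single strided convolution for the 7-block encoder f, a stock TransformerEncoder for g, and a plain linear layer for the quantization module), so only the shapes and the overall X→Z→C flow should be taken literally:

```python
import torch
import torch.nn as nn

# Stand-ins for f: X -> Z, g: Z -> C and the quantization Z -> Q.
d_model = 768                                                      # BASE model dimension

feature_encoder = nn.Conv1d(1, 512, kernel_size=400, stride=320)   # f: raw audio -> latent z (~20 ms frames)
project = nn.Linear(512, d_model)                                  # map latents to the Transformer dimension
context_network = nn.TransformerEncoder(                           # g: z -> contextualized c
    nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=3072, batch_first=True),
    num_layers=12)
quantizer = nn.Linear(512, 256)                                    # placeholder for the quantization module

x = torch.randn(1, 1, 16000)                   # 1 second of raw 16 kHz audio
z = feature_encoder(x).transpose(1, 2)         # (1, T, 512) latent speech representations, T = 49
c = context_network(project(z))                # (1, T, 768) representations of the entire sequence
q = quantizer(z)                               # (1, T, 256) quantized targets for the self-supervised objective
print(z.shape, c.shape, q.shape)
```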
1.2. Feature Encoder
- The encoder consists of several blocks containing a temporal convolution followed by layer normalization and a GELU activation function.
- The raw waveform input to the encoder is normalized to zero mean and unit variance. The total stride of the encoder determines the number of time-steps T which are input to the Transformer.
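A possible implementation of one encoder block and of the full stack, using the channel count, strides and kernel widths listed in Section 1.8; the actual fairseq implementation differs in details such as dropout and where normalization is applied:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Temporal convolution -> layer normalization -> GELU."""
    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride)
        self.norm = nn.LayerNorm(out_ch)
        self.act = nn.GELU()

    def forward(self, x):                       # x: (batch, channels, time)
        x = self.conv(x).transpose(1, 2)        # LayerNorm expects channels last
        return self.act(self.norm(x)).transpose(1, 2)

strides = (5, 2, 2, 2, 2, 2, 2)
kernels = (10, 3, 3, 3, 3, 2, 2)
blocks, in_ch = [], 1
for k, s in zip(kernels, strides):
    blocks.append(ConvBlock(in_ch, 512, k, s))
    in_ch = 512
feature_encoder = nn.Sequential(*blocks)

wav = torch.randn(1, 1, 16000)                  # 1 s of 16 kHz audio
wav = (wav - wav.mean()) / (wav.std() + 1e-7)   # normalize to zero mean and unit variance
z = feature_encoder(wav)                        # (1, 512, T); total stride 320 samples ≈ 20 ms per frame
print(z.shape[-1])                              # T = 49
```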
1.3. Contextualized Representations with Transformers
- The output of the feature encoder is fed to a context network which follows the Transformer architecture.
- Instead of fixed positional embeddings which encode absolute positional information, a convolutional layer is used which acts as a relative positional embedding.
- The output of the convolution followed by a GELU is added to the inputs and then layer normalization is applied.
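A rough sketch of this convolutional positional embedding followed by a stock Transformer encoder; the kernel size of 128 and the 16 groups are the values used in the released implementation, and the rest is simplified:

```python
import torch
import torch.nn as nn

class ConvPositionalEmbedding(nn.Module):
    """Convolution over time acting as a relative positional embedding."""
    def __init__(self, dim=768, kernel=128, groups=16):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=groups)
        self.act = nn.GELU()

    def forward(self, z):                       # z: (batch, T, dim)
        pos = self.conv(z.transpose(1, 2))      # convolve over the time axis
        pos = pos[..., :z.size(1)]              # trim the extra frame from even-sized padding
        return self.act(pos).transpose(1, 2)

dim = 768
pos_emb = ConvPositionalEmbedding(dim)
norm = nn.LayerNorm(dim)
context = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=3072, batch_first=True),
    num_layers=12)

z = torch.randn(2, 49, dim)                     # latent speech representations
c = context(norm(z + pos_emb(z)))               # add positional output, layer norm, then Transformer
print(c.shape)                                  # (2, 49, 768)
```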
1.4. Quantization Module
For self-supervised training, the output of the feature encoder z is discretized to a finite set of speech representations via product quantization.
- Given G codebooks, or groups, each with V entries e, one entry is chosen from each codebook; the resulting vectors e1, …, eG are concatenated, and a linear transformation is applied to obtain q.
- The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way. The straight-through estimator [26] is used and G hard Gumbel softmax operations are set up.
The feature encoder output z is mapped to logits l ∈ R^(G×V), and the probability of choosing the v-th codebook entry of group g is:
p_{g,v} = exp((l_{g,v} + n_v)/τ) / Σ_{k=1..V} exp((l_{g,k} + n_k)/τ),
where τ is a non-negative temperature and n = −log(−log(u)) with u sampled uniformly from U(0, 1).
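A minimal sketch of the Gumbel-softmax product quantizer, assuming the paper's choices of G = 2 codebooks with V = 320 entries each, entry size 128 and output dimension 256; torch.nn.functional.gumbel_softmax with hard=True gives the straight-through behaviour:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelProductQuantizer(nn.Module):
    """Product quantization of z with G codebooks of V entries each,
    selected via straight-through (hard) Gumbel softmax."""
    def __init__(self, in_dim=512, groups=2, entries=320, entry_dim=128, out_dim=256):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.to_logits = nn.Linear(in_dim, groups * entries)          # z -> logits l in R^{G x V}
        self.codebooks = nn.Parameter(torch.randn(groups, entries, entry_dim))
        self.out = nn.Linear(groups * entry_dim, out_dim)             # linear map on the concatenated entries

    def forward(self, z, tau=2.0):                                    # the paper anneals tau during training
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.entries)
        probs = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)  # one-hot forward, soft backward
        # Select one entry per group and concatenate e_1, ..., e_G.
        e = torch.einsum('btgv,gvd->btgd', probs, self.codebooks)
        return self.out(e.reshape(b, t, -1))                          # q: (batch, T, out_dim)

quantizer = GumbelProductQuantizer()
z = torch.randn(2, 49, 512)
q = quantizer(z)
print(q.shape)                                                        # (2, 49, 256)
```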
1.5. Training & Masking
To pre-train the model, a certain proportion of time steps in the latent feature encoder space is masked, similar to masked language modeling in BERT.
- To mask the latent speech representations output by the encoder, a certain proportion p of all time steps is randomly sampled without replacement to be starting indices and then the subsequent M consecutive time steps from every sampled index are masked.
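A sketch of this span masking with the paper's values p = 0.065 and M = 10; as in the paper, the masked positions of z are then replaced by a shared learned feature vector before entering the Transformer:

```python
import torch

def compute_mask(batch, seq_len, p=0.065, mask_len=10):
    """Sample starting indices without replacement, then mask M consecutive steps from each."""
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    num_starts = max(1, int(p * seq_len))
    for b in range(batch):
        starts = torch.randperm(seq_len)[:num_starts]          # sampling without replacement
        for s in starts.tolist():
            mask[b, s:s + mask_len] = True                      # spans may overlap or run off the end
    return mask

mask = compute_mask(batch=4, seq_len=500)
print(mask.float().mean().item())                               # roughly half of all time steps are masked

# Replace masked positions of z with a shared, learned feature vector.
masked_embed = torch.nn.Parameter(torch.randn(512))
z = torch.randn(4, 500, 512)
z_masked = torch.where(mask.unsqueeze(-1), masked_embed.expand_as(z), z)
```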
1.6. Losses
- During pre-training, representations of speech audio are learned by solving a contrastive task Lm, which requires identifying the true quantized latent speech representation for a masked time step within a set of distractors. This is augmented by a codebook diversity loss Ld:
L = Lm + α Ld, where α is a tuned hyperparameter.
- The cosine similarity sim(a, b) = aᵀb / (‖a‖ ‖b‖) is used in Lm, which contrasts the true quantized target qt for a masked time step t against K distractors q̃ sampled from other masked time steps of the same utterance:
Lm = −log [ exp(sim(ct, qt)/κ) / Σ_{q̃∈Qt} exp(sim(ct, q̃)/κ) ], where Qt contains qt and the K distractors, and κ is a temperature.
- Ld is designed to increase the use of the quantized codebook representations by maximizing the entropy of the averaged softmax distribution p̄g, encouraging equal use of the V entries in each of the G codebooks:
Ld = (1/GV) Σ_{g=1..G} −H(p̄g) = (1/GV) Σ_{g=1..G} Σ_{v=1..V} p̄_{g,v} log p̄_{g,v}.
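A sketch of both losses with the paper's settings (K = 100 distractors, temperature κ = 0.1, weight α = 0.1); here c and q stand for the contextualized outputs and quantized targets at the masked time steps of one utterance, and probs for the soft codebook probabilities used for p̄:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c, q, num_distractors=100, kappa=0.1):
    """Lm: identify the true quantized target among distractors drawn
    from other masked time steps of the same utterance."""
    T = c.size(0)
    losses = []
    for t in range(T):
        others = torch.tensor([i for i in range(T) if i != t])
        distractors = others[torch.randperm(others.numel())[:num_distractors]]
        candidates = torch.cat([q[t:t + 1], q[distractors]], dim=0)      # true target first
        sims = F.cosine_similarity(c[t:t + 1], candidates) / kappa       # sim(ct, q~) / kappa
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

def diversity_loss(probs):
    """Ld: encourage equal use of the V entries in each of the G codebooks
    by maximizing the entropy of the averaged softmax distribution p_bar."""
    G, V = probs.shape[-2], probs.shape[-1]
    p_bar = probs.reshape(-1, G, V).mean(dim=0)                          # average over time steps
    return (p_bar * torch.log(p_bar + 1e-7)).sum() / (G * V)             # (1/GV) * sum_g -H(p_bar_g)

c = torch.randn(120, 256)                                                # contextualized vectors at masked steps
q = torch.randn(120, 256)                                                # corresponding quantized targets
probs = torch.softmax(torch.randn(120, 2, 320), dim=-1)                  # soft codebook probabilities
alpha = 0.1
loss = contrastive_loss(c, q) + alpha * diversity_loss(probs)
print(loss.item())
```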
1.7. Fine-tuning
- Pre-trained models are fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network into C classes.
- For LibriSpeech, there are 29 tokens for character targets plus a word boundary token. Models are optimized by minimizing a Connectionist Temporal Classification (CTC) loss, and a modified version of SpecAugment is applied by masking time-steps and channels during training.
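A sketch of the fine-tuning head with torch.nn.CTCLoss; the class count below (29 character tokens + word boundary + CTC blank) is one possible way of counting the C output classes:

```python
import torch
import torch.nn as nn

d_model = 768
num_tokens = 31            # 29 character tokens + word boundary + CTC blank (one possible counting)
blank_id = 0

# Randomly initialized linear projection on top of the context network.
ctc_head = nn.Linear(d_model, num_tokens)
ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)

c = torch.randn(4, 49, d_model)                          # context network outputs: (batch, T, dim)
log_probs = ctc_head(c).log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, batch, classes)

targets = torch.randint(1, num_tokens, (4, 20))          # dummy character transcriptions
input_lengths = torch.full((4,), 49, dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # here only the projection has parameters; in the paper the Transformer is
                  # also fine-tuned while the feature encoder stays frozen
print(loss.item())
```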
1.8. Model Settings
- Specifically, the feature encoder contains 7 blocks, and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2).
- There are BASE and LARGE Transformer configurations: BASE contains 12 Transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads; LARGE contains 24 Transformer blocks with model dimension 1,024, inner dimension 4,096 and 16 attention heads.
- (There are many details for the model setting. Please kindly read the paper if interested.)
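For reference, the two Transformer configurations can be summarized in a small config sketch (the field names are illustrative, not fairseq's):

```python
from dataclasses import dataclass

@dataclass
class Wav2Vec2Config:
    blocks: int            # number of Transformer blocks
    model_dim: int         # model dimension
    ffn_dim: int           # inner (feed-forward) dimension
    heads: int             # attention heads

BASE = Wav2Vec2Config(blocks=12, model_dim=768, ffn_dim=3072, heads=8)
LARGE = Wav2Vec2Config(blocks=24, model_dim=1024, ffn_dim=4096, heads=16)
```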
2. Results
2.1. Low-Resource Labeled Data Evaluation on LibriSpeech
The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves a word error rate of 5.2/8.6 on the LibriSpeech clean/other test sets.
2.2. High-Resource Labeled Data Evaluation on LibriSpeech
The proposed approach, despite a weaker baseline architecture, achieves WER 1.8/3.3 on test-clean/other on the full LibriSpeech benchmark.
- Self-training is likely complementary to pre-training, and their combination may yield even better results.
2.3. Phoneme Recognition on TIMIT
wav2vec 2.0 achieves a new state of the art on this dataset, reducing PER by a relative 23%/29% over the next best result on the dev/test sets.
2.4. Ablations
The proposed strategy of continuous inputs with quantized targets (Baseline) performs best.