Brief Review — wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
wav2vec 2.0, Self-Supervised Learning of Speech
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
wav2vec 2.0, by Facebook AI
2020 NeurIPS, Over 4500 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning for Speech: 2019 [wav2vec]
- wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
- For the first time, it is shown that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.
Outline
- wav2vec 2.0
- Results
1. wav2vec 2.0
1.1. Model Overview
- The model is composed of a multi-layer convolutional feature encoder f: X→Z which takes as input raw audio X and outputs latent speech representations z1, …, zT for T time-steps.
- They are then fed to a Transformer g: Z→C to build representations c1, …, cT capturing information from the entire sequence.
- The output of the feature encoder is discretized to qt with a quantization module Z→Q to represent the targets (Figure 1) in the self-supervised objective.
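To make the data flow concrete, here is a minimal PyTorch sketch of the three mappings. The module internals are deliberately simplified stand-ins (a single strided convolution for the 7-block encoder f, a stock TransformerEncoder for g, and a plain linear layer for the quantization module), so only the shapes and the overall X→Z→C flow should be taken literally:

```python
import torch
import torch.nn as nn

# Stand-ins for f: X -> Z, g: Z -> C and the quantization Z -> Q.
d_model = 768                                                      # BASE model dimension

feature_encoder = nn.Conv1d(1, 512, kernel_size=400, stride=320)   # f: raw audio -> latent z (~20 ms frames)
project = nn.Linear(512, d_model)                                  # map latents to the Transformer dimension
context_network = nn.TransformerEncoder(                           # g: z -> contextualized c
    nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=3072, batch_first=True),
    num_layers=12)
quantizer = nn.Linear(512, 256)                                    # placeholder for the quantization module

x = torch.randn(1, 1, 16000)                   # 1 second of raw 16 kHz audio
z = feature_encoder(x).transpose(1, 2)         # (1, T, 512) latent speech representations, T = 49
c = context_network(project(z))                # (1, T, 768) representations of the entire sequence
q = quantizer(z)                               # (1, T, 256) quantized targets for the self-supervised objective
print(z.shape, c.shape, q.shape)
```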
1.2. Feature Encoder
- The encoder consists of several blocks containing a temporal convolution followed by layer normalization and a GELU activation function.
- The raw waveform input to the encoder is normalized to zero mean and unit variance. The total stride of the encoder determines the number of time-steps T which are input to the Transformer.
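A possible implementation of one encoder block and of the full stack, using the channel count, strides and kernel widths listed in Section 1.8; the actual fairseq implementation differs in details such as dropout and where normalization is applied:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Temporal convolution -> layer normalization -> GELU."""
    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride)
        self.norm = nn.LayerNorm(out_ch)
        self.act = nn.GELU()

    def forward(self, x):                       # x: (batch, channels, time)
        x = self.conv(x).transpose(1, 2)        # LayerNorm expects channels last
        return self.act(self.norm(x)).transpose(1, 2)

strides = (5, 2, 2, 2, 2, 2, 2)
kernels = (10, 3, 3, 3, 3, 2, 2)
blocks, in_ch = [], 1
for k, s in zip(kernels, strides):
    blocks.append(ConvBlock(in_ch, 512, k, s))
    in_ch = 512
feature_encoder = nn.Sequential(*blocks)

wav = torch.randn(1, 1, 16000)                  # 1 s of 16 kHz audio
wav = (wav - wav.mean()) / (wav.std() + 1e-7)   # normalize to zero mean and unit variance
z = feature_encoder(wav)                        # (1, 512, T); total stride 320 samples ≈ 20 ms per frame
print(z.shape[-1])                              # T = 49
```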
1.3. Contextualized Representations with Transformers
- The output of the feature encoder is fed to a context network which follows the Transformer architecture.
- Instead of fixed positional embeddings which encode absolute positional information, a convolutional layer is used which acts as a relative positional embedding.
- The output of the convolution followed by a GELU is added to the inputs and then layer normalization is applied.
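A rough sketch of this convolutional positional embedding followed by a stock Transformer encoder; the kernel size of 128 and the 16 groups are the values used in the released implementation, and the rest is simplified:

```python
import torch
import torch.nn as nn

class ConvPositionalEmbedding(nn.Module):
    """Convolution over time acting as a relative positional embedding."""
    def __init__(self, dim=768, kernel=128, groups=16):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=groups)
        self.act = nn.GELU()

    def forward(self, z):                       # z: (batch, T, dim)
        pos = self.conv(z.transpose(1, 2))      # convolve over the time axis
        pos = pos[..., :z.size(1)]              # trim the extra frame from even-sized padding
        return self.act(pos).transpose(1, 2)

dim = 768
pos_emb = ConvPositionalEmbedding(dim)
norm = nn.LayerNorm(dim)
context = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=3072, batch_first=True),
    num_layers=12)

z = torch.randn(2, 49, dim)                     # latent speech representations
c = context(norm(z + pos_emb(z)))               # add positional output, layer norm, then Transformer
print(c.shape)                                  # (2, 49, 768)
```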
1.4. Quantization Module
For self-supervised training, the output of the feature encoder z is discretized to a finite set of speech representations via product quantization.
- Given G codebooks, or groups, each with V entries e, one entry is chosen from each codebook; the resulting vectors e1, …, eG are concatenated, and a linear transformation is applied to obtain q.
- The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way. The straight-through estimator [26] is used and G hard Gumbel softmax operations are set up.
The feature encoder output z is mapped to logits l ∈ R^(G×V), and the probability of choosing the v-th codebook entry of group g is:
p_{g,v} = exp((l_{g,v} + n_v)/τ) / Σ_{k=1..V} exp((l_{g,k} + n_k)/τ),
where τ is a non-negative temperature and n = −log(−log(u)) with u sampled uniformly from U(0, 1).
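A minimal sketch of the Gumbel-softmax product quantizer, assuming the paper's choices of G = 2 codebooks with V = 320 entries each, entry size 128 and output dimension 256; torch.nn.functional.gumbel_softmax with hard=True gives the straight-through behaviour:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelProductQuantizer(nn.Module):
    """Product quantization of z with G codebooks of V entries each,
    selected via straight-through (hard) Gumbel softmax."""
    def __init__(self, in_dim=512, groups=2, entries=320, entry_dim=128, out_dim=256):
        super().__init__()
        self.groups, self.entries = groups, entries
        self.to_logits = nn.Linear(in_dim, groups * entries)          # z -> logits l in R^{G x V}
        self.codebooks = nn.Parameter(torch.randn(groups, entries, entry_dim))
        self.out = nn.Linear(groups * entry_dim, out_dim)             # linear map on the concatenated entries

    def forward(self, z, tau=2.0):                                    # the paper anneals tau during training
        b, t, _ = z.shape
        logits = self.to_logits(z).view(b, t, self.groups, self.entries)
        probs = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)  # one-hot forward, soft backward
        # Select one entry per group and concatenate e_1, ..., e_G.
        e = torch.einsum('btgv,gvd->btgd', probs, self.codebooks)
        return self.out(e.reshape(b, t, -1))                          # q: (batch, T, out_dim)

quantizer = GumbelProductQuantizer()
z = torch.randn(2, 49, 512)
q = quantizer(z)
print(q.shape)                                                        # (2, 49, 256)
```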
1.5. Training & Masking
To pre-train the model, a certain proportion of time steps in the latent feature encoder space is masked, similar to masked language modeling in BERT.
- To mask the latent speech representations output by the encoder, a certain proportion p of all time steps is randomly sampled without replacement to be starting indices and then the subsequent M consecutive time steps from every sampled index are masked.
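A sketch of this span masking with the paper's values p = 0.065 and M = 10; as in the paper, the masked positions of z are then replaced by a shared learned feature vector before entering the Transformer:

```python
import torch

def compute_mask(batch, seq_len, p=0.065, mask_len=10):
    """Sample starting indices without replacement, then mask M consecutive steps from each."""
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    num_starts = max(1, int(p * seq_len))
    for b in range(batch):
        starts = torch.randperm(seq_len)[:num_starts]          # sampling without replacement
        for s in starts.tolist():
            mask[b, s:s + mask_len] = True                      # spans may overlap or run off the end
    return mask

mask = compute_mask(batch=4, seq_len=500)
print(mask.float().mean().item())                               # roughly half of all time steps are masked

# Replace masked positions of z with a shared, learned feature vector.
masked_embed = torch.nn.Parameter(torch.randn(512))
z = torch.randn(4, 500, 512)
z_masked = torch.where(mask.unsqueeze(-1), masked_embed.expand_as(z), z)
```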
1.6. Losses
- During pre-training, representations of speech audio are learned by solving a contrastive task Lm, which requires identifying the true quantized latent speech representation for a masked time step within a set of distractors. This is augmented by a codebook diversity loss Ld:
L = Lm + α Ld, where α is a tuned hyperparameter.
- The cosine similarity sim(a, b) = aᵀb / (‖a‖ ‖b‖) is used in Lm, which contrasts the true quantized target qt for a masked time step t against K distractors q̃ sampled from other masked time steps of the same utterance:
Lm = −log [ exp(sim(ct, qt)/κ) / Σ_{q̃∈Qt} exp(sim(ct, q̃)/κ) ], where Qt contains qt and the K distractors, and κ is a temperature.
- Ld is designed to increase the use of the quantized codebook representations by maximizing the entropy of the averaged softmax distribution p̄g, encouraging equal use of the V entries in each of the G codebooks:
Ld = (1/GV) Σ_{g=1..G} −H(p̄g) = (1/GV) Σ_{g=1..G} Σ_{v=1..V} p̄_{g,v} log p̄_{g,v}.
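A sketch of both losses with the paper's settings (K = 100 distractors, temperature κ = 0.1, weight α = 0.1); here c and q stand for the contextualized outputs and quantized targets at the masked time steps of one utterance, and probs for the soft codebook probabilities used for p̄:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c, q, num_distractors=100, kappa=0.1):
    """Lm: identify the true quantized target among distractors drawn
    from other masked time steps of the same utterance."""
    T = c.size(0)
    losses = []
    for t in range(T):
        others = torch.tensor([i for i in range(T) if i != t])
        distractors = others[torch.randperm(others.numel())[:num_distractors]]
        candidates = torch.cat([q[t:t + 1], q[distractors]], dim=0)      # true target first
        sims = F.cosine_similarity(c[t:t + 1], candidates) / kappa       # sim(ct, q~) / kappa
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

def diversity_loss(probs):
    """Ld: encourage equal use of the V entries in each of the G codebooks
    by maximizing the entropy of the averaged softmax distribution p_bar."""
    G, V = probs.shape[-2], probs.shape[-1]
    p_bar = probs.reshape(-1, G, V).mean(dim=0)                          # average over time steps
    return (p_bar * torch.log(p_bar + 1e-7)).sum() / (G * V)             # (1/GV) * sum_g -H(p_bar_g)

c = torch.randn(120, 256)                                                # contextualized vectors at masked steps
q = torch.randn(120, 256)                                                # corresponding quantized targets
probs = torch.softmax(torch.randn(120, 2, 320), dim=-1)                  # soft codebook probabilities
alpha = 0.1
loss = contrastive_loss(c, q) + alpha * diversity_loss(probs)
print(loss.item())
```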
1.7. Fine-tuning
- Pre-trained models are fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network into C classes.
- For LibriSpeech, there are 29 tokens for character targets plus a word boundary token. Models are optimized by minimizing a Connectionist Temporal Classification (CTC) loss, and a modified version of SpecAugment is applied by masking time-steps and channels during training.
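A sketch of the fine-tuning head with torch.nn.CTCLoss; the class count below (29 character tokens + word boundary + CTC blank) is one possible way of counting the C output classes:

```python
import torch
import torch.nn as nn

d_model = 768
num_tokens = 31            # 29 character tokens + word boundary + CTC blank (one possible counting)
blank_id = 0

# Randomly initialized linear projection on top of the context network.
ctc_head = nn.Linear(d_model, num_tokens)
ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)

c = torch.randn(4, 49, d_model)                          # context network outputs: (batch, T, dim)
log_probs = ctc_head(c).log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, batch, classes)

targets = torch.randint(1, num_tokens, (4, 20))          # dummy character transcriptions
input_lengths = torch.full((4,), 49, dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # here only the projection has parameters; in the paper the Transformer is
                  # also fine-tuned while the feature encoder stays frozen
print(loss.item())
```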
1.8. Model Settings
- Specifically, the feature encoder contains 7 blocks, and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2).
- There are BASE and LARGE Transformer configurations: BASE contains 12 Transformer blocks, model dimension 768, inner dimension (FFN) 3,072 and 8 attention heads; LARGE contains 24 Transformer blocks with model dimension 1,024, inner dimension 4,096 and 16 attention heads.
- (There are many details for the model setting. Please kindly read the paper if interested.)
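For reference, the two Transformer configurations can be summarized in a small config sketch (the field names are illustrative, not fairseq's):

```python
from dataclasses import dataclass

@dataclass
class Wav2Vec2Config:
    blocks: int            # number of Transformer blocks
    model_dim: int         # model dimension
    ffn_dim: int           # inner (feed-forward) dimension
    heads: int             # attention heads

BASE = Wav2Vec2Config(blocks=12, model_dim=768, ffn_dim=3072, heads=8)
LARGE = Wav2Vec2Config(blocks=24, model_dim=1024, ffn_dim=4096, heads=16)
```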
2. Results
2.1. Low-Resource Labeled Data Evaluation on LibriSpeech
The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves a word error rate of 5.2/8.6 on the LibriSpeech clean/other test sets.
2.2. High-Resource Labeled Data Evaluation on LibriSpeech
The proposed approach, despite a weaker baseline architecture, achieves WER 1.8/3.3 on test-clean/other on the full LibriSpeech benchmark.
- Self-training is likely complementary to pre-training, and their combination may yield even better results.
2.3. Phoneme Recognition on TIMIT
wav2vec 2.0 achieves a new state of the art on this dataset, reducing PER by a relative 23%/29% over the next best result on the dev/test sets.
2.4. Ablations
The proposed strategy of continuous inputs with quantized targets (Baseline) performs best.