Brief Review — wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

wav2vec 2.0, Self-Supervised Learning of Speech

Sik-Ho Tsang
5 min read · Jun 25, 2024

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
wav2vec 2.0, by Facebook AI
2020 NeurIPS, Over 4500 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning for Speech
2019 [wav2vec]
==== My Other Paper Readings Are Also Over Here ====

  • wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
  • This is the first time it is shown that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler.

Outline

  1. wav2vec 2.0
  2. Results

1. wav2vec 2.0

1.1. Model Overview

wav2vec 2.0 framework (Figure 1 of the paper)
  • The model is composed of a multi-layer convolutional feature encoder f: X → Z which takes raw audio X as input and outputs latent speech representations z1, …, zT for T time-steps.
  • These are then fed to a Transformer g: Z → C to build representations c1, …, cT capturing information from the entire sequence.
  • The output of the feature encoder is also discretized to qt with a quantization module Z → Q to represent the targets (Figure 1) in the self-supervised objective, as in the toy sketch below.
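As a rough illustration of this data flow, here is a toy PyTorch sketch with stand-in modules (a single convolution, a small Transformer encoder, and a linear layer in place of the real quantizer); it only shows the shapes and is not the released fairseq implementation.

```python
import torch
import torch.nn as nn

# Toy sketch of the wav2vec 2.0 data flow with stand-in modules (not the fairseq code):
# f: X -> Z  convolutional feature encoder (here one conv with the encoder's total
#            stride of 320 samples and a ~25 ms receptive field),
# g: Z -> C  Transformer context network,
# Z -> Q     quantization module producing the self-supervised targets (linear stand-in).
feature_encoder = nn.Conv1d(1, 512, kernel_size=400, stride=320)
context_network = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
quantizer = nn.Linear(512, 256)

x = torch.randn(2, 1, 16000)             # raw audio X: (batch, channel, samples) at 16 kHz
z = feature_encoder(x).transpose(1, 2)   # latent speech reps z_1..z_T: (batch, T, 512)
q = quantizer(z)                         # targets q_t (the real model uses Gumbel codebooks)
c = context_network(z)                   # contextualized reps c_1..c_T: (batch, T, 512)
print(z.shape, q.shape, c.shape)         # T is about 49 for one second of audio
```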

1.2. Feature Encoder

  • The encoder consists of several blocks containing a temporal convolution followed by layer normalization and a GELU activation function.
  • The raw waveform input to the encoder is normalized to zero mean and unit variance. The total stride of the encoder determines the number of time-steps T which are input to the Transformer.
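A minimal sketch of one encoder block under these assumptions (LayerNorm applied over the channel dimension; the 512 channels and the first block's kernel 10 / stride 5 come from the model settings in Section 1.8, the rest is illustrative):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One feature-encoder block: temporal convolution -> LayerNorm -> GELU.
    A sketch following the description above, not the exact fairseq implementation."""
    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=stride)
        self.norm = nn.LayerNorm(out_ch)   # normalizes over the channel dimension
        self.act = nn.GELU()

    def forward(self, x):                  # x: (batch, in_ch, samples)
        x = self.conv(x)                   # (batch, out_ch, T')
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)
        return self.act(x)

# The raw waveform is normalized to zero mean and unit variance before the first block.
wav = torch.randn(1, 1, 16000)
wav = (wav - wav.mean()) / (wav.std() + 1e-5)
block = ConvBlock(1, 512, kernel=10, stride=5)
print(block(wav).shape)                    # the total stride of all blocks determines T
```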

1.3. Contextualized Representations with Transformers

  • The output of the feature encoder is fed to a context network which follows the Transformer architecture.
  • Instead of fixed positional embeddings that encode absolute positional information, a convolutional layer is used, which acts as a relative positional embedding.
  • The output of the convolution followed by a GELU is added to the inputs and then layer normalization is applied.
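A minimal sketch of this convolutional relative positional embedding, assuming the paper's kernel size of 128 with 16 groups (the module name and the padding-trim detail are ours):

```python
import torch
import torch.nn as nn

class ConvPositionalEmbedding(nn.Module):
    """Relative positional embedding via a grouped convolution over the encoder output.
    Sketch only; kernel=128 and groups=16 follow the paper's description."""
    def __init__(self, dim=768, kernel=128, groups=16):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=groups)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, z):                          # z: (batch, T, dim)
        pos = self.conv(z.transpose(1, 2))         # convolve over time
        pos = self.act(pos[..., :z.size(1)])       # trim padding so lengths match
        return self.norm(z + pos.transpose(1, 2))  # add to inputs, then LayerNorm

z = torch.randn(2, 100, 768)
print(ConvPositionalEmbedding()(z).shape)          # (2, 100, 768)
```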

1.4. Quantization Module

For self-supervised training, the output of the feature encoder z is discretized to a finite set of speech representations via product quantization.

  • Given G codebooks, or groups, with V entries e each, one entry is chosen from each codebook, the resulting vectors e1, …, eG are concatenated, and a linear transformation is applied to obtain q.
  • The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way. The straight-through estimator [26] is used, and G hard Gumbel softmax operations are set up.

The feature encoder output z is mapped to l ∈ R^(G×V) logits, and the probabilities for choosing the v-th codebook entry for group g are given by a Gumbel softmax over the logits of that group.
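Reconstructed in LaTeX from the paper, with τ the (annealed) Gumbel softmax temperature and n_k = −log(−log(u_k)) for u_k drawn uniformly from U(0, 1):

$$
p_{g,v} = \frac{\exp\big((l_{g,v} + n_v)/\tau\big)}{\sum_{k=1}^{V} \exp\big((l_{g,k} + n_k)/\tau\big)}
$$

A minimal PyTorch sketch of such a product quantizer using the straight-through Gumbel softmax; the sizes here (G = 2 codebooks, V = 320 entries, the entry and output dimensions) are illustrative assumptions rather than the exact released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    """Product quantization sketch: G codebooks with V entries each.
    Entry choice uses the straight-through (hard) Gumbel softmax."""
    def __init__(self, in_dim=512, G=2, V=320, entry_dim=128, out_dim=256):
        super().__init__()
        self.G, self.V = G, V
        self.to_logits = nn.Linear(in_dim, G * V)                # z -> l in R^{G x V}
        self.codebooks = nn.Parameter(torch.randn(G, V, entry_dim))
        self.proj = nn.Linear(G * entry_dim, out_dim)            # linear map to q

    def forward(self, z, tau=2.0):                               # z: (batch, T, in_dim)
        B, T, _ = z.shape
        logits = self.to_logits(z).view(B, T, self.G, self.V)
        onehot = F.gumbel_softmax(logits, tau=tau, hard=True)    # straight-through, hard
        # pick one entry per codebook and concatenate e_1..e_G
        entries = torch.einsum("btgv,gvd->btgd", onehot, self.codebooks)
        return self.proj(entries.reshape(B, T, -1))              # q: (batch, T, out_dim)

q = GumbelQuantizer()(torch.randn(2, 50, 512))
print(q.shape)  # (2, 50, 256)
```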

1.5. Training & Masking

To pre-train the model, a certain proportion of time steps in the latent feature encoder space is masked, similar to masked language modeling in BERT.

  • To mask the latent speech representations output by the encoder, a proportion p of all time steps is randomly sampled without replacement as starting indices, and the subsequent M consecutive time steps from every sampled index are masked; spans may overlap.
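A minimal sketch of this span sampling, assuming the paper's values p = 0.065 and M = 10 (the helper name is ours):

```python
import numpy as np

def sample_mask(T, p=0.065, M=10, rng=np.random.default_rng(0)):
    """Sample mask indices for one utterance of T time steps.
    A proportion p of steps are drawn without replacement as span starts,
    and each start masks the following M steps (spans may overlap)."""
    num_starts = int(p * T)
    starts = rng.choice(T, size=num_starts, replace=False)
    mask = np.zeros(T, dtype=bool)
    for s in starts:
        mask[s:s + M] = True            # clipped at the sequence end
    return mask

m = sample_mask(T=400)
print(m.sum(), "of", len(m), "time steps masked")   # roughly half with p=0.065, M=10
```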

1.6. Losses

  • During pre-training, representations of speech audio are learned by solving a contrastive task Lm, which requires identifying the true quantized latent speech representation for a masked time step within a set of distractors. This is augmented by a codebook diversity loss Ld.
  • The cosine similarity between context vectors and quantized representations is used in Lm.
  • Ld is designed to increase the use of the quantized codebook representations by maximizing the entropy of the averaged codebook distribution, encouraging equal use of the V entries in each of the G codebooks. Both losses are reconstructed below.
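Reconstructed in LaTeX from the paper, where α is a tuned weight, κ the temperature, Q_t the set containing q_t plus K distractors sampled from other masked time steps, and p̄_g the softmax distribution over codebook g averaged across utterances in a batch:

$$
\mathcal{L} = \mathcal{L}_m + \alpha\,\mathcal{L}_d
$$

$$
\mathcal{L}_m = -\log \frac{\exp\big(\operatorname{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\big)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\big(\operatorname{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\big)},
\qquad
\operatorname{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^{\top}\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
$$

$$
\mathcal{L}_d = \frac{1}{GV} \sum_{g=1}^{G} -H(\bar{p}_g) = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v}
$$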

1.7. Fine-tuning

  • Pre-trained models are fine-tuned for speech recognition by adding a randomly initialized linear projection on top of the context network into C classes representing the vocabulary of the task.
  • For LibriSpeech, there are 29 tokens for character targets plus a word boundary token. Models are optimized by minimizing a Connectionist Temporal Classification (CTC) loss, and a modified version of SpecAugment is applied during training by masking time-steps and channels.
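A minimal fine-tuning sketch with a randomly initialized projection and PyTorch's CTC loss; the dimensions, blank index, and dummy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Fine-tuning sketch: a linear projection on top of the (pre-trained) context network,
# trained with CTC. Vocabulary: 29 character tokens + a word boundary token, plus one
# extra class for the CTC blank (index 0 here).
context_dim, vocab = 768, 30                          # 29 characters + word boundary
proj = nn.Linear(context_dim, vocab + 1)              # +1 for the CTC blank
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

c = torch.randn(4, 200, context_dim)                  # contextualized reps from the model
log_probs = proj(c).log_softmax(-1).transpose(0, 1)   # CTC expects (T, batch, classes)
targets = torch.randint(1, vocab + 1, (4, 40))        # dummy character targets (1..30)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 40, dtype=torch.long))
print(loss.item())
```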

1.8. Model Settings

  • Specifically, the feature encoder contains 7 blocks, and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2).
  • There are BASE and LARGE Transformer configurations: BASE contains 12 Transformer blocks with model dimension 768, inner (FFN) dimension 3,072, and 8 attention heads; LARGE contains 24 Transformer blocks with model dimension 1,024, inner dimension 4,096, and 16 attention heads.
  • (There are many details for the model setting. Please kindly read the paper if interested.)
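For reference, the same numbers restated as a small config sketch (only what the bullets above state; all other hyperparameters of the paper's setup are omitted):

```python
# Encoder and Transformer settings as described above (illustrative restatement).
FEATURE_ENCODER = {
    "blocks": 7,
    "channels": 512,
    "strides": (5, 2, 2, 2, 2, 2, 2),      # total stride 320 -> ~20 ms per step at 16 kHz
    "kernel_widths": (10, 3, 3, 3, 3, 2, 2),
}

TRANSFORMER = {
    "BASE":  {"blocks": 12, "model_dim": 768,  "ffn_dim": 3072, "heads": 8},
    "LARGE": {"blocks": 24, "model_dim": 1024, "ffn_dim": 4096, "heads": 16},
}
```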

2. Results

2.1. Low-Resource Labeled Data Evaluation on LibriSpeech


The LARGE model pre-trained on LV-60k and fine-tuned on only 10 minutes of labeled data achieves a word error rate of 5.2/8.6 on the LibriSpeech clean/other test sets.

2.2. High-Resource Labeled Data Evaluation on LibriSpeech


The proposed approach, despite a weaker baseline architecture, achieves WER 1.8/3.3 on test-clean/other on the full LibriSpeech benchmark.

  • Self-training is likely complementary to pre-training, and their combination may yield even better results.

2.3. Phoneme Recognition on TIMIT


wav2vec 2.0 achieves a new state of the art on this dataset, reducing PER by a relative 23%/29% over the next best result on the dev/test sets.

2.4. Ablations


The proposed strategy of continuous inputs with quantized targets (Baseline) performs best.
