Brief Review — Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

Giant Conformer XL, Conformer XXL & Conformer XXL+

Sik-Ho Tsang
4 min read · Aug 11, 2024
By combining wav2vec 2.0 pre-training and Noisy Student Training (NST) with giant Conformer models, the approach in this paper obtains state-of-the-art performance on the LibriSpeech dev and test sets.

Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition
Conformer XL & Conformer XXL, by Google Research, Brain Team
2020 NeurIPS SAS Workshop, Over 340 Citations (Sik-Ho Tsang @ Medium)

Semi-Supervised Learning
2020
[Noisy Student Training (NST)]
==== My Other Paper Readings Are Also Over Here ====

Outline

  1. A Combination of Recent Developments in Semi-Supervised Learning for ASR
  2. Results

1. A Combination of Recent Developments in Semi-Supervised Learning for ASR

1.1. Scaling Up Conformer

Conformer Encoder & wav2vec 2.0 Self-Supervised Learning

Figure 2 (Left): The Conformer model is used, but with the relative positional embedding removed from the self-attention layer, which greatly speeds up training.
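
With the relative positional embedding removed, the attention module reduces to standard multi-head self-attention. Below is a minimal PyTorch sketch of such a pre-norm self-attention block; the model dimension, head count, and dropout rate are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: with the relative positional embedding removed, the
# Conformer's attention module reduces to standard multi-head self-attention.
# Dimensions, head count, and dropout are illustrative assumptions.
import torch
import torch.nn as nn

class SelfAttentionModule(nn.Module):
    """Pre-norm self-attention block without relative positional embedding."""
    def __init__(self, dim: int = 1024, heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        return x + self.drop(y)                    # residual connection

x = torch.randn(2, 50, 1024)                       # (batch, frames, dim)
print(SelfAttentionModule()(x).shape)              # torch.Size([2, 50, 1024])
```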

Giant Conformer Variants

The Conformer is scaled up into two larger variants, Conformer XL and Conformer XXL, which have 600M and 1B parameters, respectively.

  • Also, Conformer L has a 1-layer LSTM as its decoder, while XL and XXL have 2-layer LSTM decoders. All decoders have dimension 640.
  • A linear layer with Swish activation and batch-normalization is used as the projection block for these models (see the sketch after this list).
  • Conformer XXL+ is also introduced, which is obtained by adding an additional Conformer block and a stacking layer to the Conformer XXL.
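
To make the decoder-side details above concrete, here is a minimal PyTorch sketch of the Linear + Swish + batch-norm projection block feeding a 2-layer, 640-dim LSTM decoder. The encoder output dimension, the vocabulary size, and the direct encoder-to-decoder wiring are assumptions for illustration; the sketch is not the paper's exact encoder-decoder interface.

```python
# Minimal PyTorch sketch of the decoder-side components described above.
# The 640-dim, 2-layer LSTM decoder and the Linear + Swish + BatchNorm
# projection block follow the text; the encoder output dimension and
# vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn

ENC_DIM = 1024   # assumed encoder output dimension
DEC_DIM = 640    # decoder dimension stated in the review
VOCAB = 1024     # assumed output vocabulary size

class ProjectionBlock(nn.Module):
    """Linear layer with Swish activation and batch-normalization."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.swish = nn.SiLU()                      # Swish == SiLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time, out_dim)
        x = self.linear(x)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        return self.swish(x)

class LSTMDecoder(nn.Module):
    """2-layer LSTM decoder (Conformer XL/XXL); Conformer L uses 1 layer."""
    def __init__(self, num_layers: int = 2):
        super().__init__()
        self.proj = ProjectionBlock(ENC_DIM, DEC_DIM)
        self.lstm = nn.LSTM(DEC_DIM, DEC_DIM, num_layers=num_layers,
                            batch_first=True)
        self.out = nn.Linear(DEC_DIM, VOCAB)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        x = self.proj(enc_out)
        x, _ = self.lstm(x)
        return self.out(x)

# smoke test: a batch of 2 utterances, 50 encoder frames each
logits = LSTMDecoder()(torch.randn(2, 50, ENC_DIM))
print(logits.shape)                                 # torch.Size([2, 50, 1024])
```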

1.2. wav2vec 2.0 Pre-training

Figure 2 (Right): The Conformer encoder is pre-trained with wav2vec 2.0 on 60k hours of unlabeled audio from the “unlab-60k” subset of Libri-Light.

  • wav2vec 2.0 pre-training optimizes the contrastive loss between the context vectors from the masked positions and the target context vectors.
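
A minimal NumPy sketch of this contrastive objective (InfoNCE-style): each masked position's context vector must identify its own target vector among the targets at the other masked positions. The temperature, the choice of distractors, and the random inputs are illustrative assumptions.

```python
# Minimal NumPy sketch of a wav2vec 2.0-style contrastive (InfoNCE) loss:
# for each masked position, the context vector must identify its own target
# vector among distractors taken from the other masked positions.
# Temperature and distractor choice are illustrative assumptions.
import numpy as np

def contrastive_loss(context, targets, masked_idx, temperature=0.1):
    """context, targets: (T, D) arrays; masked_idx: indices of masked frames."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T
    c = context[masked_idx]            # (M, D) context vectors at masked frames
    t = targets[masked_idx]            # (M, D) target vectors at masked frames
    sim = cos(c, t) / temperature      # (M, M): diagonal = positives,
                                       # off-diagonal = distractors
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

T, D = 100, 16
ctx, tgt = np.random.randn(T, D), np.random.randn(T, D)
masked = np.random.choice(T, size=20, replace=False)
print(contrastive_loss(ctx, tgt, masked))
```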

1.3. Noisy Student Training (NST) with SpecAugment

  • The teacher-labeled data, after filtering and balancing, are then used to train the next generation ASR model.
  • The input data to the student model is augmented using adaptive SpecAugment (with adaptive time masking).
  • With the labeled LibriSpeech dataset S, the unlabeled Libri-Light dataset U, and an LM trained on the LibriSpeech LM corpus, the NST procedure is carried out with the following settings:
  • SpecAugment: 2 frequency masks with mask size parameter F = 27, and 10 time masks with maximum time-mask ratio pS = 0.05 (see the sketch after this list).
  • Batch-wise Mixing: For all generations, the ratio of supervised versus teacher-labeled data in each batch is fixed to 1:9 during training.
  • LM: An 8-layer 103M-parameter Transformer language model with relative positional embedding.
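
A minimal NumPy sketch of the SpecAugment policy listed above: 2 frequency masks with mask-size parameter F = 27 and 10 time masks whose widths are capped at pS = 0.05 of the utterance length. Masking with zeros and the log-mel input shape are assumptions, and adaptive time masking is reduced here to the length-dependent cap on mask width.

```python
# Minimal NumPy sketch of the SpecAugment settings above: 2 frequency masks
# (F = 27) and 10 time masks capped at p_S = 0.05 of the utterance length.
# Zero-filling the masks and the 80-mel input shape are assumptions.
import numpy as np

def spec_augment(feats, n_freq_masks=2, F=27, n_time_masks=10, p_s=0.05):
    """feats: (T, n_mels) log-mel spectrogram; returns a masked copy."""
    x = feats.copy()
    T, n_mels = x.shape
    for _ in range(n_freq_masks):                  # frequency masks
        f = np.random.randint(0, F + 1)
        f0 = np.random.randint(0, max(1, n_mels - f + 1))
        x[:, f0:f0 + f] = 0.0
    max_t = int(p_s * T)                           # length-dependent cap
    for _ in range(n_time_masks):                  # time masks
        t = np.random.randint(0, max_t + 1)
        t0 = np.random.randint(0, max(1, T - t + 1))
        x[t0:t0 + t, :] = 0.0
    return x

masked = spec_augment(np.random.randn(1000, 80))   # ~10 s utterance, 80 mels
```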

In the NST loop, the pre-trained checkpoints (400k steps) are fine-tuned with a global batch size of 1024/512 on 256/512 Google TPU v3 cores for 1–3 days for the XL/XXL models.
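
Putting the pieces together, below is a toy, runnable sketch of the NST loop with the fixed 1:9 supervised-to-teacher-labeled batch mixing. Every helper here (the dict standing in for a model, the placeholder transcription and no-op filtering) is hypothetical; only the loop structure, the mixing ratio, and fine-tuning from the pre-trained checkpoint at each generation follow the text.

```python
# Toy sketch of the NST loop: pre-train, fine-tune, pseudo-label with an
# LM-fused teacher, filter/balance, then train the next generation on
# batches mixing supervised and teacher-labeled data 1:9. All helpers are
# hypothetical stand-ins for the real pipeline.
import random

def pretrain(U):
    # stands in for wav2vec 2.0 pre-training on the unlabeled audio
    return {"init": "wav2vec2-checkpoint", "unlabeled_utts": len(U)}

def finetune(checkpoint, batches):
    # stands in for fine-tuning the pre-trained checkpoint on (mixed) batches
    return {**checkpoint, "finetuned_on_batches": len(batches)}

def transcribe_with_lm(model, lm, U):
    # teacher = model fused with the LM; pseudo-labels here are placeholders
    return [(utt, f"pseudo-label-{i}") for i, utt in enumerate(U)]

def filter_and_balance(pseudo):
    return pseudo  # confidence filtering / balancing omitted in this sketch

def mixed_batches(S, pseudo, n_batches=5, batch_size=100, sup_ratio=0.1):
    # each batch mixes supervised and teacher-labeled data at a fixed 1:9 ratio
    n_sup = int(batch_size * sup_ratio)
    return [random.sample(S, n_sup) +
            random.sample(pseudo, batch_size - n_sup)
            for _ in range(n_batches)]

def noisy_student_training(S, U, lm, generations=3):
    checkpoint = pretrain(U)
    model = finetune(checkpoint, [[x] for x in S])   # generation 0: supervised only
    for g in range(1, generations + 1):
        pseudo = filter_and_balance(transcribe_with_lm(model, lm, U))
        model = finetune(checkpoint, mixed_batches(S, pseudo))
        print(f"generation {g}: {len(S)} labeled + {len(pseudo)} pseudo-labeled utts")
    return model

S = [("librispeech", i) for i in range(1000)]        # labeled set
U = [("libri-light", i) for i in range(5000)]        # unlabeled set
noisy_student_training(S, U, lm="transformer-lm")
```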

2. Results

WER from LibriSpeech
  • The authors were unable to obtain gains from training Conformer XL and XXL from scratch.

The generation-3 Conformer XXL model shows a 7–15% relative improvement in WER across the dev and test sets compared to the pre-trained baseline.

Dev/Test Sets of LibriSpeech
  • The performance of the models at each NST generation, with Conformer XXL+ as the last-generation model, is plotted in Figure 3.
WER from LibriSpeech
  • Merely scaling up the model size from 100M to 1B parameters does not improve performance, as it is difficult to obtain gains from training the larger models on the supervised dataset alone.

With pre-training, however, consistent improvements are observed as the model size is increased up to 1 billion parameters. In other words, pre-training enables the growth in model size to translate into gains in model performance.

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.