Brief Review — Improved Noisy Student Training for Automatic Speech Recognition
Noisy Student Training (NST), Successful for Image Classification, Is Now Proposed for ASR
Improved Noisy Student Training for Automatic Speech Recognition
Noisy Student Training (NST), by Google Inc.
2020 InterSpeech, Over 240 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning
==== My Other Paper Readings Are Also Over Here ====
- Noisy Student has been successfully used on ImageNet for image classification.
- In this paper, Noisy Student Training (NST) is proposed for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method, along with effective methods to filter and balance the data generated between self-training iterations.
Outline
- Noisy Student Training (NST)
- Results
1. Noisy Student Training (NST)
- (For a quick read, read Section 1.1 then Section 2.)
1.1. Semi-Supervised Learning Overall Workflow
- The NST algorithm assumes a labeled set S, an unlabeled set U, and a fixed LM trained on a separate text corpus.
- Then, NST generates a series of ASR models as follows (a code sketch of this loop is given right after the list):
1. Train M0 on S with SpecAugment. Set M = M0.
2. Fuse M with the LM and measure performance.
3. Generate the labeled dataset M(U) with the fused model.
4. Filter the generated data M(U) to obtain f(M(U)).
5. Balance the filtered data f(M(U)) to obtain b·f(M(U)).
6. Mix the dataset b·f(M(U)) with S. Use the mixed dataset to train a new model M' with SpecAugment.
7. Set M = M' and go to step 2.
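- Below is a minimal Python sketch of this loop; the callables (train, fuse, transcribe, filter_fn, balance) are placeholders standing in for steps 1 to 7, not APIs from the paper.

```python
from typing import Callable, Sequence

def noisy_student_training(
    S: Sequence,                     # labeled set
    U: Sequence,                     # unlabeled set
    train: Callable,                 # trains a model with SpecAugment (steps 1, 7)
    fuse: Callable,                  # shallow-fuses a model with the fixed LM (step 2)
    transcribe: Callable,            # labels U with the fused model (step 3)
    filter_fn: Callable,             # score-based filtering f(.) (step 4)
    balance: Callable,               # token-distribution balancing b(.) (step 5)
    cutoffs: Sequence[float],        # gradational filtering cutoff per generation
):
    model = train(S)                             # 1. train M0 on S, set M = M0
    for cutoff in cutoffs:
        fused = fuse(model)                      # 2. fuse M with LM
        labeled_U = transcribe(fused, U)         # 3. generate M(U)
        filtered = filter_fn(labeled_U, cutoff)  # 4. filter to f(M(U))
        balanced = balance(filtered)             # 5. balance to b·f(M(U))
        mixed = list(S) + list(balanced)         # 6. mix with supervised set
        model = train(mixed)                     # 7. train student M', set M = M'
    return model
```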
1.2. SpecAugment
- SpecAugment with adaptive time masking is used to augment the input data at each generation of NST.
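- As an illustration, the sketch below masks time spans whose count and width scale with the utterance length T, which is the essence of adaptive time masking; the parameter names and values are assumptions, not the paper's exact settings.

```python
import numpy as np

def adaptive_time_mask(spec: np.ndarray, p_mult: float = 0.04,
                       p_size: float = 0.04) -> np.ndarray:
    """spec: (T, F) log-mel spectrogram. Both the number of time masks and
    their width grow with T, so long utterances are masked more heavily."""
    T = spec.shape[0]
    num_masks = int(p_mult * T)            # mask count scales with length
    mask_size = max(1, int(p_size * T))    # mask width scales with length
    out = spec.copy()
    for _ in range(num_masks):
        t0 = np.random.randint(0, max(1, T - mask_size))
        out[t0:t0 + mask_size, :] = 0.0    # zero out the masked time span
    return out
```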
1.3. Language Model (LM) Fusion
- The teacher networks are shallow-fused with an LM trained on a fixed text corpus. For LAS models, a coverage penalty term with parameter c is added to the fusion score.
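- Conceptually, each beam-search hypothesis is ranked by a score of the form below; lam is the LM-fusion weight, c the coverage weight, and the example values are illustrative, not the paper's tuned settings.

```python
def shallow_fusion_score(log_p_asr: float, log_p_lm: float,
                         coverage: float, lam: float, c: float) -> float:
    """Fusion score for one hypothesis: ASR log-prob, plus the LM log-prob
    weighted by lam, plus a coverage penalty weighted by c (LAS models)."""
    return log_p_asr + lam * log_p_lm + c * coverage

# e.g. pick the best hypothesis from a beam of (log_p_asr, log_p_lm, coverage):
# best = max(beam, key=lambda h: shallow_fusion_score(*h, lam=0.5, c=1.0))
```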
1.4. Filtering
- It is unclear what to use to judge the quality of the transcripts generated by a teacher network. Filtering is needed.
- A filtering score is defined as a function of the shallow-fusion score s and the token length ℓ of a transcript generated by the fused teacher model. The normalized filtering score is obtained by normalizing s against statistics fitted on the dev set: s̃(x) = (s(x) − μ(ℓ)) / σ(ℓ), where μ(ℓ) and σ(ℓ) are the mean and standard deviation of the score modeled as functions of ℓ.
- Using this score, the transcript-utterance pairs generated by the trained models are filtered in a gradational manner, i.e., the filtering cutoff is lowered as the self-training cycle is iterated.
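- The sketch below shows one plausible way to compute the normalized score and apply a generation-dependent cutoff; the linear fits are an assumption about the fitting details.

```python
import numpy as np

def fit_score_stats(dev_scores, dev_lengths):
    """Fit mean and spread of the fusion score as linear functions of token
    length l on a dev set (illustrative realization of the normalization)."""
    dev_scores, dev_lengths = np.asarray(dev_scores), np.asarray(dev_lengths)
    mu = np.polyfit(dev_lengths, dev_scores, deg=1)        # mu(l) = mu1*l + mu0
    resid = np.abs(dev_scores - np.polyval(mu, dev_lengths))
    sigma = np.polyfit(dev_lengths, resid, deg=1)          # rough sigma(l) fit
    return mu, sigma

def normalized_score(score, length, mu, sigma):
    return (score - np.polyval(mu, length)) / max(np.polyval(sigma, length), 1e-6)

def gradational_filter(scored_pairs, cutoff):
    """scored_pairs: (utterance, transcript, normalized_score) triples.
    The cutoff is lowered at each NST generation (e.g. 1, 0.5, 0, -1, -inf)."""
    return [p for p in scored_pairs if p[2] > cutoff]
```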
1.5. Balancing
- The distribution of tokens in the transcripts of the generated set f(M(U)) can differ significantly from that of the supervised training set.
- A sampling method is used that samples (with replacement) a set of sentences from a sentence pool so that the token distribution of the sampled set is close to a target distribution. This is done by optimizing the KL divergence between the token distributions of the sampled set and the target distribution in a greedy way.
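- A naive sketch of this greedy procedure, purely illustrative (it re-scans the whole pool per sample, so it is O(samples x pool)):

```python
import numpy as np

def greedy_balance(pool, target_dist, vocab_size, num_samples):
    """Sample token-id lists (with replacement) from `pool` so the sampled
    set's token distribution approaches `target_dist`, greedily minimizing
    KL(target || sampled) one sentence at a time."""
    counts = np.zeros(vocab_size)
    picked = []
    for _ in range(num_samples):
        best_i, best_kl = 0, np.inf
        for i, toks in enumerate(pool):
            trial = counts.copy()
            for t in toks:
                trial[t] += 1.0
            p = trial / trial.sum()
            # smoothed KL divergence to avoid log(0)
            kl = np.sum(target_dist * np.log((target_dist + 1e-12) / (p + 1e-12)))
            if kl < best_kl:
                best_i, best_kl = i, kl
        picked.append(pool[best_i])      # greedily commit the best sentence
        for t in pool[best_i]:
            counts[t] += 1.0
    return picked
```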
1.6. Mixing
- In batch-wise mixing, the ratio between supervised and semi-supervised samples is fixed in each training batch. In non-batch-wise mixing, data is uniformly sampled from both datasets.
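- A minimal sketch of batch-wise mixing, assuming a fixed supervised fraction per batch (the function name and the 2:8 example are illustrative; non-batch-wise mixing would instead sample uniformly from the concatenation of both datasets):

```python
import random

def batchwise_mixed_batch(supervised, semi_supervised, batch_size, sup_frac):
    """Each batch holds a fixed supervised share, e.g. sup_frac=0.2 for a
    2:8 supervised-to-semi-supervised mix; sampling is with replacement."""
    n_sup = int(batch_size * sup_frac)
    batch = random.choices(supervised, k=n_sup)
    batch += random.choices(semi_supervised, k=batch_size - n_sup)
    random.shuffle(batch)                # interleave the two sources
    return batch
```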
2. Results
2.1. LibriSpeech 100–860
- LibriSpeech 100–860 is a semi-supervised task where the clean 100h subset of LibriSpeech is taken to be the supervised set, while the remaining 860h of audio is taken to be the unlabeled set.
- The unlabeled audio consists of 360h of clean data and 500h of noisy data.
- LAS-6-1280 is used as the acoustic model. A 3-layer LSTM LM with width 4096, trained on the LibriSpeech LM corpus, is used for fusion.
- Gradational filtering is used from generation 1 to 5 with cutoffs 1, 0.5, 0, -1 and -inf.
- As in Figure 1 (left), the best trained model is the generation-4 model. Table 1 compares this model with other SOTA models.
2.2. LibriSpeech-LibriLight
- The entire LibriSpeech training set is used as the supervised training set of this task, while the "unlab-60k" subset of LibriLight, an unlabeled audio dataset derived from audiobooks, is used as the unlabeled set.
- 6 generations (iterations) of models, numbered 0 to 5, are trained. The portion of semi-supervised data is raised over generations: the supervised-to-semi-supervised ratio goes from 4:6 at generations 1 and 2, to 3:7 at generation 3, and to 2:8 at generation 4.
- Again, as in Figure 1 (right), the best trained model is the generation-4 model.
- (There are ablation experiments, please read the paper if interested.)