Review — wav2vec: Unsupervised Pre-training for Speech Recognition
Self-Supervised Learning for Wave-to-Vector Representation
wav2vec, by Facebook AI Research
2019 InterSpeech, Over 1100 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning
==== My Other Paper Readings Are Also Over Here ====
- wav2vec is proposed: it is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training.
- A simple multi-layer convolutional neural network (CNN) is used, which is optimized via a noise contrastive binary classification task.
- Later, wav2vec 2.0 was also proposed. (I may review it in the future.)
Outline
- wav2vec
- Results
1. wav2vec
1.1. Encoder Network
Given raw audio samples xi ∈ X, the encoder network f : X → Z, parameterized as a five-layer convolutional network, transforms xi into zi.
- The encoder layers have kernel sizes (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2).
- The output of the encoder is a low-frequency feature representation zi ∈ Z which encodes about 30 ms of 16 kHz audio, and the striding results in representations zi every 10 ms.
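To make the layer arithmetic concrete, here is a minimal PyTorch sketch of such an encoder (my own illustration under the kernel/stride/channel settings above, not the official fairseq implementation; causal padding is omitted for brevity). The total stride is 5·4·2·2·2 = 160 samples (10 ms at 16 kHz) and the receptive field works out to 465 samples ≈ 30 ms, matching the numbers above.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Sketch of the wav2vec encoder f: X -> Z (illustrative, not the official code).

    Five 1-D convolutions with kernel sizes (10, 8, 4, 4, 4) and strides
    (5, 4, 2, 2, 2); each layer has 512 channels, group normalization and ReLU
    (see Section 1.3). Total stride: 5*4*2*2*2 = 160 samples = 10 ms at 16 kHz;
    receptive field: 465 samples, i.e. about 30 ms.
    """

    def __init__(self, channels: int = 512):
        super().__init__()
        kernels, strides = (10, 8, 4, 4, 4), (5, 4, 2, 2, 2)
        layers, in_ch = [], 1  # raw waveform has a single input channel
        for k, s in zip(kernels, strides):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=k, stride=s),
                nn.GroupNorm(1, channels),  # single group: normalize over channels and time per sample
                nn.ReLU(),
            ]
            in_ch = channels
        self.conv = nn.Sequential(*layers)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) -> z: (batch, 512, frames), one frame per 10 ms
        return self.conv(waveform)
```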
1.2. Context Network
The context network g : Z → C is applied to the output of the encoder network to mix multiple latent representations zi, …, zi-v into a single contextualized tensor ci = g(zi, … , zi-v) for a receptive field size v.
- The context network has 9 layers with kernel size three and stride one.
- The total receptive field of the context network is about 210 ms.
1.3. Both Networks
- The layers in both the encoder and context networks consist of a causal convolution with 512 channels, a group normalization layer and a ReLU nonlinearity.
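Under the same assumptions, a minimal sketch of the aggregator g (again my own illustration, not the official code): nine convolutions with kernel size 3 and stride 1, each followed by group normalization and ReLU, made causal with left-only padding. Each layer adds two 10 ms frames of context, so the receptive field over the raw audio grows to roughly 30 ms + 18 × 10 ms ≈ 210 ms, as stated above.

```python
import torch
import torch.nn as nn

class ContextNetwork(nn.Module):
    """Sketch of the wav2vec aggregator g: Z -> C (illustrative, not the official code).

    Nine 1-D convolutions with kernel size 3 and stride 1, each followed by
    group normalization and ReLU. Left-only padding keeps the convolutions
    causal, so c_i depends only on z_i, ..., z_{i-v}.
    """

    def __init__(self, channels: int = 512, num_layers: int = 9, kernel_size: int = 3):
        super().__init__()
        blocks = []
        for _ in range(num_layers):
            blocks += [
                nn.ConstantPad1d((kernel_size - 1, 0), 0.0),  # causal (left-only) padding
                nn.Conv1d(channels, channels, kernel_size=kernel_size),
                nn.GroupNorm(1, channels),
                nn.ReLU(),
            ]
        self.conv = nn.Sequential(*blocks)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, 512, frames) -> c: (batch, 512, frames), ~210 ms receptive field
        return self.conv(z)
```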
1.4. wav2vec large
wav2vec large: For training on larger datasets, a model variant (“wav2vec large”) is considered with increased capacity, using 2 additional linear transformations in the encoder and a considerably larger context network comprised of 12 layers with increasing kernel sizes (2, 3, …, 13).
- Skip connections (as in ResNet) are introduced in the aggregator to help convergence in this case. The total receptive field in the last context network layer is thereby increased to about 810 ms.
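As a rough sanity check on that number (my own back-of-the-envelope arithmetic, assuming the ~30 ms window and 10 ms frame rate of the encoder output): with kernel sizes 2, 3, …, 13 and stride one, the larger aggregator adds

$$\sum_{k=2}^{13} (k-1) = 78 \ \text{frames of context}, \qquad 30\,\text{ms} + 78 \times 10\,\text{ms} = 810\,\text{ms}.$$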
1.5. Self-Supervised Learning (SSL) Objective
- The objective is inspired by the InfoNCE loss used in CPCv1.
The model is trained to distinguish a sample zi+k that is k steps in the future from distractor samples z̃ drawn from a proposal distribution pn, by minimizing the contrastive loss for each step k = 1, …, K:

$$\mathcal{L}_k = -\sum_{i=1}^{T-k} \Big( \log \sigma\big(z_{i+k}^{\top} h_k(c_i)\big) + \lambda\, \mathbb{E}_{\tilde{z} \sim p_n} \big[ \log \sigma\big(-\tilde{z}^{\top} h_k(c_i)\big) \big] \Big)$$

- where σ is the sigmoid function, and σ(zi+k⊤ hk(ci)) is the probability of zi+k being the true sample. hk(ci) is a step-specific affine transformation:

$$h_k(c_i) = W_k c_i + b_k$$

- The total loss is the sum of Lk over the different step sizes k:

$$\mathcal{L} = \sum_{k=1}^{K} \mathcal{L}_k$$
- In practice, the expectation is approximated by sampling 10 negative examples, uniformly choosing distractors from each audio sequence, i.e., pn(z) = 1/T, where T is the sequence length, and λ is set to the number of negatives.
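Below is a minimal PyTorch sketch of Lk under these choices (my own illustration, not the official implementation; `step_proj` plays the role of the affine map hk, and because the distractors are drawn uniformly from the same sequence, the λ-weighted expectation reduces to a sum over the sampled negatives).

```python
import torch
import torch.nn.functional as F

def wav2vec_step_loss(z, c, step_proj, k, num_negatives=10):
    """Contrastive loss L_k for one step size k (illustrative sketch).

    z: encoder outputs, shape (batch, frames, dim)
    c: aggregator outputs, shape (batch, frames, dim)
    step_proj: step-specific affine map h_k, e.g. nn.Linear(dim, dim)
    """
    batch, frames, dim = z.shape
    preds = step_proj(c[:, : frames - k])   # h_k(c_i) for i = 1 .. T-k
    targets = z[:, k:]                      # true future samples z_{i+k}

    # Positive term: -log sigmoid(z_{i+k}^T h_k(c_i))
    pos_loss = -F.logsigmoid((preds * targets).sum(dim=-1)).sum()

    # Negative term: distractors ~z drawn uniformly from the same sequence
    # (p_n(z) = 1/T); summing over lambda = num_negatives samples approximates
    # lambda * E[log sigmoid(-~z^T h_k(c_i))].
    neg_idx = torch.randint(0, frames, (batch, frames - k, num_negatives))
    negatives = z[torch.arange(batch)[:, None, None], neg_idx]  # (B, T-k, N, dim)
    neg_logits = (preds.unsqueeze(2) * negatives).sum(dim=-1)
    neg_loss = -F.logsigmoid(-neg_logits).sum()

    # The total loss sums this quantity over k = 1..K, each with its own h_k.
    return pos_loss + neg_loss
```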
After training, the representations ci produced by the context network are input to the acoustic model instead of log-mel filterbank features.
- (Later, wav2vec 2.0 is proposed with a much better SSL loss.)
1.6. Some Training Details
- For phoneme recognition on TIMIT (1993), the standard train, dev and test split is used where the training data contains just over 3 hours of audio data.
- Another dataset, Wall Street Journal (WSJ, 1993–1994) comprises about 81 hours of transcribed audio data. Models are trained on si284, validated on nov93dev and tested on nov92.
- Librispeech (2015) contains a total of 960 hours of clean and noisy speech for training. For pre-training, either the full 81 hours of the WSJ corpus, an 80-hour subset of clean Librispeech, the full 960-hour Librispeech training set, or a combination of all of them is used.
- To train the baseline acoustic model, 80 log-mel filterbank coefficients are computed over a 25 ms sliding window with a 10 ms stride, as in the sketch after this list. (Other baseline models are also set up for comparison.)
- Final models are evaluated in terms of both word error rate (WER) and letter error rate (LER).
- All acoustic models are trained on 8 NVIDIA V100 GPUs.
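For reference, such baseline features can be computed along these lines with torchaudio (a sketch under the stated settings of 80 mel bins, a 25 ms window and a 10 ms stride at 16 kHz; the exact baseline recipe in the paper may differ, and `utterance.wav` is a hypothetical input file).

```python
import torch
import torchaudio

# 25 ms window = 400 samples and 10 ms stride = 160 samples at 16 kHz
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    win_length=400,
    hop_length=160,
    n_mels=80,
)

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical input file
log_mel = torch.log(mel(waveform) + 1e-6)                 # (channels, 80, frames)
```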
1.7. Decoding
- A 4-gram KenLM language model (LM) is used for decoding the emissions from the acoustic model.
The word sequence y is decoded from the output of the context network c, or from log-mel filterbanks, using the beam search decoder of Collobert et al. (2019) by maximizing:

$$\max_{y}\; f_{\mathrm{AM}}(c, y) + \alpha \log p_{\mathrm{LM}}(y) + \beta |y| - \gamma \sum_{i=1}^{L} \big[\pi_i = \text{'|'}\big]$$

- where fAM is the acoustic model, pLM is the language model, and π = π1, …, πL are the characters of y. The hyper-parameters α, β and γ (tuned by random search) weight the language model, the word penalty, and the silence penalty, respectively.
- (There are still many details for self-supervised pretraining, training, and decoding. Please kindly read the paper directly if interested.)
2. Results
2.1. WSJ
Pre-training on more data leads to better accuracy on the WSJ benchmark.
- Pre-training on unlabeled audio data can improve over the best character-based approach, Deep Speech 2 (Amodei et al., 2016), by 0.67 WER on nov92.
- In comparison to Hadian et al. (2018), wav2vec performs as well as their phoneme-based model and “wav2vec large” outperforms it by 0.37 WER.
Pre-training reduces WER by 36% on nov92 when only about 8 hours of transcribed data is available.
2.2. TIMIT
wav2vec pre-training on Librispeech and WSJ audio data can lead to results matching the state of the art. Accuracy steadily increases with more data for pre-training and the best accuracy is achieved with the largest amount of data for pre-training.
Increasing the number of negative samples only helps up to 10 samples.
Table 4: When creating batches, sequences are cropped to a pre-defined maximum length. A crop size of 150k frames results in the best performance.
Table 5: Predicting more than 12 steps ahead in the future does not result in better performance and increasing the number of steps increases training time.