Brief Review — SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
Data Augmentation on Log Mel Spectrogram for Speech Data
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,
SpecAugment, by Google Brain
2019 InterSpeech, Over 3400 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] [ARSG] 2016 [Listen, Attend and Spell (LAS)] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====
- SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients).
- The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps.
Outline
- SpecAugment
- Results
1. SpecAugment
An augmentation policy is proposed that acts on the log mel spectrogram directly.
- Motivated by the goal that these features should be robust to deformations in the time direction, partial loss of frequency information, and partial loss of small segments of speech, the authors chose the following deformations to make up a policy:
1.1. Time Warping
Given a log mel spectrogram with τ time steps, a random point along the horizontal line passing through the center of the image, within time steps (W, τ-W), is warped either to the left or to the right by a distance w chosen uniformly from 0 to the time warp parameter W along that line.
- In this paper, six anchor points are fixed on the boundary — the four corners and the mid-points of the vertical edges.
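The paper implements time warping with TensorFlow's sparse_image_warp; as a minimal sketch of the idea only, the NumPy function below (`time_warp` is a hypothetical helper, not the paper's code) applies a simplified piecewise-linear warp along the time axis:

```python
import numpy as np

def time_warp(spec: np.ndarray, W: int = 5, rng=None) -> np.ndarray:
    """Simplified time warp of a (mel channels x time) log mel spectrogram.

    Picks a point w0 in [W, tau - W) and shifts it by a distance w drawn
    uniformly from [0, W], left or right, interpolating the rest linearly.
    The paper instead uses TensorFlow's sparse_image_warp with six anchors.
    """
    rng = np.random.default_rng() if rng is None else rng
    v, tau = spec.shape
    if tau <= 2 * W:
        return spec.copy()                    # too short to warp
    w0 = int(rng.integers(W, tau - W))        # warp point on the center line
    w = int(rng.integers(0, W + 1))           # warp distance in [0, W]
    if rng.random() < 0.5:
        w = -w                                # warp left or right
    wt = int(np.clip(w0 + w, 1, tau - 2))     # keep the endpoints fixed
    src = np.array([0.0, w0, tau - 1.0])      # control points before the warp
    dst = np.array([0.0, wt, tau - 1.0])      # control points after the warp
    # Map each output frame back to a (fractional) input frame.
    new_t = np.interp(np.arange(tau), dst, src)
    warped = np.empty_like(spec)
    for ch in range(v):
        warped[ch] = np.interp(new_t, np.arange(tau), spec[ch])
    return warped
```

The endpoints stay fixed, so only the interior of the utterance is stretched or compressed.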
1.2. Frequency masking
Frequency masking is applied so that f consecutive mel frequency channels [f0, f0+f) are masked, where f is first chosen from a uniform distribution from 0 to the frequency mask parameter F, and f0 is chosen from [0, v-f). v is the number of mel frequency channels.
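As a sketch, frequency masking can be written in a few lines of NumPy (`freq_mask` is a hypothetical helper name; zero is used as the mask value here, whereas in practice the masked region is often set to the mean of the normalized spectrogram):

```python
import numpy as np

def freq_mask(spec: np.ndarray, F: int = 27, rng=None) -> np.ndarray:
    """Mask f consecutive mel channels [f0, f0 + f) of a (v x tau) spectrogram.

    f ~ Uniform(0, F) and f0 ~ Uniform(0, v - f), following the paper.
    Assumes F < v, the number of mel frequency channels.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = spec.shape[0]
    f = int(rng.integers(0, F + 1))           # mask width
    f0 = int(rng.integers(0, v - f))          # mask start channel
    out = spec.copy()
    out[f0:f0 + f, :] = 0.0                   # zero as the mask value
    return out
```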
1.3. Time masking
Time masking is applied so that t consecutive time steps [t0, t0+t) are masked, where t is first chosen from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ-t).
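Time masking is the same operation applied along the other axis; a minimal NumPy sketch (`time_mask` is again a hypothetical helper, using zero as the mask value):

```python
import numpy as np

def time_mask(spec: np.ndarray, T: int = 100, rng=None) -> np.ndarray:
    """Mask t consecutive time steps [t0, t0 + t) of a (v x tau) spectrogram.

    t ~ Uniform(0, T) and t0 ~ Uniform(0, tau - t), following the paper;
    t is capped below tau so short utterances are not fully masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = spec.shape[1]
    t = int(rng.integers(0, min(T, tau - 1) + 1))  # mask width, capped
    t0 = int(rng.integers(0, tau - t))             # mask start frame
    out = spec.copy()
    out[:, t0:t0 + t] = 0.0
    return out
```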
1.4. Augmentation Policy
A series of hand-crafted policies is designed, namely LibriSpeech basic (LB), LibriSpeech double (LD), Switchboard mild (SM) and Switchboard strong (SS), whose parameters are summarized in the table above.
- The above figure shows the effect of LB and LD.
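A policy chains these deformations, applying multiple frequency and time masks per utterance. The self-contained sketch below composes m_F frequency masks and m_T time masks (time warping omitted for brevity); the default parameter values are illustrative, not a reproduction of the paper's exact policy table:

```python
import numpy as np

def apply_policy(spec, F=27, m_F=2, T=100, m_T=2, rng=None):
    """Apply m_F frequency masks and m_T time masks to a (v x tau) spectrogram.

    Illustrative composition of the masking steps; time warping is omitted.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    v, tau = out.shape
    for _ in range(m_F):                           # frequency masks
        f = int(rng.integers(0, F + 1))
        f0 = int(rng.integers(0, max(1, v - f)))
        out[f0:f0 + f, :] = 0.0
    for _ in range(m_T):                           # time masks
        t = int(rng.integers(0, min(T, tau - 1) + 1))
        t0 = int(rng.integers(0, tau - t))
        out[:, t0:t0 + t] = 0.0
    return out
```

Because masks are drawn independently, they may overlap; the total masked area per axis is therefore at most, not exactly, m_F·F or m_T·T.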
1.5. Model
Listen, Attend and Spell (LAS) networks are used:
- The input log mel spectrogram is passed into a 2-layer Convolutional Neural Network (CNN) with max-pooling and stride of 2.
- The output of the CNN is passed through an encoder consisting of d stacked bi-directional LSTMs with cell size w to yield a series of attention vectors.
- The attention vectors are fed into a 2-layer RNN decoder of cell dimension w, which yields the tokens for the transcript.
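One practical consequence of the strided CNN front end: each stride-2 layer roughly halves the time axis, so the bi-LSTM encoder sees about a quarter of the input frames. A small bookkeeping helper (hypothetical, assuming 'same' padding so lengths round up):

```python
def encoder_time_steps(tau: int, n_conv: int = 2, stride: int = 2) -> int:
    """Time steps remaining after n_conv strided layers ('same' padding)."""
    for _ in range(n_conv):
        tau = (tau + stride - 1) // stride  # ceil division per strided layer
    return tau
```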
2. Results
2.1. LAS Models
The largest network, LAS-6-1280, trained with schedule L (training time: 24 days) and augmentation policy LD, is used to maximize performance.
2.2. SOTA Comparison
State-of-the-art performance is achieved by the LAS-6-1280 model, even without a language model. Incorporating an LM using shallow fusion further improves performance.