Brief Review — Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition
Listen, Attend and Spell (LAS)
Listen, Attend and Spell (LAS), by Carnegie Mellon University and Google Brain
2016 ICASSP, Over 2500 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text Modeling
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] [ARSG] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====
- For image captioning, there is a model called Show, Attend and Tell.
- In this paper, Listen, Attend and Spell (LAS) is proposed, an end-to-end model for speech recognition. The model has two components: a listener and a speller.
- The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs.
- The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters, and the entire acoustic sequence.
Outline
- Listen, Attend and Spell (LAS)
- Results
1. Listen, Attend and Spell (LAS)
- LAS models each character output y_i as a conditional distribution over the previous characters y_{<i} and the input signal x, using the chain rule of probability:
P(y|x) = ∏_i P(y_i | x, y_{<i})
- The Listen operation transforms the original signal x into a high-level representation h = (h_1, …, h_U) with U ≤ T, where T is the length of the input x.
- The speller is an attention-based character decoder that performs an operation called AttendAndSpell.
1.1. Listener
In practice, 40-dimensional log-mel filter bank features are computed every 10ms, which act as the acoustic inputs to the listener.
- The Listen operation uses a Bidirectional Long Short-Term Memory RNN (BLSTM) [15, 16, 2] with a pyramidal structure, the pyramidal BLSTM (pBLSTM).
- In a typical deep BLSTM architecture, the output at the i-th time step from the j-th layer is:
h_i^j = BLSTM(h_{i-1}^j, h_i^{j-1})
- In the pBLSTM model, the outputs at consecutive steps of each layer are concatenated before being fed to the next layer:
h_i^j = pBLSTM(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}])
In the proposed model, 3 pBLSTM layers are stacked on top of the bottom BLSTM layer to reduce the time resolution by 2³ = 8 times, as sketched below.
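To make the pyramidal step concrete, here is a minimal sketch of one pBLSTM layer in PyTorch. The class name, dimensions, and framework choice are my own assumptions for illustration, not from the paper:

```python
import torch
import torch.nn as nn

class PBLSTM(nn.Module):
    """One pyramidal BLSTM layer (sketch): pairs of consecutive frames
    are concatenated before the BLSTM, halving the time resolution."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # Input size doubles because two consecutive frames are concatenated.
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim); drop the last frame if time is odd.
        b, t, d = x.shape
        if t % 2 == 1:
            x, t = x[:, :-1, :], t - 1
        # Concatenate frame pairs: (batch, time // 2, 2 * input_dim).
        x = x.reshape(b, t // 2, 2 * d)
        out, _ = self.blstm(x)   # out: (batch, time // 2, 2 * hidden_dim)
        return out
```

Stacking one plain BLSTM followed by three such layers halves the time axis three times, giving the 8× reduction described above.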
1.2. Attend and Spell
- The distribution for y_i is a function of the decoder state s_i and context c_i. The decoder state s_i is a function of the previous state s_{i-1}, the previously emitted character y_{i-1} and context c_{i-1}. The context vector c_i is produced by an attention mechanism. Specifically:
c_i = AttentionContext(s_i, h)
s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})
P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)
- where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2-layer LSTM.
- At each time step i, the attention mechanism AttentionContext generates a context vector c_i encapsulating the information in the acoustic signal needed to generate the next character. Concretely, a scalar energy e_{i,u} is computed for each listener feature h_u using MLPs φ and ψ, the energies are converted into attention weights by a softmax, and the context is their weighted sum of the features:
e_{i,u} = <φ(s_i), ψ(h_u)>
α_{i,u} = exp(e_{i,u}) / Σ_{u'} exp(e_{i,u'})
c_i = Σ_u α_{i,u} h_u
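As a rough illustration of this content-based attention, here is a sketch under my own naming assumptions; φ and ψ are reduced to single linear layers here rather than full MLPs:

```python
import torch
import torch.nn as nn

class AttentionContext(nn.Module):
    """Content-based attention (sketch): scalar energies between the
    transformed decoder state and listener features, softmaxed over time."""
    def __init__(self, state_dim: int, feat_dim: int, att_dim: int):
        super().__init__()
        self.phi = nn.Linear(state_dim, att_dim)  # transforms decoder state s_i
        self.psi = nn.Linear(feat_dim, att_dim)   # transforms listener features h_u

    def forward(self, s: torch.Tensor, h: torch.Tensor):
        # s: (batch, state_dim); h: (batch, U, feat_dim)
        e = torch.bmm(self.psi(h), self.phi(s).unsqueeze(2)).squeeze(2)  # (batch, U)
        alpha = torch.softmax(e, dim=1)                  # attention weights α_{i,u}
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # context c_i: (batch, feat_dim)
        return c, alpha
```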
1.3. Training
- The model is trained to maximize the log probability of the correct sequences:
max_θ Σ_i log P(y_i | x, y*_{<i}; θ)
- where y*_{<i} are the ground-truth previous characters.
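With teacher forcing, this objective reduces to a per-character cross-entropy. A minimal sketch follows; the function name, tensor shapes, and padding convention are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def las_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Negative log likelihood of the correct character sequence,
    i.e. -Σ_i log P(y_i | x, y*_{<i}) averaged over non-padding steps."""
    # logits: (batch, steps, vocab) from CharacterDistribution;
    # targets: (batch, steps) ground-truth character ids.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and step dims
        targets.reshape(-1),
        ignore_index=pad_id,                  # skip padded positions
    )
```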
1.4. Decoding and Rescoring
- During inference, the most likely character sequence given the input acoustics is found:
ŷ = argmax_y log P(y | x)
- A simple left-to-right beam search is used. The beams are then rescored by combining the LAS probability with a language model probability:
s(y|x) = log P(y|x) / |y|_c + λ log P_LM(y)
- The LAS probability is normalized by the number of characters |y|_c, since the model has a small bias toward shorter utterances. A sketch of this rescoring step follows below.
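A minimal sketch of this rescoring rule in plain Python; the function name and data layout are my own assumptions:

```python
def rescore_beams(beams, lm_log_probs, lam=0.008):
    """beams: list of (text, las_log_prob) hypotheses from beam search;
    lm_log_probs: matching list of log P_LM(y). Returns beams sorted by
    the combined score s(y|x) = log P(y|x) / |y|_c + λ * log P_LM(y)."""
    scored = [
        (text, las_lp / max(len(text), 1) + lam * lm_lp)
        for (text, las_lp), lm_lp in zip(beams, lm_log_probs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```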
2. Results
2.1. Dataset
- A dataset with 3 million Google Voice Search utterances (representing 2000 hours of data) is used. Approximately 10 hours of utterances were randomly selected as a held-out validation set.
- Data augmentation was performed using a room simulator, adding different types of noise and reverberations.
- A separate set of 22K utterances, representing approximately 16 hours of data, was used as the test set. A noisy test set was also created using the same corruption strategy.
2.2. Word Error Rate (WER)
- The state-of-the-art model on this dataset is a CLDNN-HMM system that was described in [22]. The CLDNN system achieves a WER of 8.0% on the clean test set and 8.9% on the noisy test set.
LAS achieved 14.1% WER on the clean test set and 16.5% WER on the noisy test set without any dictionary or language model.
Rescoring with the language model used by the CLDNN system, with weight λ = 0.008, improves the WER on the clean and noisy test sets to 10.3% and 12.0%, respectively.
2.3. Alignment
The attention model was also able to identify the start and end of the utterance properly.