Brief Review — Attention-Based Models for Speech Recognition
Attention-based Recurrent Sequence Generator (ARSG)
Attention-Based Models for Speech Recognition
ARSG, by University of Wrocław, Jacobs University Bremen, Université de Montréal
2015 NIPS, Over 3000 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text Modeling
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====
- The attention mechanism from NLP, i.e. the Attention Decoder/RNNSearch in machine translation, is extended to speech recognition as an attention-based recurrent sequence generator (ARSG).
Outline
- ARSG
- Results
1. ARSG
1.1. ARSG
- (Indeed, the model is very close to the Attention Decoder/RNNSearch in the NLP domain, but here ARSG is applied to the speech domain.)
- The ARSG-based model is developed by starting from the content-based attention mechanism of the Attention Decoder/RNNSearch.
- In the context of this work, the output y is a sequence of phonemes, and the input x = (x1, …, xL′) is a sequence of feature vectors.
- x is often processed by an encoder which outputs a sequential input representation h = (h1, …, hL) more suitable for the attention mechanism to work with.
- In practice, networks are trained on 40 mel-scale filterbank features together with the energy in each frame, and first and second temporal differences, yielding in total 123 features per frame.
- Each feature was rescaled to have zero mean and unit variance over the training set.
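As a quick check of the arithmetic: 40 filterbank values plus the frame energy give 41 static features, and stacking them with their first and second temporal differences gives 41 × 3 = 123 features per frame. Below is a minimal NumPy sketch of this frontend, assuming the static log-mel + energy matrix is already computed; the function name and the use of np.gradient for the differences are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def add_deltas_and_normalize(static, mean=None, std=None):
    """static: (T, 41) array of log mel filterbanks + energy per frame.

    Returns a (T, 123) array: static features, first differences, and
    second differences, standardized to zero mean / unit variance.
    """
    delta = np.gradient(static, axis=0)          # first temporal difference
    delta2 = np.gradient(delta, axis=0)          # second temporal difference
    feats = np.concatenate([static, delta, delta2], axis=1)   # (T, 123)

    # In practice, mean/std are estimated on the training set and reused.
    if mean is None or std is None:
        mean, std = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mean) / (std + 1e-8)
```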
- At the i-th step an ARSG generates an output yi by focusing on the relevant elements of h:
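In LaTeX, per the paper's notation, this step reads roughly:

```latex
\begin{aligned}
\alpha_i &= \operatorname{Attend}(s_{i-1}, \alpha_{i-1}, h) \\
g_i     &= \textstyle\sum_{j=1}^{L} \alpha_{i,j}\, h_j \\
y_i     &\sim \operatorname{Generate}(s_{i-1}, g_i) \\
s_i     &= \operatorname{Recurrency}(s_{i-1}, g_i, y_i)
\end{aligned}
```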
- where si-1 is the (i-1)-th state of the RNN.
- The Recurrency can be an LSTM or a GRU.
- The Attend function is often implemented by scoring each element in h separately and normalizing the scores:
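In the content-based case, this normalization is a softmax over the scores:

```latex
\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'=1}^{L} \exp(e_{i,j'})}
```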
- where ei,j is:
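As in the content-based attention of RNNSearch, this is an MLP score of the previous decoder state and the encoded frame:

```latex
e_{i,j} = w^{\top} \tanh\left(W s_{i-1} + V h_j + b\right)
```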
- To make the model location-aware, k vectors fi,j are extracted for every position j of the previous alignment αi−1 by convolving it with a matrix F:
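That is, the previous alignment is treated as a 1-D signal and convolved with F:

```latex
f_i = F * \alpha_{i-1}
```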
- These additional vectors fi,j are then used by the scoring mechanism ei,j:
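The location-aware score adds a U fi,j term to the content-based score:

```latex
e_{i,j} = w^{\top} \tanh\left(W s_{i-1} + V h_j + U f_{i,j} + b\right)
```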
1.2. Further Improvements
- Further, β>1 is introduced to sharpen the weights:
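i.e., the scores are scaled by β inside the softmax:

```latex
\alpha_{i,j} = \frac{\exp(\beta e_{i,j})}{\sum_{j'=1}^{L} \exp(\beta e_{i,j'})}
```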
- A windowing technique is also proposed: at each step i, the attention mechanism considers only a subsequence of h instead of the full sequence, which lowers the computational complexity.
- Smoothing is also proposed, replacing the exponential in the softmax normalization with the sigmoid σ:
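```latex
\alpha_{i,j} = \frac{\sigma(e_{i,j})}{\sum_{j'=1}^{L} \sigma(e_{i,j'})}
```

Putting the pieces together, here is a minimal NumPy sketch of one hybrid (content + location) attention step, with optional sharpening (β > 1) and optional sigmoid smoothing. All function and parameter names, shapes, and the 'same' padding choice are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def attend_step(s_prev, h, alpha_prev, params, beta=1.0, smooth=False):
    """One hybrid (content + location) attention step.

    s_prev:     (n,)   previous decoder state s_{i-1}
    h:          (L, d) encoded input frames
    alpha_prev: (L,)   previous alignment alpha_{i-1}
    params:     W (m,n), V (m,d), U (m,k), F (k,r), w (m,), b (m,)
    """
    W, V, U = params["W"], params["V"], params["U"]
    F, w, b = params["F"], params["w"], params["b"]
    width = F.shape[1]                             # filter width r

    # Location features f_i = F * alpha_{i-1}: convolve the previous
    # alignment with F ('same' padding), one k-vector per position j.
    pad_l = width // 2
    a = np.pad(alpha_prev, (pad_l, width - 1 - pad_l))
    f = np.stack([F @ a[j:j + width] for j in range(len(alpha_prev))])  # (L, k)

    # Scores e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)
    e = np.tanh(s_prev @ W.T + h @ V.T + f @ U.T + b) @ w               # (L,)

    # Normalization: sharpened softmax (beta > 1) or sigmoid smoothing.
    if smooth:
        unnorm = 1.0 / (1.0 + np.exp(-e))          # sigma(e) instead of exp(e)
    else:
        unnorm = np.exp(beta * (e - e.max()))      # max-shift for stability
    alpha = unnorm / unnorm.sum()

    # Glimpse g_i = sum_j alpha_{i,j} h_j
    g = alpha @ h
    return alpha, g
```

Sweeping i over the output positions, sampling yi from Generate(si−1, gi) and updating si with the Recurrency, gives the full ARSG decoding loop of Section 1.1.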
2. Results
- With the convolutional (location) features, a 3.7% relative improvement over the baseline is observed, which increases to 5.9% when smoothing is added.