Brief Review — Attention-Based Models for Speech Recognition

Attention-based Recurrent Sequence Generator (ARSG)

Sik-Ho Tsang
3 min read · Dec 31, 2023

Attention-Based Models for Speech Recognition, by University of Wrocław, Jacobs University Bremen, and Université de Montréal
2015 NIPS, Over 3000 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text Modeling
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====

  • The attention mechanism from NLP, i.e. the Attention Decoder/RNNSearch in machine translation, is extended to speech recognition as an attention-based recurrent sequence generator (ARSG).


  1. ARSG
  2. Results


1.1. ARSG

  • (Indeed, the model is very close to the Attention Decoder/RNNSearch from the NLP domain, but here ARSG is applied in the speech domain.)
  • The ARSG-based model is developed by starting from the content-based attention mechanism of Attention Decoder/RNNSearch.
  • In the context of this work, the output y is a sequence of phonemes, and the input x = (x1, …, xL′) is a sequence of feature vectors.
  • x is often processed by an encoder which outputs a sequential input representation h = (h1, …, hL) more suitable for the attention mechanism to work with.
  • In practice, networks are trained on 40 mel-scale filterbank features together with the energy in each frame, and first and second temporal differences, yielding in total 123 features per frame.
  • Each feature was rescaled to have zero mean and unit variance over the training set.
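The 123-dimensional frame representation follows directly from the counts above: 40 filterbank outputs plus energy gives 41 static features, and stacking them with their first and second temporal differences gives 41 × 3 = 123. A minimal NumPy sketch (the `delta` regression window and the function names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def delta(feats, n=2):
    """First-order temporal differences via the standard regression
    formula over +/- n neighbouring frames (edge-padded)."""
    T = feats.shape[0]
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, n + 1))
    out = np.zeros_like(feats, dtype=float)
    for i in range(1, n + 1):
        out += i * (padded[n + i:n + i + T] - padded[n - i:n - i + T])
    return out / denom

def asr_features(filterbank, energy):
    """41 static features (40 mel filterbanks + per-frame energy) plus
    first and second temporal differences -> 123 features per frame,
    each rescaled to zero mean and unit variance."""
    static = np.hstack([filterbank, energy])                        # (T, 41)
    feats = np.hstack([static, delta(static), delta(delta(static))])  # (T, 123)
    return (feats - feats.mean(axis=0)) / feats.std(axis=0)
```

In practice the normalization statistics would be computed once over the whole training set, not per utterance as in this sketch.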
  • At the i-th step, an ARSG generates an output yi by focusing on the relevant elements of h:

    αi = Attend(si−1, αi−1, h)
    gi = Σj αi,j hj
    yi ~ Generate(si−1, gi)

  • where si−1 is the (i−1)-th state of the RNN and αi are the attention weights (the alignment).
  • The new state is then computed as si = Recurrency(si−1, gi, yi), where Recurrency can be an LSTM or a GRU.
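One ARSG decoding step (attend → glimpse → generate → state update) can be sketched as follows; `score_fn`, `generate`, and `recurrency` are hypothetical placeholders standing in for the learned components:

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.exp(e - e.max())
    return z / z.sum()

def arsg_step(s_prev, h, score_fn, generate, recurrency):
    """One ARSG step i:
       e_j     = score(s_{i-1}, h_j)            # attend: score each h_j
       alpha_i = softmax(e)                     # normalised alignment
       g_i     = sum_j alpha_{i,j} h_j          # glimpse of the input
       y_i     = generate(s_{i-1}, g_i)         # emit output (e.g. phoneme)
       s_i     = recurrency(s_{i-1}, g_i, y_i)  # LSTM/GRU state update
    """
    e = np.array([score_fn(s_prev, hj) for hj in h])
    alpha = softmax(e)
    g = alpha @ h          # weighted sum of encoder states
    y = generate(s_prev, g)
    s = recurrency(s_prev, g, y)
    return y, s, alpha
```

The loop over output steps i would feed each new (y, s, alpha) back into the next call, with alpha also passed to the scorer once the model is made location-aware.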
  • And attend is often implemented by scoring each element in h separately and normalizing the scores:

    αi,j = exp(ei,j) / Σj′ exp(ei,j′)

  • where the content-based score ei,j is:

    ei,j = wᵀ tanh(W si−1 + V hj + b)

  • with w and b learnable vectors, and W and V learnable matrices.
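The content-based scoring and its softmax normalization can be sketched in NumPy (a minimal illustration; the shapes and the function name are assumptions):

```python
import numpy as np

def content_attend(s_prev, h, W, V, b, w):
    """Content-based attention:
       e_{i,j}     = w^T tanh(W s_{i-1} + V h_j + b)
       alpha_{i,j} = exp(e_{i,j}) / sum_j' exp(e_{i,j'})
    s_prev: (n,), h: (L, d), W: (m, n), V: (m, d), b: (m,), w: (m,)."""
    e = np.tanh(s_prev @ W.T + h @ V.T + b) @ w  # scores, shape (L,)
    z = np.exp(e - e.max())                      # stable softmax
    return e, z / z.sum()
```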
  • To make the model location-aware, k vectors fi,j are extracted for every position j of the previous alignment αi−1 by convolving it with a matrix F of k filters:

    fi = F ∗ αi−1

  • These additional vectors fi,j are then used by the scoring mechanism ei,j:

    ei,j = wᵀ tanh(W si−1 + V hj + U fi,j + b)
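A sketch of the location-aware variant, assuming F holds k one-dimensional filters of width r and U is a learnable matrix mapping the k location features into the scoring space (names and shapes are illustrative):

```python
import numpy as np

def location_aware_scores(s_prev, h, alpha_prev, F, W, V, U, b, w):
    """Location-aware scoring:
       f_i     = F * alpha_{i-1}   (1-D convolution, k filters of width r)
       e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)
    alpha_prev: (L,), F: (k, r), U: (m, k); other shapes as in the
    content-based case."""
    # k location feature vectors f_{i,j}, one per position j
    f = np.stack(
        [np.convolve(alpha_prev, F[m], mode="same") for m in range(F.shape[0])],
        axis=1,
    )  # (L, k)
    e = np.tanh(s_prev @ W.T + h @ V.T + f @ U.T + b) @ w  # (L,)
    return e
```

Convolving the previous alignment tells the scorer where it attended last step, which encourages the roughly monotonic left-to-right alignments expected in speech.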

1.2. Further Improvements

  • Further, an inverse temperature β > 1 is introduced to sharpen the weights:

    αi,j = exp(β ei,j) / Σj′ exp(β ei,j′)
  • A windowing technique is also proposed: at each step i, the attention mechanism considers only a subsequence of h instead of the full sequence, reducing the complexity.
  • Smoothing is also proposed by replacing the exponential in the softmax with the logistic sigmoid σ:

    αi,j = σ(ei,j) / Σj′ σ(ei,j′)
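Both normalization variants can be sketched side by side; β > 1 makes the alignment more peaked, while the sigmoid (which grows more slowly than the exponential) flattens it:

```python
import numpy as np

def sharpen(e, beta=2.0):
    """alpha_{i,j} = exp(beta*e_{i,j}) / sum_j' exp(beta*e_{i,j'});
    beta > 1 concentrates the weights on the top-scoring frames."""
    z = np.exp(beta * (e - e.max()))
    return z / z.sum()

def smooth(e):
    """alpha_{i,j} = sigmoid(e_{i,j}) / sum_j' sigmoid(e_{i,j'});
    the bounded sigmoid yields flatter, smoother weights."""
    s = 1.0 / (1.0 + np.exp(-e))
    return s / s.sum()
```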

2. Results

Phoneme error rates (PER)

With the convolutional (location-aware) features, a 3.7% relative improvement in PER over the baseline is observed, and a further 5.9% relative improvement with smoothing.


