Brief Review — Attention-Based Models for Speech Recognition

Attention-based Recurrent Sequence Generator (ARSG)

Sik-Ho Tsang
Dec 31, 2023

Attention-Based Models for Speech Recognition
ARSG, by University of Wrocław, Jacobs University Bremen, and Université de Montréal
2015 NIPS, Over 3000 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text Modeling
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====

  • The attention mechanism from NLP, i.e. the Attention Decoder/RNNSearch in machine translation, is extended to speech recognition as an attention-based recurrent sequence generator (ARSG).

Outline

  1. ARSG
  2. Results

1. ARSG

1.1. ARSG

ARSG
  • (Indeed, the model is very close to the Attention Decoder/RNNSearch in the NLP domain, but here ARSG is applied to the speech domain.)
  • The ARSG-based model is developed starting from the content-based attention mechanism of Attention Decoder/RNNSearch.
  • In the context of this work, the output y is a sequence of phonemes, and the input x = (x1, …, xL′) is a sequence of feature vectors.
  • x is often processed by an encoder which outputs a sequential input representation h = (h1, …, hL) more suitable for the attention mechanism to work with.
  • In practice, networks are trained on 40 mel-scale filterbank features together with the energy in each frame, plus their first and second temporal differences, yielding in total 123 features per frame (see the feature-extraction sketch after this list).
  • Each feature is rescaled to have zero mean and unit variance over the training set.
  • At the i-th step, an ARSG generates an output yi by focusing on the relevant elements of h:

    αi = Attend(si-1, αi-1, h)
    gi = Σj αi,j hj
    yi ~ Generate(si-1, gi)

  • where si-1 is the (i-1)-th state of the RNN, αi is the alignment (a vector of attention weights over h), and gi is the glimpse.
  • The state is then updated as si = Recurrency(si-1, gi, yi), where Recurrency can be an LSTM or a GRU.
  • Attend is often implemented by scoring each element in h separately and normalizing the scores:

    αi,j = exp(ei,j) / Σj exp(ei,j)

  • where, for content-based attention, the score ei,j is:

    ei,j = w^T tanh(W si-1 + V hj + b)

  • To make the model location-aware, k vectors fi,j are extracted for every position j of the previous alignment αi-1 by convolving it with a matrix F:

    fi = F ∗ αi-1

  • These additional vectors fi,j are then used by the scoring mechanism ei,j, as sketched below:

    ei,j = w^T tanh(W si-1 + V hj + U fi,j + b)
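To make the location-aware step concrete, here is a minimal NumPy sketch of a single attention step. The parameter names (W, V, U, F, w, b) follow the equations above, but all dimensions and the use of a 1-D "same" convolution are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def location_aware_attention(s_prev, h, alpha_prev, W, V, U, F, w, b):
    """One attention step: returns the new alignment alpha_i and glimpse g_i.

    s_prev:     (n,)    previous decoder state s_{i-1}
    h:          (L, m)  encoded input sequence h_1..h_L
    alpha_prev: (L,)    previous alignment alpha_{i-1}
    W: (d, n), V: (d, m), U: (d, k), F: (k, r), w: (d,), b: (d,)
    """
    k, r = F.shape

    # f_i = F * alpha_{i-1}: convolve the previous alignment with each
    # of the k filters, keeping the output length equal to L.
    f = np.stack([np.convolve(alpha_prev, F[j], mode="same")
                  for j in range(k)], axis=1)               # (L, k)

    # e_{i,j} = w^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)
    e = np.tanh(s_prev @ W.T + h @ V.T + f @ U.T + b) @ w   # (L,)

    # alpha_{i,j} = exp(e_{i,j}) / sum_j exp(e_{i,j})
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()

    # g_i = sum_j alpha_{i,j} h_j
    g = alpha @ h
    return alpha, g
```

The max-subtraction inside the softmax is a standard numerical-stability trick and does not change the resulting weights.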
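And, as noted above, one plausible way to compute the 123 features per frame ((40 filterbanks + 1 energy) × 3 for statics, Δ, and ΔΔ) with librosa. The sample rate, frame length, and hop are assumed values typical for ASR, not taken from the paper.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """40 log-mel filterbanks + energy, with deltas and delta-deltas:
    (40 + 1) * 3 = 123 features per frame."""
    y, sr = librosa.load(wav_path, sr=16000)

    # 40 mel-scale filterbank energies per 25 ms frame with a 10 ms hop
    # (assumed frame settings).
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
    log_mel = np.log(mel + 1e-10)                          # (40, T)

    # per-frame log energy
    frames = librosa.util.frame(y, frame_length=400, hop_length=160)
    energy = np.log((frames ** 2).sum(axis=0) + 1e-10)     # (T',)

    static = np.vstack([log_mel[:, :energy.shape[0]], energy])  # (41, T')

    # first and second temporal differences
    d1 = librosa.feature.delta(static, order=1)
    d2 = librosa.feature.delta(static, order=2)
    feats = np.vstack([static, d1, d2]).T                  # (T', 123)

    # rescale to zero mean and unit variance (over the training set in
    # the paper; per-utterance here for simplicity).
    return (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
```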

1.2. Further Improvements

  • Further, β > 1 is introduced to sharpen the weights:

    αi,j = exp(β ei,j) / Σj exp(β ei,j)

  • A windowing technique is also proposed: at each step i, the attention mechanism considers only a subsequence of h instead of the full sequence, which lowers the complexity.
  • Smoothing is also proposed by replacing the exponential in the softmax with the sigmoid σ:

    αi,j = σ(ei,j) / Σj σ(ei,j)
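A small self-contained sketch contrasting these variants (standard/sharpened softmax, windowing, and sigmoid smoothing) on a toy score vector; the windowing form and width are assumptions for illustration.

```python
import numpy as np

def softmax_weights(e, beta=1.0):
    """alpha_j = exp(beta * e_j) / sum_j' exp(beta * e_j').
    beta = 1 is the standard softmax; beta > 1 sharpens the weights."""
    z = np.exp(beta * (e - e.max()))
    return z / z.sum()

def smoothed_weights(e):
    """Smoothing: alpha_j = sigmoid(e_j) / sum_j' sigmoid(e_j')."""
    s = 1.0 / (1.0 + np.exp(-e))
    return s / s.sum()

def windowed_scores(e, alpha_prev, w=75):
    """Windowing (assumed form): keep only scores within +/- w positions
    of the previous alignment's peak; the rest get -inf (zero weight)."""
    p = int(np.argmax(alpha_prev))
    out = np.full_like(e, -np.inf)
    lo, hi = max(0, p - w), min(len(e), p + w + 1)
    out[lo:hi] = e[lo:hi]
    return out

e = np.array([1.0, 2.0, 3.0])
print(softmax_weights(e))            # ~[0.09, 0.24, 0.67]
print(softmax_weights(e, beta=2.0))  # ~[0.02, 0.12, 0.87]  (sharper)
print(smoothed_weights(e))           # ~[0.29, 0.34, 0.37]  (flatter)
```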

2. Results

Phoneme error rates (PER)

With the convolutional (location-aware) features, a 3.7% relative improvement in PER over the baseline is observed, and a further 5.9% relative improvement with smoothing.
