Brief Review — Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition

Listen, Attend and Spell (LAS)

Sik-Ho Tsang
5 min read · Jan 6, 2024

Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition
Listen, Attend and Spell (LAS), by Carnegie Mellon University, and Google Brain
2016 ICASSP, Over 2500 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text Modeling
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] [ARSG] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====

  • For image captioning, there is an attention-based model called Show, Attend and Tell; LAS brings the same idea to speech recognition.
  • In this paper, Listen, Attend and Spell (LAS), an end-to-end model for speech recognition, is proposed. The model has two components: a listener and a speller.
  • The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs.
  • The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters and the entire acoustic sequence.

Outline

  1. Listen, Attend and Spell (LAS)
  2. Results

1. Listen, Attend and Spell (LAS)

The LAS model: a pyramidal BLSTM listener (encoder) followed by an attention-based speller (decoder)
  • LAS models each character output y_i as a conditional distribution over the previous characters y_<i and the input signal x, using the chain rule of probability: P(y|x) = ∏_i P(y_i | x, y_<i).
  • The Listen operation transforms the original signal x = (x_1, …, x_T) into a high-level representation h = (h_1, …, h_U) with U ≤ T.
  • The speller is an attention-based character decoder that performs an operation called AttendAndSpell, turning h and the previously emitted characters into the distribution over the next character (a minimal sketch of this factorization follows the list).
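A minimal sketch of this factorization; listen and attend_and_spell are placeholder callables standing in for the two networks, not the paper's implementation:

```python
import math

def las_log_prob(x, y, listen, attend_and_spell):
    # P(y|x) = prod_i P(y_i | x, y_<i)  (chain rule)
    h = listen(x)  # h = Listen(x), length U <= T
    # attend_and_spell(h, prefix) is assumed to return a dict mapping each
    # candidate character to its probability given x and the prefix y_<i.
    return sum(math.log(attend_and_spell(h, y[:i])[y[i]]) for i in range(len(y)))
```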

1.1. Listener

In practice, 40-dimensional log-mel filter bank features are computed every 10ms, which act as the acoustic inputs to the listener.
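
Below is a minimal sketch (not from the paper) of how such features could be computed with librosa; the 25 ms window length and 16 kHz sampling rate are assumptions, since the paper only specifies 40 mel channels and the 10 ms shift.

```python
import librosa

def logmel_features(wav_path, sr=16000):
    # 40-dimensional log-mel filter bank features with a 10 ms frame shift.
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=40,
        hop_length=int(0.010 * sr),   # 10 ms shift
        win_length=int(0.025 * sr))   # assumed 25 ms window
    return librosa.power_to_db(mel).T  # shape: (num_frames, 40)
```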

  • In the pBLSTM model, the outputs at consecutive steps of each layer are concatenated before being fed to the next layer: h_i^j = pBLSTM(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}]), where h_i^j is the i-th output of the j-th pBLSTM layer.

In the proposed model, 3 pBLSTM layers are stacked on top of the bottom BLSTM layer, reducing the time resolution by a factor of 2³ = 8 (a single pBLSTM layer is sketched below).
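
A minimal PyTorch sketch of one pBLSTM layer, assuming each pair of consecutive frames is concatenated along the feature dimension before the BLSTM; the hidden sizes and odd-length handling are illustrative choices, not the paper's exact setup:

```python
import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    """One pBLSTM layer (sketch): concatenate each pair of consecutive frames,
    halving the time resolution, then run a bidirectional LSTM over the result."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Frames are concatenated in pairs, so the LSTM sees 2 * input_dim features.
        self.blstm = nn.LSTM(2 * input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, T, input_dim)
        batch, T, dim = x.shape
        if T % 2 == 1:                           # drop the last frame if T is odd (assumption)
            x, T = x[:, :-1, :], T - 1
        x = x.reshape(batch, T // 2, 2 * dim)    # (batch, T/2, 2*input_dim)
        out, _ = self.blstm(x)                   # (batch, T/2, 2*hidden_dim)
        return out
```

Stacking three such layers on top of an ordinary BLSTM shortens the sequence from T to T/8 frames, which reduces the number of positions the attention mechanism has to scan.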

1.2. Attend and Spell

  • The distribution for y_i is a function of the decoder state s_i and the context c_i. The decoder state s_i is a function of the previous state s_{i-1}, the previously emitted character y_{i-1} and the context c_{i-1}. The context vector c_i is produced by an attention mechanism. Specifically: s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1}), c_i = AttentionContext(s_i, h), and P(y_i | x, y_<i) = CharacterDistribution(s_i, c_i),
  • where CharacterDistribution is an MLP with softmax outputs over characters, and RNN is a 2-layer LSTM.
  • At each time step i, the attention mechanism AttentionContext generates a context vector c_i encapsulating the information in the acoustic signal needed to generate the next character (see the decoder-step sketch after this list).
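
A minimal PyTorch sketch of one AttendAndSpell step under these definitions. The content-based attention uses linear projections φ and ψ and a single output layer in place of the paper's MLPs; those simplifications, and all layer sizes, are assumptions:

```python
import torch
import torch.nn as nn

class AttendAndSpellStep(nn.Module):
    """One decoder step (sketch): s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1}),
    c_i = AttentionContext(s_i, h), P(y_i | x, y_<i) = CharacterDistribution(s_i, c_i)."""
    def __init__(self, num_chars, hidden_dim):
        super().__init__()
        self.phi = nn.Linear(hidden_dim, hidden_dim)          # projects decoder state s_i
        self.psi = nn.Linear(2 * hidden_dim, hidden_dim)      # projects listener features h_u
        self.rnn = nn.LSTM(num_chars + 2 * hidden_dim, hidden_dim,
                           num_layers=2, batch_first=True)    # 2-layer LSTM decoder
        self.char_out = nn.Linear(3 * hidden_dim, num_chars)  # logits -> softmax over characters

    def forward(self, y_prev, c_prev, state, h):
        # y_prev: (B, num_chars) one-hot of y_{i-1}; c_prev: (B, 2*hidden_dim)
        # h: (B, U, 2*hidden_dim) listener outputs; state: LSTM state (None at i = 0)
        out, state = self.rnn(torch.cat([y_prev, c_prev], -1).unsqueeze(1), state)
        s_i = out.squeeze(1)                                   # (B, hidden_dim)
        # e_{i,u} = <phi(s_i), psi(h_u)>; alpha_i = softmax_u(e_i); c_i = sum_u alpha_{i,u} h_u
        e = torch.bmm(self.psi(h), self.phi(s_i).unsqueeze(-1)).squeeze(-1)  # (B, U)
        alpha = torch.softmax(e, dim=-1)
        c_i = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)      # (B, 2*hidden_dim)
        logits = self.char_out(torch.cat([s_i, c_i], -1))      # (B, num_chars)
        return logits, c_i, state
```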

1.3. Training

  • The model is trained to maximize the log probability of the correct sequences: max_θ Σ_i log P(y_i | x, y*_<i; θ), where y*_<i is the ground-truth prefix of characters (teacher forcing); see the loss sketch below.
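
Under teacher forcing this objective reduces to a per-character cross-entropy; a minimal sketch, where the logit/target shapes and the padding id are assumptions:

```python
import torch.nn.functional as F

def las_loss(logits, targets, pad_id=0):
    # logits: (batch, L, num_chars) from decoding with ground-truth prefixes y*_<i
    # targets: (batch, L) ground-truth character ids
    # Maximizing sum_i log P(y_i | x, y*_<i) == minimizing this cross-entropy.
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)
```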

1.4. Decoding and Rescoring

  • During inference, the most likely character sequence given the input acoustics is found: ŷ = argmax_y log P(y|x).
  • A simple left-to-right beam search is used. The beams are rescored by combining the model score with a language model probability: s(y|x) = log P(y|x) / |y|_c + λ log P_LM(y).
  • The model probability is normalized by the number of characters |y|_c, since the model has a small bias towards shorter utterances (a rescoring sketch follows the list).
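
A minimal sketch of this rescoring rule; the beam is assumed to be a list of (characters, acoustic log-probability) pairs, and lm.log_prob is a hypothetical language-model interface:

```python
def rescore(beam, lm, lam=0.008):
    # s(y|x) = log P(y|x) / |y|_c + lambda * log P_LM(y)
    best_score, best_chars = float("-inf"), None
    for chars, log_p_acoustic in beam:
        score = log_p_acoustic / max(len(chars), 1) + lam * lm.log_prob(chars)
        if score > best_score:
            best_score, best_chars = score, chars
    return best_chars
```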

2. Results

2.1. Dataset

  • A dataset with 3 million Google Voice Search utterances (representing 2000 hours of data) is used. Approximately 10 hours of utterances were randomly selected as a held-out validation set.
  • Data augmentation was performed using a room simulator, adding different types of noise and reverberations.
  • A separate set of 22K utterances representing approximately 16 hours of data were used as the test data. A noisy test set was also created using the same corruption strategy.

2.2. Word Error Rate (WER)

  • The state-of-the-art model on this dataset is a CLDNN-HMM system that was described in [22]. The CLDNN system achieves a WER of 8.0% on the clean test set and 8.9% on the noisy test set.

LAS achieved 14.1% WER on the clean test set and 16.5% WER on the noisy test set without any dictionary or language model.

Rescoring the LAS beams with the language model used by the CLDNN system, with weight λ = 0.008, improves the results on the clean and noisy test sets to 10.3% and 12.0% respectively.

2.3. Alignment

Alignment between the character outputs and the audio signal produced by the LAS model

The attention model was also able to identify the start and end of the utterance properly.
