Brief Review — Transformers with convolutional context for ASR

Cnv Cxt Tsf, Adding Convolutional Layers to Transformer

Sik-Ho Tsang
3 min read · Mar 29, 2024

Transformers with convolutional context for ASR (Cnv Cxt Tsf), by Facebook AI Research
2019 arXiv v2, Over 170 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text Modeling
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] [Deep KWS] 2015 [LibriSpeech] [ARSG] 2016 [Listen, Attend and Spell (LAS)] 2017 [CNN for KWS] 2018 [Speech Commands] 2019 [SpecAugment] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====

  • The sinusoidal positional embedding for Transformers is replaced with convolutionally learned input representations.
  • These contextual representations provide subsequent Transformer blocks with relative positional information needed for discovering long-range relationships between local concepts.


  1. Cnv Cxt Tsf (Transformers with convolutional context)
  2. Results

1. Cnv Cxt Tsf (Transformers with convolutional context)

1.1. Adding Convolution to Transformer

  • Each block in the model is repeated multiple times (shown on the top right corner of each block).
  • On the decoder side, a separate multi-head attention layer is used to aggregate encoder context for each decoder Transformer block.

Convolutional layers are added below the Transformer layers, and no positional encodings are used. As it goes deeper, the encoder effectively learns an acoustic language model over the bag of discovered acoustic units.

  • For the encoder, 2-D convolutional blocks are used, with layer norm and ReLU after each convolutional layer.
  • Each convolutional block contains K convolutional layers followed by a 2-D max-pooling layer, as shown in Figure 2 (right).
  • For the decoder, a similar approach is taken, using 1-D convolutions over embeddings of previously predicted words (shown in Figure 2 (left), with N 1-D convolutional layers in each decoder convolutional block).
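
The encoder front-end above can be sketched in PyTorch as follows. This is my own illustrative sketch, not the paper's code: the `Conv2dBlock` name is hypothetical, and the per-layer layer norm from the paper is omitted for brevity.

```python
import torch
import torch.nn as nn

class Conv2dBlock(nn.Module):
    """Sketch of one encoder conv block: K conv layers, each followed by
    ReLU (the paper also applies layer norm after each conv, omitted
    here), then a single 2-D max-pooling layer."""
    def __init__(self, in_ch, out_ch, k_layers=2, kernel=3, pool=2):
        super().__init__()
        layers = []
        for i in range(k_layers):
            layers.append(nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                                    kernel_size=kernel, padding=kernel // 2))
            layers.append(nn.ReLU())
        self.convs = nn.Sequential(*layers)
        self.pool = nn.MaxPool2d(pool)

    def forward(self, x):          # x: (batch, channels, time, freq)
        return self.pool(self.convs(x))

# Two blocks as in the canonical model: 64 then 128 feature maps.
frontend = nn.Sequential(Conv2dBlock(1, 64), Conv2dBlock(64, 128))
x = torch.randn(2, 1, 100, 80)     # (batch, 1, frames, mel bins)
y = frontend(x)
print(y.shape)                     # each max-pool halves time and freq
```

Each of the two max-pooling layers halves both the time and frequency axes, so the Transformer stack above the front-end sees a 4× shorter sequence.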

1.2. Overall Architecture

  • The standard convolutional Transformer model used in most experiments has the following configuration:
  1. 2 2-D convolutional blocks, each with two conv. layers with kernel size=3, max-pooling kernel=2. The first block has 64 feature maps while the second has 128.
  2. 10 encoder Transformer blocks all with Transformer dim=1024, 16 heads, intermediate ReLU layer size=2048.
  3. Decoder input word embedding dim=512.
  3. 3 1-D conv. layers, each with kernel size=3; no max pooling is used for the decoder-side 1-D convolutions.
  5. 10 decoder Transformer blocks each with encoder-side multihead attention.
  • This canonical model has about 223M parameters; training for all 80 epochs takes about 24 hours on 2 machines, each with 8 GPUs with 16 GB of memory.
  • A beam size of 5 is used for decoding.
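
A quick back-of-the-envelope check of the front-end's time downsampling under the configuration above (assuming non-overlapping max pooling with kernel and stride 2; the utterance length is illustrative, not from the paper):

```python
# With 2 conv blocks, each ending in a max-pool of stride 2, the time
# axis shrinks by a factor of 4 before reaching the Transformer stack.
def frames_after_frontend(n_frames: int, n_blocks: int = 2, pool: int = 2) -> int:
    """Number of frames reaching the Transformer after the conv front-end."""
    for _ in range(n_blocks):
        n_frames //= pool
    return n_frames

# A 10 s utterance at a 10 ms hop yields ~1000 filterbank frames:
print(frames_after_frontend(1000))   # -> 250, a 4x shorter sequence
```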

2. Results

2.1. Dataset

  • LibriSpeech dataset, containing 1000h of training data with development and test sets split into simple (“clean”) and harder (“other”) subsets, is used.
  • 5k “unigram” subword target units are used, learned with the SentencePiece package, with full coverage of all training text data.
  • Input speech is represented as 80-D log mel-filterbank coefficients plus three fundamental frequency features computed every 10ms with a 25ms window.
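
The framing arithmetic implied by the 10ms hop and 25ms window can be sketched as follows (my own illustration; a 16 kHz sample rate is assumed, as is standard for LibriSpeech):

```python
# Frame count for 80-D log mel-filterbank extraction: a 25 ms analysis
# window advanced every 10 ms (assumed 16 kHz sampling).
def num_frames(n_samples: int, sr: int = 16000,
               win_ms: float = 25.0, hop_ms: float = 10.0) -> int:
    win = int(sr * win_ms / 1000)   # 400 samples per window
    hop = int(sr * hop_ms / 1000)   # 160 samples per hop
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop

print(num_frames(16000))   # 1 second of audio -> 98 frames
```

Each frame then pairs the 80 mel coefficients with the three fundamental frequency features, giving an 83-D input vector per 10ms step.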

2.2. SOTA Comparison


Compared to models with no external LM, the proposed model brings 12% to 16% relative WER reduction on the acoustically challenging “dev other” and “test other” subsets of LibriSpeech.
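
For reference, "relative WER reduction" is computed as the drop in word error rate divided by the baseline's WER. A minimal sketch (the numbers below are made up for illustration only; see the paper's tables for actual WERs):

```python
# Relative WER reduction: (baseline - new) / baseline.
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    return (baseline_wer - new_wer) / baseline_wer

# e.g. a hypothetical drop from 12.0 to 10.2 WER is a 15% relative reduction:
print(round(relative_reduction(12.0, 10.2), 2))   # 0.15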
