Brief Review — Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Wav2Letter

Sik-Ho Tsang
5 min read · Aug 22, 2024

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
Wav2Letter, by Facebook AI Research, Menlo Park
2016 arXiv v2, Over 350 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text (STT)
1991 [MoE] … 2020 [FAIRSEQ S2T] [PANNs] [Conformer] [SpecAugment & Adaptive Masking] [Multilingual LibriSpeech (MLS)] 2023 [Whisper]
==== My Other Paper Readings Are Also Over Here ====

Outline

  1. Wav2Letter: Features & Models
  2. Wav2Letter: Auto Segmentation Criterion (ASG)
  3. Results

1. Wav2Letter: Features & Models

1.1. Features

3 types of input features are considered: MFCCs, power spectrum, and raw wave (a minimal feature-extraction sketch follows the list below).

  • MFCCs are carefully designed speech-specific features, often found in classical HMM/GMM speech systems [27] because of their dimensionality compression (13 coefficients are often enough to span speech frequencies).
  • Power-spectrum features are found in most recent deep learning acoustic models [1].
  • The raw wave has been somewhat explored in a few recent works [15, 16].
  • ConvNets have the advantage of being flexible enough to be used with any of these input feature types. Wav2Letter outputs letter scores (one score per letter, given a dictionary L).
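As a rough illustration only (not the paper's exact feature pipeline), the three input types could be obtained as follows; the torchaudio transforms, the 16 kHz sample rate and the parameter values are assumptions made for this sketch:

```python
import torch
import torchaudio

# Stand-in for real speech: 1 second of 16 kHz audio.
waveform = torch.randn(1, 16000)

# MFCCs: a compact, hand-designed speech representation (13 coefficients are often enough).
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)(waveform)

# Power spectrum: magnitude-squared STFT, common in recent deep acoustic models.
power_spec = torchaudio.transforms.Spectrogram(n_fft=400, power=2.0)(waveform)

# Raw wave: the samples themselves, fed directly to the ConvNet.
print(mfcc.shape, power_spec.shape, waveform.shape)
```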

1.2. ConvNet Acoustic Model

The Wav2Letter architecture.

The acoustic models considered in this paper are all based on standard 1D convolutional neural networks (ConvNets).

  • ConvNets interleave convolution operations with pointwise non-linearity operations.
  • HardTanh and ReLU were both tried as the non-linearity, with similar results.
  • When the input is the raw wave, large strides are applied on the input sequence in the first layers, which then play a role similar to MFCC filters.
  • The last layer outputs one score per letter in the letter dictionary.
  • The full network can be seen as a non-linear convolution with a kernel width of 31280 samples and a stride of 320 samples; given the 16 kHz sample rate of the data, label scores are produced using a window of 1955 ms, with steps of 20 ms (a rough PyTorch sketch of such a model follows this list).
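The sketch below shows the shape of such a model; the layer widths, kernel sizes and strides are illustrative placeholders, not the paper's exact hyper-parameters:

```python
import torch
import torch.nn as nn

num_letters = 30  # e.g. a-z, apostrophe, silence and repetition labels

# Wav2Letter-style acoustic model: strided 1D convolutions interleaved with
# pointwise non-linearities (HardTanh here), ending in one score per letter.
acoustic_model = nn.Sequential(
    # Large kernel/stride on the raw 16 kHz waveform to reduce the temporal resolution.
    nn.Conv1d(1, 250, kernel_size=250, stride=160),
    nn.Hardtanh(),
    nn.Conv1d(250, 250, kernel_size=48, stride=2),
    nn.Hardtanh(),
    nn.Conv1d(250, 2000, kernel_size=32, stride=1),
    nn.Hardtanh(),
    # Final layer: one un-normalized score per letter of the dictionary L, per output frame.
    nn.Conv1d(2000, num_letters, kernel_size=1),
)

waveform = torch.randn(1, 1, 16000 * 4)   # (batch, channels, samples): 4 s of audio
letter_scores = acoustic_model(waveform)  # (batch, num_letters, output frames)
print(letter_scores.shape)
```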

2. Wav2Letter: Auto Segmentation Criterion (ASG)

2.1. Prior Art: Connectionist Temporal Classification (CTC)

The CTC criterion graph.

CTC assumes that the network outputs probability scores, normalized at the frame level. It considers all possible sequences of letters (or any sub-word units) which can lead to a given transcription.

CTC also allows a special "blank" state to be optionally inserted between letters.

  • Figure 2a shows an example of the sequences accepted by CTC for a given transcription. In practice, this graph is unfolded as shown in Figure 2b, over the available frames output by the acoustic model.
  • CTC aims at maximizing the “overall” score of paths; for that purpose, it minimizes the Forward score:
  • where the “logadd” operation, also often called “log-sum-exp”, is defined as logadd(a, b) = log(exp(a) + exp(b)) (a reconstruction of the Forward-score equation is given after this list).
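A LaTeX reconstruction of the minimized Forward score, assuming f_{π_t}(x) denotes the acoustic-model score of label π_t at frame t and 𝒢_ctc(θ, T) the unfolded acceptance graph of Figure 2b for transcription θ over T frames:

\mathrm{CTC}(\theta, T) \;=\; -\operatorname{logadd}_{\pi \in \mathcal{G}_{\mathrm{ctc}}(\theta, T)} \; \sum_{t=1}^{T} f_{\pi_t}(x)

Since CTC scores are normalized at the frame level, minimizing this quantity directly maximizes the total probability of all accepted paths.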

2.2. Proposed Auto Segmentation Criterion (ASG)

The ASG criterion graph.

In this paper, an alternative to CTC is explored, with 3 differences: (i) there are no blank labels; (ii) the nodes carry un-normalized scores (and possibly un-normalized transition scores on the edges); (iii) global normalization is used instead of per-frame normalization.

  • The advantage of (i) is that it produces a much simpler graph (see Figure 3a and Figure 3b).
  • Without a blank label, letter repetitions are instead modeled with repetition character labels. For example “caterpillar” could be written as “caterpil2ar”.
  • With (ii), one can easily plug in an external language model. This could be particularly useful in future work, if one wanted to model higher-level representations than letters.
  • The normalization evoked in (iii) is necessary; it ensures incorrect transcriptions will have a low confidence.
  • “Auto Segmentation Criterion” (ASG) aims at minimizing:
  • where g_{i,j}(·) is a transition score model for jumping from label i to label j (a reconstruction of the ASG criterion is given after this list).
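A LaTeX sketch of the ASG criterion, reconstructed from the description in the next paragraph; the symbols 𝒢_asg(θ, T) (graph of letter sequences accepted for transcription θ) and 𝒢_full(T) (graph of all letter sequences over T frames) are assumed names:

\mathrm{ASG}(\theta, T) \;=\; -\operatorname{logadd}_{\pi \in \mathcal{G}_{\mathrm{asg}}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big) \;+\; \operatorname{logadd}_{\pi \in \mathcal{G}_{\mathrm{full}}(T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big)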

The left-hand part promotes sequences of letters leading to the right transcription, and the right-hand part demotes all sequences of letters. As for CTC, these two parts can be efficiently computed with the Forward algorithm.
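A minimal sketch of that Forward computation for the unconstrained (right-hand) term, in log space; the tensor names and shapes here are assumptions, and the constrained (left-hand) term would be computed the same way but only over paths accepted by the transcription graph:

```python
import torch

def forward_score(emissions: torch.Tensor, transitions: torch.Tensor) -> torch.Tensor:
    """Logadd (log-sum-exp) score of all label paths through the score lattice.

    emissions:   (T, L) un-normalized per-frame label scores f_i(x)
    transitions: (L, L) transition scores g_{i,j} for jumping from label i to label j
    Returns the logadd over all L^T possible paths, computed in O(T * L^2)
    by dynamic programming instead of enumerating every path.
    """
    T, L = emissions.shape
    alpha = emissions[0]  # scores of length-1 paths ending in each label
    for t in range(1, T):
        # alpha[j] = logadd_i( alpha[i] + g_{i,j} ) + f_j(x) at frame t
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)  # logadd over the final label
```

In this sketch, forward_score(emissions, transitions) gives the logadd over all letter sequences; subtracting the analogous score restricted to the transcription graph yields the ASG loss value.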

2.3. Beam-Search Decoder

  • A one-pass decoder is also written by the authors, which performs a simple beam-search with beam thresholding, histogram pruning and language model smearing:
  • where P_lm(θ) is the probability of the language model (a sketch of the decoder objective is given after this list).
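A plausible LaTeX sketch of the decoder objective, combining the ASG acoustic path score with the language model; the weights α (language model weight) and β (word insertion term) are assumed hyper-parameter names, not taken from the text above:

\hat{\theta} \;=\; \operatorname*{argmax}_{\theta} \Big[ \operatorname{logadd}_{\pi \in \mathcal{G}_{\mathrm{asg}}(\theta, T)} \sum_{t=1}^{T} \big( f_{\pi_t}(x) + g_{\pi_{t-1}, \pi_t}(x) \big) \;+\; \alpha \log P_{lm}(\theta) \;+\; \beta\,|\theta| \Big]

The beam search approximates this maximization without enumerating all candidate transcriptions.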

3. Results

3.1. ASG vs CTC

ASG vs CTC

Both the ASG and CTC criteria lead to the same LER. The ASG criterion is implemented in C (CPU only), leveraging SSE instructions when possible, and it appears faster on long sequences even though it runs on CPU only.

3.2. Different Feature Sets & Comparison With Baidu Deep Speech

  • Figure 4a shows that data augmentation helps for small training set sizes. However, with enough training data, the effect of data augmentation vanishes.

Figure 4b reports the WER with respect to the available training data size. The proposed approach compares very well against Deep Speech 1 & 2, which were trained with much more data.

LER/WER of the best sets of hyper-parameters for each feature type.

Both power-spectrum and raw features perform slightly worse than MFCCs.
