Brief Review — Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

CTC

Sik-Ho Tsang
3 min read · Jul 28, 2024

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
CTC, by Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), and Technische Universität München (TUM)
2006 ICML, Over 6700 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text (STT)
1991 … 2020 [FAIRSEQ S2T] [PANNs] [Conformer] [SpecAugment & Adaptive Masking] 2023 [Whisper]
==== My Other Paper Readings Are Also Over Here ====

  • Connectionist Temporal Classification (CTC) is proposed so that recurrent neural networks (RNNs) can label unsegmented sequences directly; it is evaluated on classifying phonemes in English speech.
  • (This paper is still frequently cited, e.g. whenever other research uses the CTC loss.)

Outline

  1. CTC Task, Network, Decoding
  2. Results

1. CTC Task, Network, Decoding

CTC Networks

1.1. Task

  • Let S be a set of training examples. Each example in S is a pair of sequences (x, z), where x is the input speech and z is the target phoneme sequence; the input and target sequences are not generally the same length. The aim is to use S to train a temporal classifier h: X → Z.
  • The label error rate (LER) of a temporal classifier h is defined as the normalised edit distance between its classifications and the targets on a test set S’:

    LER(h, S’) = (1/Z) · Σ_{(x, z) ∈ S’} ED(h(x), z)

  • where Z is the size of S’, and ED(p, q) is the edit distance between the two sequences p and q, i.e. the minimum number of insertions, substitutions and deletions required to change p into q. In other words, the aim is to minimise the rate of transcription mistakes.
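To make the definition concrete, here is a minimal sketch (not from the paper) that computes LER with a standard dynamic-programming edit distance; the normalisation by the number of examples follows the description above, and the function names are only illustrative:

```python
def edit_distance(p, q):
    """Minimum number of insertions, substitutions and deletions to turn p into q."""
    dp = list(range(len(q) + 1))                        # distances from an empty prefix of p
    for i in range(1, len(p) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(q) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (p[i - 1] != q[j - 1]))  # substitution (0 if labels match)
            prev = cur
    return dp[-1]


def label_error_rate(predictions, targets):
    """LER over a test set: total edit distance divided by Z (here, the number of examples)."""
    Z = len(targets)
    return sum(edit_distance(h_x, z) for h_x, z in zip(predictions, targets)) / Z


# label_error_rate([["a", "b", "c"], ["a", "a"]], [["a", "b"], ["a", "a"]]) -> 0.5  (1 edit over 2 examples)
```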

1.2. CTC Network

  • A CTC network has a softmax output layer (Bridle, 1990) with one more unit than there are labels in L.
  • Specifically, the CTC network used an extended bidirectional LSTM (BLSTM) architecture, with 100 blocks in each of the forward and backward hidden layers, hyperbolic tangent (tanh) for the cell input and output activation functions, and a logistic sigmoid in the range [0, 1] for the gates.
  • The input layer was of size 26, the softmax output layer of size 62 (61 phoneme categories plus the blank label), and the total number of weights was 114,662.
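As a rough modern analogue (an assumption, not the paper's original implementation, which used its own BLSTM and CTC forward-backward code), a model of the same shape can be sketched in PyTorch; the sizes follow the numbers quoted above, and torch.nn.CTCLoss stands in for the CTC objective:

```python
import torch
import torch.nn as nn

class CTCNetwork(nn.Module):
    """Bidirectional LSTM with a softmax output layer of |L| + 1 units (61 phonemes + blank)."""
    def __init__(self, n_inputs=26, n_hidden=100, n_labels=61):
        super().__init__()
        self.blstm = nn.LSTM(n_inputs, n_hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_labels + 1)   # +1 for the blank label

    def forward(self, x):                        # x: (batch, time, 26)
        h, _ = self.blstm(x)                     # h: (batch, time, 2 * 100)
        return self.out(h).log_softmax(dim=-1)   # per-frame log-probabilities

model = CTCNetwork()
ctc_loss = nn.CTCLoss(blank=61)                  # blank is the last output unit here

x = torch.randn(4, 150, 26)                      # dummy batch: 4 utterances, 150 frames of 26 features
log_probs = model(x).transpose(0, 1)             # CTCLoss expects (time, batch, classes)
targets = torch.randint(0, 61, (4, 30))          # dummy phoneme targets (never the blank index)
loss = ctc_loss(log_probs, targets,
                input_lengths=torch.full((4,), 150),
                target_lengths=torch.full((4,), 30))
loss.backward()
```

This matches the layer sizing (26 inputs → 2×100 hidden → 62 outputs), though the exact weight count will differ from the paper's 114,662 because the LSTM variants are not identical.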

1.3. Decoding

Prefix search decoding
  • Using the terminology of HMMs, the task of finding the most probable labelling for an input sequence is termed decoding.
  • The first method (best path decoding) is based on the assumption that the most probable path will correspond to the most probable labelling. However it is not guaranteed to find the most probable labelling.
  • The second method (prefix search decoding) relies on the fact that a modified forward-backward algorithm can efficiently calculate the probabilities of successive extensions of labelling prefixes (figure 2).
  • (If you already know beam search or Viterbi decoding, this part should be easier to follow.)
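For reference, best path decoding is simple enough to sketch directly (a sketch, assuming per-frame class probabilities with the blank at index 61 as in the setup above): take the argmax at each frame, then collapse repeated labels and remove blanks:

```python
import numpy as np

def best_path_decode(probs, blank=61):
    """Greedy CTC decoding: argmax per frame, merge repeated labels, drop blanks.

    probs: array of shape (time, n_classes) with per-frame label probabilities.
    """
    path = probs.argmax(axis=-1)                 # most probable label at each frame
    decoded, prev = [], None
    for label in path:
        if label != blank and label != prev:     # collapse repeats and skip blanks
            decoded.append(int(label))
        prev = label
    return decoded
```

Prefix search decoding instead accumulates the probability of each labelling prefix by summing over all paths that map to it, so it can recover labellings whose probability mass is spread across many paths, which is exactly the case where best path decoding fails.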

2. Results

2.1. Data

  • TIMIT contains recordings of prompted English speech, accompanied by manually segmented phonetic transcripts. It has a lexicon of 61 distinct phonemes, and comes divided into training and test sets containing 4620 and 1680 utterances respectively.
  • 5% (184) of the training utterances were chosen at random and used as a validation set for early stopping in the hybrid and CTC experiments.

2.2. Preprocessing

  • The audio data was preprocessed into 10 ms frames, overlapped by 5 ms, using 12 Mel-Frequency Cepstrum Coefficients (MFCCs) from 26 filter-bank channels. The log-energy was also included, along with the first derivatives of all coefficients, giving a vector of 26 coefficients per frame in total.
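A feature pipeline of this shape can be sketched with librosa (an assumption; the paper does not say which MFCC front end it used). The frame sizes, 12 MFCCs from 26 mel filter-bank channels, log-energy, and first derivatives follow the description above:

```python
import numpy as np
import librosa

def timit_style_features(wav_path, sr=16000):
    """12 MFCCs + log-energy over 10 ms frames with a 5 ms hop, plus first derivatives -> 26 dims per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(0.010 * sr), int(0.005 * sr)        # 10 ms window, 5 ms hop (160 / 80 samples at 16 kHz)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                                n_mels=26, n_fft=n_fft, hop_length=hop)
    log_energy = np.log(librosa.feature.rms(y=y, frame_length=n_fft,
                                            hop_length=hop) + 1e-10)
    static = np.vstack([mfcc, log_energy])               # (13, time): 12 MFCCs + log-energy
    deltas = librosa.feature.delta(static)               # first derivatives of all coefficients
    return np.vstack([static, deltas]).T                 # (time, 26)
```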

2.3. Performance

Label error rate (LER) on TIMIT

With prefix search decoding, CTC outperformed both a baseline HMM recogniser and an HMM-RNN hybrid with the same RNN architecture.

  • The results also show that prefix search decoding gave a small improvement over best path decoding.

