# Brief Review — Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

## CTC

Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, by Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), and Technische Universität München (TUM)

CTC, 2006 ICML, over 6700 citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text (STT), 1991 … 2020: [FAIRSEQ S2T] [PANNs] [Conformer] [SpecAugment & Adaptive Masking]; 2023: [Whisper]


**Recurrent neural networks (RNNs)** are used for **classifying phonemes in English speech.**

- (This paper is still frequently cited, e.g. when other researchers use the **CTC loss**.)

# Outline

1. **CTC Task, Network, Decoding**
2. **Results**

# 1. CTC Task, Network, Decoding

## 1.1. Task

- Let *S* be a set of **training examples**. Each example in *S* consists of **a pair of sequences (*x*, *z*)**, where *x* is the **input speech** and *z* is the **target phoneme sequence**. Thus, the input and target sequences are not generally the same length. The aim is to use *S* to **train a temporal classifier** *h*: *X* → *Z*.
- The **label error rate (LER)** of a temporal classifier *h* is defined as the normalised edit distance between its classifications and the targets on *S*′:

- where *Z* is the size of *S*′, and **ED(*p*, *q*)** is the **edit distance between the two sequences** *p* and *q*, i.e. the minimum number of insertions, substitutions and deletions required to change *p* into *q*. In other words, the aim is to **minimise the rate of transcription mistakes.**
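Spelled out with the definitions above (a reconstruction, following the review's convention that *Z* denotes the size of *S*′):

```latex
\mathrm{LER}(h, S') \;=\; \frac{1}{Z} \sum_{(x,z) \in S'} \mathrm{ED}\big(h(x),\, z\big)
```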

## 1.2. CTC Network

- A **CTC network** has a **softmax output layer** (Bridle, 1990) with one more unit than **there are labels in *L***.
- Specifically, the CTC network used an **extended bidirectional LSTM architecture, with 100 blocks in each of the forward and backward hidden layers**, hyperbolic tangent (tanh) for the cell input and output activation functions, and a logistic sigmoid in the range [0, 1] for the gates.
- The **input** layer was **size 26**, the **softmax** output layer **size 62 (61 phoneme categories plus the blank label)**, and the **total number of weights** was **114,662**.
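One way to account for the 114,662 weights quoted above is to assume a peephole LSTM, which was standard in Graves' work of this era. The layer sizes come from the review; the breakdown itself is an illustrative reconstruction, not taken from the paper:

```python
# Sketch: parameter count of the BiLSTM + softmax described above,
# assuming peephole LSTM cells (an assumption, not stated in the review).

def lstm_direction_params(n_in, n_hidden):
    """Weights of one LSTM direction with peephole connections."""
    # 4 gates (input, forget, output, cell candidate), each with
    # input weights, recurrent weights, and a bias.
    gates = 4 * (n_in + n_hidden + 1) * n_hidden
    # Peephole weights from the cell to the input, forget and output gates.
    peepholes = 3 * n_hidden
    return gates + peepholes

n_in, n_hidden, n_out = 26, 100, 62                 # 61 phonemes + blank
bilstm = 2 * lstm_direction_params(n_in, n_hidden)  # forward + backward
softmax = (2 * n_hidden + 1) * n_out                # both directions feed the output
total = bilstm + softmax
print(total)  # 114662
```

Under this accounting the numbers match exactly, which suggests the network did use peephole connections.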

## 1.3. Decoding

- Using the terminology of HMMs, **the task of finding this labelling** is termed **decoding**.
- **The first method (best path decoding)** is based on the assumption that **the most probable path will correspond to the most probable labelling.** However, it is not guaranteed to find the most probable labelling.
- **The second method (prefix search decoding)** relies on the fact that, by modifying the forward-backward algorithm, one can efficiently **calculate the probabilities of successive extensions of labelling prefixes (figure 2).**
- (If you know beam search or Viterbi decoding, etc., this part should be easier to follow.)
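Best path decoding is simple enough to sketch. This follows the usual CTC collapse rule (the variable names are mine, not the paper's): take the argmax label at each frame, merge consecutive repeats, then drop blanks:

```python
import numpy as np

BLANK = 0  # index of the blank label in the softmax output (an assumed convention)

def best_path_decode(probs):
    """probs: (T, L+1) array of per-frame label probabilities."""
    path = np.argmax(probs, axis=1)                # most probable path
    collapsed = [int(p) for i, p in enumerate(path)
                 if i == 0 or p != path[i - 1]]    # merge consecutive repeats
    return [p for p in collapsed if p != BLANK]    # remove blanks

# Tiny example: 5 frames, blank + 2 labels.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],    # repeated '1' collapses to one '1'
                  [0.9, 0.05, 0.05],  # a blank frame ends the first label
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])
print(best_path_decode(probs))  # [1, 2]
```

Prefix search decoding instead keeps a set of labelling prefixes and extends the most probable one at each step, which is why it can beat the single-path assumption above.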

# 2. Results

## 2.1. Data

**TIMIT** contains recordings of prompted English speech, accompanied by manually segmented phonetic transcripts. It has **a lexicon of 61 distinct phonemes**, and comes divided into **training and test sets** containing **4620 and 1680 utterances respectively**. **5% (184) of the training utterances** were chosen at random and used as a **validation set** for early stopping in the hybrid and CTC experiments.

## 2.2. Preprocessing

- The audio data was **preprocessed into 10 ms frames, overlapped by 5 ms, using 12 Mel-Frequency Cepstrum Coefficients (MFCCs) from 26 filter-bank channels.** The log-energy was also included, along with the first derivatives of all coefficients, giving **a vector of 26 coefficients per frame in total.**
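The framing and feature-vector arithmetic implied above can be checked directly. The 16 kHz rate is my assumption (it is TIMIT's sampling rate); the rest comes from the description:

```python
# Sketch of the preprocessing arithmetic described above.

sample_rate = 16000                    # TIMIT's sampling rate (assumed here)
frame_len = int(0.010 * sample_rate)   # 10 ms frames -> 160 samples
hop = int(0.005 * sample_rate)         # frames overlap by 5 ms -> 80-sample hop

n_mfcc = 12                            # MFCCs from 26 filter-bank channels
static = n_mfcc + 1                    # plus log-energy -> 13 static coefficients
feat_dim = static * 2                  # plus first derivatives of all of them
print(frame_len, hop, feat_dim)  # 160 80 26
```

This recovers the **26 coefficients per frame** stated in the paper.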

## 2.3. Performance

With prefix search decoding, CTC outperformed both a baseline HMM recogniser and an HMM-RNN hybrid with the same RNN architecture.

- They also show that **prefix search gave a small improvement over best path decoding.**