Brief Review — Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks
TED-LIUM 2: An Improved TED-LIUM Corpus
Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks
TED-LIUM 2, by University of Le Mans,
2014 LREC, Over 330 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text (STT)
- Last time, I described the TED-LIUM corpus. This time, let's dive into TED-LIUM 2 to learn more about the early development of ASR.
- Two improvements are made for the TED-LIUM corpus released in 2012:
- Addition of monolingual text data aimed at language modeling, filtered with data selection techniques.
- Addition of new acoustic data extracted from TED talks, along with corresponding automatically aligned transcripts and an updated training dictionary.
Outline
- TED-LIUM 2
- Results
1. TED-LIUM 2
1.1. Selecting Data for Language Modeling
- (I just present what they did mainly. To know the details of methodology, please read the paper directly.)
- XenC, an open-source data selection tool, is used. XenC is commonly applied in the Statistical Machine Translation field, where it generally helps achieve better BLEU scores.
In this paper, XenC is used to reduce the size of the training data, thus estimating smaller LMs and consequently improving decoding speed and disk usage.
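XenC-style selection is commonly described as cross-entropy difference filtering: each candidate sentence is scored by the difference between its cross-entropy under an in-domain LM and under a general-domain LM, and the lowest-scoring sentences are kept. The sketch below is a minimal illustration of that idea using add-one-smoothed unigram LMs (XenC itself uses higher-order models; all function names here are my own, not XenC's API).

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Build add-one-smoothed unigram counts from whitespace-tokenized sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return counts, total, vocab

def cross_entropy(sentence, lm):
    """Per-word negative log-probability of a sentence under a unigram LM."""
    counts, total, vocab = lm
    words = sentence.split()
    logp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return -logp / max(len(words), 1)

def select_in_domain(candidates, in_domain, general, keep=0.5):
    """Rank candidates by H_in(s) - H_gen(s); keep the lowest-scoring fraction."""
    in_lm, gen_lm = unigram_lm(in_domain), unigram_lm(general)
    scored = sorted(candidates,
                    key=lambda s: cross_entropy(s, in_lm) - cross_entropy(s, gen_lm))
    return scored[: max(1, int(len(scored) * keep))]
```

Sentences that look more like the in-domain data than the general data get negative scores and are retained, so the resulting LM training set is smaller but better matched to TED-style speech.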
1.2. Enhancing the Corpus with New Talks
1.2.1. Data
753 new talks are extracted from the TED website, accounting for 158 hours of raw acoustic data, compared to the 818 talks representing 216 hours of raw acoustic data extracted for the first release of TED-LIUM.
- This acoustic data has been automatically segmented using the in-house tool (Meignier and Merlin, 2010) to produce 61699 speech segments.
- The corresponding closed captions are extracted, representing about 1.4 million words of raw textual data.
1.2.2. Model
- A deep neural network (DNN) acoustic model is trained with the state-level minimum Bayes risk (sMBR) criterion.
The deep neural network has 7 layers for a total of 42.5 million parameters, and each of the 6 hidden layers has 2048 neurons. The output dimension is 10049 units and the input dimension is 440, which corresponds to an 11-frame window with 40 LDA parameters each.
- Weights for the network are initialized using 6 restricted Boltzmann machines (RBMs) stacked as a deep belief network (DBN).
- The cross-entropy between the training data and the network output is minimized. Training iterations are also used for segment alignment and selection.
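The stated dimensions can be checked against the 42.5M figure: a fully connected stack of 440 inputs, six hidden layers of 2048, and 10049 outputs gives the parameter count below (weights plus biases; this is my back-of-the-envelope check, not from the paper).

```python
def dnn_param_count(layer_dims):
    """Total weights + biases for a fully connected feed-forward stack."""
    return sum(d_in * d_out + d_out
               for d_in, d_out in zip(layer_dims, layer_dims[1:]))

# 440 inputs (11 frames x 40 LDA features), 6 hidden layers of 2048, 10049 outputs
dims = [440] + [2048] * 6 + [10049]
total = dnn_param_count(dims)  # ~42.5 million, matching the paper
```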
- (At that moment, the deep-learning-based 8-layer AlexNet had already been published without the need for pretraining using RBMs. Yet, it seems such techniques were still not so popular in ASR/STT.)
2. Results
The final system led to a notable WER reduction of 2.3 points (18.5% relative) with the updated acoustic models and neural network trained on the new training set, plus the updated language model.
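The two reported numbers imply the absolute WERs, even though the review does not state them: if 2.3 points corresponds to 18.5% relative, the baseline WER must be about 12.4% and the improved system about 10.1%. This back-calculation is mine, not from the paper.

```python
absolute_drop = 2.3    # WER reduction in points
relative_drop = 0.185  # the same reduction, 18.5% relative

baseline = absolute_drop / relative_drop  # implied baseline WER, ~12.4%
improved = baseline - absolute_drop       # implied final WER, ~10.1%
```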