Brief Review — Librispeech: An ASR corpus based on public domain audio books

The 1000-Hour LibriSpeech Corpus is Proposed

Sik-Ho Tsang
3 min read · Dec 21, 2023
Example grammar (G) acceptor for the second stage of the alignment algorithm

Librispeech: An ASR corpus based on public domain audio books
LibriSpeech, by The Johns Hopkins University
2015 ICASSP, Over 5500 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Speech Recognition / Speech-to-Text Modeling
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====

  • The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz.
  • Acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Outline

  1. LibriSpeech Corpus
  2. Benchmarking Results

1. LibriSpeech Corpus

1.1. Preprocessing & Alignment

  • Most acoustic model training procedures expect that the training data come in the form of relatively short utterances, usually up to a few tens of seconds in length, each with corresponding text.

Thus, the long audio recordings need to be aligned with their corresponding texts.

  • There are 3 stages of preprocessing and alignment:
  1. Text preprocessing, lexicon and LM creation
  2. First alignment stage
  3. Second alignment stage
  • (The steps are described in full detail in the paper; please refer to it directly.)

The whole alignment process took approximately 65 hours on two Amazon EC2 cc2.8xlarge instances, producing an initial set of aligned audio of approximately 1200 hours.
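To make the alignment stages a bit more concrete, here is a self-contained Python sketch of one building block such a pipeline relies on: scoring how well a decoded word sequence matches the reference book text via dynamic programming (word-level edit distance). This is purely illustrative and not the paper’s actual tooling, which is built on Kaldi models and custom decoding graphs.

```python
# Word-level edit distance between a decoded hypothesis and the
# reference book text. Alignment pipelines (and WER itself) build on
# this kind of dynamic programming; the example data are made up.

def edit_distance(hyp, ref):
    """Minimum number of word insertions/deletions/substitutions."""
    m, n = len(hyp), len(ref)
    # dp[i][j] = distance between hyp[:i] and ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from hyp
                           dp[i][j - 1] + 1,         # insert into hyp
                           dp[i - 1][j - 1] + cost)  # substitute / match
    return dp[m][n]

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(edit_distance(hyp, ref))  # 1 (one word missing from the hypothesis)
```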

1.2. Data Segmentation

  • The second stage of alignment, described above, yields a subset of audio segments, each up to 35 seconds long, that have a good likelihood of carrying accurate transcripts.
  • For training data, the rule was to split on any silence interval longer than 0.3 seconds.
  • For test data, splits were only allowed where those silence intervals coincided with a sentence break in the reference text. (Both rules are sketched in code below.)
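A minimal sketch of these two splitting rules, assuming silence intervals are already available as (start, end) times in seconds, e.g. from a forced alignment; all values are made up for illustration.

```python
# Sketch of the segmentation rules described above. Silence intervals
# come from an alignment step; which silences coincide with sentence
# breaks would be determined from the reference text.

MAX_SILENCE = 0.3  # seconds; training data is split on longer silences

def split_points(silences, sentence_break_silences=None):
    """Return split times (midpoints of qualifying silence intervals).

    Training rule: every silence longer than MAX_SILENCE qualifies.
    Test rule: pass the set of silences coinciding with sentence
    breaks; only those qualify.
    """
    points = []
    for start, end in silences:
        if end - start <= MAX_SILENCE:
            continue
        if (sentence_break_silences is not None
                and (start, end) not in sentence_break_silences):
            continue
        points.append((start + end) / 2.0)
    return points

silences = [(4.9, 5.1), (12.0, 12.6), (20.3, 21.0)]
print(split_points(silences))                   # training: [12.3, 20.65]
print(split_points(silences, {(20.3, 21.0)}))   # test: [20.65]
```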

1.3. Corpus Partition

  • The training portion of the corpus is split into three subsets, with approximate sizes of 100, 360 and 500 hours respectively.
  • The speakers in the corpus were ranked according to the WER of the WSJ model’s transcripts, and were divided roughly in the middle, with the lower-WER speakers designated as “clean” and the higher-WER speakers designated as “other” (see the sketch below).
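A minimal sketch of this clean/other partition; the speaker WERs below are invented for illustration.

```python
# Rank speakers by the WER of a first-pass (WSJ-trained) model's
# transcripts and split roughly in the middle: lower-WER speakers go
# to "clean", higher-WER speakers to "other". Data are made up.

speaker_wer = {"spk1": 5.2, "spk2": 18.7, "spk3": 7.9, "spk4": 25.1}

ranked = sorted(speaker_wer, key=speaker_wer.get)  # lowest WER first
mid = len(ranked) // 2
clean = ranked[:mid]   # ["spk1", "spk3"]
other = ranked[mid:]   # ["spk2", "spk4"]
print(clean, other)
```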

2. Benchmarking Results

WSJ’s Test Set
  • The acoustic models, referred to as SAT in the tables, are speaker-adapted GMM models [18, 19], and those referred to as DNN, are based on deep neural networks with p-norm non-linearities [23], trained and tested on top of fMLLR features.
  • The models marked with 460h are trained on the union of the “train-clean-100” and “train-clean-360” subsets, and those marked with 960h are trained on all of LibriSpeech’s training sets.

Acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

LibriSpeech’s test sets
  • Similarly, LibriSpeech’s language models are used with WSJ acoustic models to decode LibriSpeech’s test sets. For these tests the results in Table 3 were obtained by rescoring with the full 4-gram language model.

Acoustic models trained on LibriSpeech give lower error rates on the LibriSpeech test sets than models trained on WSJ.

Language models of different sizes.

Table 4 shows the word error rates for language models of different sizes.
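As a hedged illustration of how such n-gram LMs are applied, here is a minimal n-best rescoring sketch using the KenLM Python bindings with an ARPA-format 4-gram model. The file name, LM weight, and hypothesis scores are all hypothetical; LMs of different sizes (e.g. pruned variants) would simply be swapped in.

```python
# Rescore first-pass hypotheses with a 4-gram LM (pip install kenlm).
import kenlm

lm = kenlm.Model("4gram.arpa")  # hypothetical path to an ARPA 4-gram LM
LM_WEIGHT = 15.0                # hypothetical LM scale factor

# (acoustic log-score, hypothesis text) pairs from a first decoding pass
nbest = [
    (-120.4, "he went to the store"),
    (-119.8, "he went too the store"),
]

def total_score(acoustic, text):
    # kenlm.Model.score returns log10 P(text), with sentence begin/end
    return acoustic + LM_WEIGHT * lm.score(text, bos=True, eos=True)

best_text = max(nbest, key=lambda h: total_score(*h))[1]
print(best_text)
```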
