Brief Review — Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

Speech Commands, a Dataset of 105,829 Utterances of 35 Words

Sik-Ho Tsang
3 min read · Feb 9, 2024

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition, by Google Brain
2018 arXiv v1, Over 1400 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] [ARSG] 2016 [Listen, Attend and Spell (LAS)] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====

  • A dataset, namely Speech Commands, is proposed to facilitate research and development in automatic speech recognition (ASR).

Outline

  1. Speech Commands Dataset
  2. Evaluation

1. Speech Commands Dataset

1.1. Collection

  • Successful models need to cope with noisy environments, poor-quality recording equipment, and people talking in a natural, chatty way. To reflect this, all utterances were captured through phone or laptop microphones, wherever users happened to be.
  • The authors decided to focus on English.
  • Another goal was to record as many different speakers as possible.
  • To simplify training and evaluation, the authors restricted all utterances to a standard duration of one second and recorded only single words spoken in isolation. This also makes labeling much easier, since alignment is not as crucial (a minimal loading sketch that enforces this fixed length follows this list).
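
As a concrete illustration of the fixed one-second format, here is a minimal Python sketch that loads a clip and zero-pads or trims it to exactly 16,000 samples. The file path and helper name are hypothetical, not from the paper:

```python
import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16000  # dataset files are 16-bit PCM at 16 kHz

def load_fixed_length(path, target_len=SAMPLE_RATE):
    """Load an utterance and pad or trim it to exactly one second."""
    rate, samples = wavfile.read(path)               # int16 mono samples
    assert rate == SAMPLE_RATE, f"unexpected sample rate: {rate}"
    samples = samples.astype(np.float32) / 32768.0   # scale to [-1, 1)
    if len(samples) < target_len:                    # short clips: zero-pad
        samples = np.pad(samples, (0, target_len - len(samples)))
    return samples[:target_len]                      # long clips: truncate

x = load_fixed_length("speech_commands/yes/example.wav")  # hypothetical path
print(x.shape)  # (16000,)
```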

1.2. Word Choice

  • 20 common words were picked as the core of the vocabulary. These included the digits zero to nine and, in version 1, 10 words that would be useful as commands in IoT or robotics applications: “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, and “Go”.
  • In version 2 of the dataset, four more command words were added: “Backward”, “Forward”, “Follow”, and “Learn”.
  • One of the most challenging problems for keyword recognition is ignoring speech that doesn’t contain triggers. Some auxiliary words, such as “Tree”, were picked because they sound similar to target words. The final list was “Bed”, “Bird”, “Cat”, “Dog”, “Happy”, “House”, “Marvin”, “Sheila”, “Tree”, and “Wow” (the full vocabulary is assembled in the sketch after this list).
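
One detail worth noting: the word lists quoted above total 34, while the dataset contains 35 words; the released V2 archive also includes “Visual” as an auxiliary word. A small Python snippet assembling the full vocabulary:

```python
# Speech Commands V2 vocabulary, assembled from the lists above.
COMMANDS_V1 = ["yes", "no", "up", "down", "left", "right",
               "on", "off", "stop", "go"]
COMMANDS_V2_EXTRA = ["backward", "forward", "follow", "learn"]
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
# Auxiliary words for testing rejection of non-trigger speech.
# "visual" is not in the list quoted above but appears in the
# released V2 archive, bringing the total to 35 words.
AUXILIARY = ["bed", "bird", "cat", "dog", "happy", "house",
             "marvin", "sheila", "tree", "wow", "visual"]

ALL_WORDS = sorted(COMMANDS_V1 + COMMANDS_V2_EXTRA + DIGITS + AUXILIARY)
assert len(ALL_WORDS) == 35
```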

1.3. Setup

  • A web-based application was developed to collect the speech data.
  • The recording page asks users to press a “Record” button when they’re ready, and then displays a random word from the list described above. The word is displayed for 1.5 seconds while audio is recorded, and then another randomly-chosen word is shown after a one-second pause (a rough desktop analogue is sketched after this list).
  • (There are also sections on Quality Control, Extract Loudest Section, Manual Review, Release Process, and Background Noise; please read the paper directly if interested.)
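
The paper’s collection flow ran in a web app; as a rough desktop analogue (not the authors’ tooling), the timing loop might look like the following Python sketch, using the third-party sounddevice library and a shortened word list for illustration:

```python
import random
import time
import sounddevice as sd
from scipy.io.wavfile import write

SAMPLE_RATE = 16000
RECORD_SECONDS = 1.5   # word is displayed while audio is captured
PAUSE_SECONDS = 1.0    # gap before the next word appears

words = ["yes", "no", "up", "down"]  # shortened list for illustration

for i in range(3):
    word = random.choice(words)
    print(f"Say: {word!r}")
    audio = sd.rec(int(RECORD_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()                                  # block until recording finishes
    write(f"{word}_{i}.wav", SAMPLE_RATE, audio)
    time.sleep(PAUSE_SECONDS)
```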

1.4. Properties

  • The final dataset consists of 105,829 utterances of 35 words. Each utterance is stored as a one-second (or shorter) WAVE file, with the sample data encoded as linear 16-bit single-channel PCM values at a 16 kHz sample rate (a format check is sketched after this list).
  • 2,618 speakers were recorded.
  • The uncompressed files take up approximately 3.8 GB on disk, and can be stored as a 2.7 GB gzip-compressed tar archive.
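
These properties can be sanity-checked with the Python standard library’s wave module; the file path below is hypothetical:

```python
import wave

def check_format(path):
    """Confirm a clip matches the spec: 16-bit mono PCM at 16 kHz, at most 1 s."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1       # single channel
        assert wav.getsampwidth() == 2       # 16-bit samples
        assert wav.getframerate() == 16000   # 16 kHz sample rate
        assert wav.getnframes() <= 16000     # one second or less
        return wav.getnframes() / wav.getframerate()

print(check_format("speech_commands/yes/example.wav"))  # duration in seconds
```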

2. Evaluation

Top-One accuracy evaluations using different training data (table from the paper)

A model trained on V2 data but evaluated against the V1 test set gives a higher Top-One accuracy of 89.7%, which indicates that the V2 training data is responsible for a substantial improvement in accuracy over V1.
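
Top-One accuracy here is simply the fraction of test clips whose highest-scoring class matches the ground-truth label; a minimal NumPy sketch with toy numbers:

```python
import numpy as np

def top_one_accuracy(logits, labels):
    """Fraction of clips whose top-scoring class equals the true label."""
    predictions = np.argmax(logits, axis=1)  # top class per clip
    return float(np.mean(predictions == labels))

# Toy example: 4 clips, 3 classes
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.4, 0.2, 0.9],
                   [1.1, 0.8, 0.2]])
labels = np.array([0, 1, 2, 1])
print(top_one_accuracy(logits, labels))  # 0.75
```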

