Brief Review — Small-footprint keyword spotting using deep neural networks
Deep KWS, Outperforms Conventional HMM
Small-footprint keyword spotting using deep neural networks
Deep KWS, by Google Inc.
2014 ICASSP, Over 640 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] [ARSG] 2016 [Listen, Attend and Spell (LAS)] 2020 [FAIRSEQ S2T]
==== My Other Paper Readings Are Also Over Here ====
- The aim of this paper is to spot keywords in short audio segments with a small memory and compute footprint.
- A deep neural network (DNN), namely the Deep KeyWord Spotting (Deep KWS) model, is trained to directly predict the keyword(s) or sub-word units of the keyword(s); a posterior handling method then produces the final detection score.
Outline
- Deep KWS
- Results
1. Deep KWS
- The framework consists of three major components: (i) a feature extraction module, (ii) a deep neural network, and (iii) a posterior handling module.
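Tying the three components together, a hedged end-to-end sketch (using the hypothetical helpers `stack_frames`, `dnn_posteriors`, `smooth_posteriors`, and `confidence_scores` sketched in the subsections below, not the authors' code) might read:

```python
def deep_kws_pipeline(log_fbank, weights, biases, threshold=0.5):
    """Run the full Deep KWS pipeline on one utterance.

    The threshold value is a placeholder; the paper sweeps it to trace
    ROC curves rather than fixing a single operating point.
    """
    features = stack_frames(log_fbank)                      # (i) feature extraction
    posteriors = dnn_posteriors(features, weights, biases)  # (ii) deep neural network
    smoothed = smooth_posteriors(posteriors)                # (iii) posterior handling
    scores = confidence_scores(smoothed)
    return scores.max() > threshold
```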
1.1. Feature Extraction
- For the speech regions, acoustic features are generated based on 40-dimensional log-filterbank energies computed every 10 ms over a window of 25 ms.
- Contiguous frames are stacked to add sufficient left and right context. In this paper, 10 future frames and 30 past frames are used (a frame-stacking sketch follows below).
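As a rough illustration only (not the authors' code), the 30-past / 10-future frame stacking can be sketched in numpy; `log_fbank` is a hypothetical array of 40-dimensional log-filterbank frames:

```python
import numpy as np

def stack_frames(log_fbank, left=30, right=10):
    """Stack each frame with `left` past and `right` future frames.

    log_fbank: (num_frames, 40) array of log-filterbank energies
    returns:   (num_frames, (left + 1 + right) * 40) stacked features
    """
    num_frames, dim = log_fbank.shape
    # Pad by repeating edge frames so every frame has full context
    # (an assumption; the paper does not specify the edge handling).
    padded = np.pad(log_fbank, ((left, right), (0, 0)), mode="edge")
    stacked = np.stack([padded[t:t + left + 1 + right] for t in range(num_frames)])
    return stacked.reshape(num_frames, (left + 1 + right) * dim)
```

With 40-dimensional features this yields a (30 + 1 + 10) × 40 = 1,640-dimensional input vector per frame.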
1.2. Deep Neural Network (DNN)
- The deep neural network model is a standard feed-forward fully connected neural network with k hidden layers and n hidden nodes per layer; ReLU is used as the activation function for the hidden layers (a forward-pass sketch follows this subsection's bullets).
- For the proposed Deep KWS, the labels can represent entire words or sub-word units in the keyword/key-phrase.
- Suppose $p_{ij}$ is the neural network posterior for the $i$th label and the $j$th frame $x_j$, where $i$ takes values $0, 1, \dots, n-1$, with $n$ the total number of labels and $0$ the label for non-keyword. The DNN is optimized by maximizing the cross-entropy training criterion:

$$F = \sum_j \log p_{i_j j}$$

- where $i_j$ is the ground-truth label of frame $x_j$.
- Transfer Learning: The hidden layers of the network are initialized from a DNN trained for speech recognition with a suitable topology.
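As an illustration of the network shape only (weights and layer sizes here are placeholders, not the paper's trained model), a forward pass of such a fully connected ReLU network with a softmax output over the n labels could look like this:

```python
import numpy as np

def dnn_posteriors(x, weights, biases):
    """Forward pass of a feed-forward fully connected DNN.

    x:       (batch, input_dim) stacked input features
    weights: list of weight matrices, one per layer (k hidden + 1 output)
    biases:  list of bias vectors, same length as `weights`
    returns: (batch, n) softmax posteriors p_ij over the n labels
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)              # ReLU hidden layers
    logits = h @ weights[-1] + biases[-1]           # linear output layer
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)         # softmax posteriors
```

Training then maximizes $\sum_j \log p_{i_j j}$ over the labelled frames, i.e. the cross-entropy criterion above.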
1.3. Posterior Handling
- Posterior smoothing: Raw posteriors from the neural network are noisy, so the smoothed posterior $p'_{ij}$ of $p_{ij}$ is computed over a fixed time window of size $w_{smooth}$:

$$p'_{ij} = \frac{1}{j - h_{smooth} + 1} \sum_{k=h_{smooth}}^{j} p_{ik}$$

- where $h_{smooth} = \max\{1, j - w_{smooth} + 1\}$ is the index of the first frame within the smoothing window.
- Confidence: The confidence score at the $j$th frame is computed within a sliding window of size $w_{max}$, as follows:

$$\mathrm{confidence} = \left[ \prod_{i=1}^{n-1} \max_{h_{max} \le k \le j} p'_{ik} \right]^{\frac{1}{n-1}}$$

- where $h_{max} = \max\{1, j - w_{max} + 1\}$ is the index of the first frame within the sliding window.
- In this paper, $w_{smooth}$ = 30 frames and $w_{max}$ = 100 frames are used (a sketch of both steps follows below).
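A minimal numpy sketch of the posterior handling stage, assuming a hypothetical `posteriors` array of shape (num_frames, n) with label 0 reserved for non-keyword (a paraphrase of the equations above, not the authors' implementation):

```python
import numpy as np

def smooth_posteriors(posteriors, w_smooth=30):
    """Average each label posterior over the previous w_smooth frames."""
    num_frames, _ = posteriors.shape
    smoothed = np.zeros_like(posteriors)
    for j in range(num_frames):
        h = max(0, j - w_smooth + 1)          # first frame in the window (0-indexed)
        smoothed[j] = posteriors[h:j + 1].mean(axis=0)
    return smoothed

def confidence_scores(smoothed, w_max=100):
    """Confidence at frame j: geometric mean, over the n-1 keyword labels,
    of the largest smoothed posterior seen in the last w_max frames."""
    num_frames, n = smoothed.shape
    scores = np.zeros(num_frames)
    for j in range(num_frames):
        h = max(0, j - w_max + 1)             # first frame in the window (0-indexed)
        window_max = smoothed[h:j + 1, 1:].max(axis=0)   # skip label 0 (non-keyword)
        scores[j] = window_max.prod() ** (1.0 / (n - 1))
    return scores
```

A keyword is detected whenever the confidence score exceeds a predefined threshold.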
2. Results
2.1. Keywords & Datasets
- A full list of the keywords evaluated is shown above.
- Two sets of training data are used. The first set is a general speech corpus, which consists of 3,000 hours of manually transcribed utterances (referred to as VS data). The second set is keyword-specific data (referred to as KW data), which includes around 2.3K training examples for each keyword and 133K negative examples consisting of anonymized voice search queries and other short phrases.
- For the keyword “okay google”, 40K positive examples were available.
- The evaluation set contains roughly 1K positive examples for each keyword and 70K negative examples. For the keyword "okay google", there are 2.2K positive examples.
- The noisy test set was generated by adding babble noise to this test set at a 10 dB signal-to-noise ratio (SNR).
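As a rough sketch of how such a mixture can be built (the paper does not detail its exact noise-mixing procedure), the babble noise can be scaled so that the speech-to-noise power ratio matches the target SNR:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db=10.0):
    """Mix `noise` into `speech` at the requested SNR in dB.

    Both inputs are 1-D float arrays of the same length.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```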
2.2. Results
The proposed Deep KWS outperforms the baseline HMM KWS system even when it is trained with less data and has fewer parameters.
- For example, Deep 3×128 (KW) vs Baseline 3×128 (VS+KW) in Figure 3.
- The Deep 6×512 (KW) system actually performs worse than the smaller 3×128 models; it is conjectured that this is due to insufficient KW data to train the larger number of parameters.
- On noisy data, the Deep KWS system suffers a similar degradation. However, it still achieves a 39% relative improvement with respect to the baseline.
The Deep KWS system also leads to a simpler implementation by removing the need for a decoder, reduced runtime computation, and a smaller model, and is thus favored for the proposed embedded application.