Brief Review — Convolutional Neural Networks for Small-footprint Keyword Spotting


Sik-Ho Tsang
Feb 1, 2024
Keyword Spotting

Convolutional Neural Networks for Small-footprint Keyword Spotting, by Google, Inc.
2015 InterSpeech, Over 610 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text
1991 [MoE] 1997 [Bidirectional RNN (BRNN)] 2005 [Bidirectional LSTM (BLSTM)] 2013 [SGD+CR] [Leaky ReLU] 2014 [GRU] 2015 [LibriSpeech] [ARSG] 2016 [Listen, Attend and Spell (LAS)] 2020 [FAIRSEQ S2T]

  • A Convolutional Neural Network (CNN) is proposed for the keyword spotting (KWS) task.
  • Two use cases are considered: one limits the number of multiplications of the KWS system, and the other limits the number of parameters.
  • To the best of the authors’ knowledge, this is the first exploration of conventional sub-sampling in time with longer acoustic units.


  1. CNN for KWS
  2. Results

1. CNN for KWS

Convolutional Layer Then Max-Pooling Layer
  • 40-dimensional log-mel filterbank features are computed over 25 ms windows with a 10 ms frame shift. At every frame, 23 frames to the left and 8 frames to the right are stacked with the current frame, giving 32 frames in total, and the stacked features are fed into the neural network.
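The stacking step can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' code: the function name `stack_frames`, the toy input, and the edge handling (repeating the first or last frame) are all assumptions, since the review does not say how utterance boundaries are treated.

```python
def stack_frames(features, left=23, right=8):
    """Stack `left` past and `right` future frames around each frame.

    `features` is a list of per-frame vectors (e.g. 40 log-mel values each).
    Edge frames are padded by repeating the first/last frame, which is an
    assumption of this sketch, not something the review specifies.
    """
    num_frames = len(features)
    stacked = []
    for t in range(num_frames):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), num_frames - 1)  # clamp at the edges
            window.append(features[idx])
        stacked.append(window)
    return stacked


# Toy input: 100 frames (about 1 s of audio at a 10 ms shift), 40 dims each.
toy_features = [[float(t)] * 40 for t in range(100)]
stacked = stack_frames(toy_features)

# Each frame now carries a 32 x 40 context window (23 + 1 + 8 = 32).
print(len(stacked), len(stacked[0]), len(stacked[0][0]))
```

The asymmetric window (more past than future context) keeps the look-ahead, and so the latency, small, which matters for an always-on KWS system.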

Deep KWS is used as the baseline, and it is modified by replacing the DNN with a CNN.

1.1. Model Architecture

  • The log-mel input into the CNN is t×f = 32×40.

The first layer has a filter size in frequency of r = 9, with strides of s = 1 in time and v = 1 in frequency. Next, non-overlapping max-pooling is performed in frequency only, with a pooling region of q = 3.

  • The filter size in time is chosen to span roughly 2/3 of the overall input size in time, i.e. m = 20. n is the number of filters.

The second convolutional filter has a filter size of r = 4 in frequency, and no max-pooling is performed.
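To make the first-layer numbers concrete, here is a rough sketch of how the output size, parameter count, and multiplication count follow from t, f, m, r, s, v and q. The helper names are hypothetical, n = 64 is an assumed value chosen only for illustration (the text above leaves n unspecified), and the pooling is assumed to drop any leftover frequency bins.

```python
def conv_output_size(input_size, filter_size, stride=1):
    """Number of valid filter positions along one axis (no padding)."""
    return (input_size - filter_size) // stride + 1


def pool_output_size(input_size, pool_size):
    """Output length of non-overlapping max-pooling (remainder dropped,
    which is an assumption of this sketch)."""
    return input_size // pool_size


t, f = 32, 40   # input size in time and frequency (from the text)
m, r = 20, 9    # first-layer filter size in time and frequency
s, v = 1, 1     # strides in time and frequency
q = 3           # pooling region in frequency
n = 64          # number of filters: an ASSUMED value, for illustration only

time_out = conv_output_size(t, m, s)
freq_out = conv_output_size(f, r, v)
freq_pooled = pool_output_size(freq_out, q)

params = m * r * n                    # weights only: one input channel, no biases
mults = params * time_out * freq_out  # one multiply per weight per filter position

print("conv output (time x freq):", time_out, "x", freq_out)
print("after frequency pooling:", time_out, "x", freq_pooled)
print("parameters:", params, "multiplications:", mults)
```

Note how the multiplication count is the parameter count times the number of filter positions. This is why the paper treats "limit the parameters" and "limit the multiplications" as two different design problems.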

To keep the number of parameters below 250K, cnn-trad-fpool3 is proposed, which has two convolutional layers, one linear low-rank layer and one DNN layer.

cnn-one-fpool3 is also proposed, which has only one convolutional layer, as shown in the above table, compared with the two in Table 1.

Stride of v

cnn-one-fstride4 and cnn-one-fstride8 are proposed. Both use a frequency filter of size r = 8, strided by v = 4 (50% overlap) and v = 8 (no overlap) respectively.
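The overlap figures and the resulting number of frequency positions can be checked with a short sketch. The helper names are hypothetical, and the input size f = 40 is taken from the log-mel input described earlier.

```python
def conv_output_size(input_size, filter_size, stride=1):
    """Number of valid filter positions along one axis (no padding)."""
    return (input_size - filter_size) // stride + 1


def overlap_fraction(filter_size, stride):
    """Fraction of the filter shared between two neighbouring positions."""
    return max(filter_size - stride, 0) / filter_size


f, r = 40, 8  # frequency input size and frequency filter size (from the text)

results = {}
for v in (4, 8):
    results[v] = (conv_output_size(f, r, v), overlap_fraction(r, v))
    print(f"v = {v}: {results[v][0]} frequency positions, "
          f"{results[v][1]:.0%} overlap")
```

A larger stride v means fewer filter positions, so fewer multiplications in this layer and a smaller input to the next one, without needing a separate pooling step.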

The time filter stride s is also varied; these architectures are referred to as cnn-tstride2, cnn-tstride4 and cnn-tstride8.

An alternative to striding the filter in time is to pool in time.

  • The pooling size in time, p, is varied; these architectures are referred to as cnn-tpool2 and cnn-tpool4.
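The difference between striding in time and pooling in time is easiest to see in one dimension. The following is a toy sketch under stated assumptions, not the paper's implementation: the signal, the all-ones filter, and the helper names `conv1d_valid` and `max_pool1d` are made up for illustration, with only t = 32 and m = 20 taken from the text.

```python
def conv1d_valid(signal, kernel, stride=1):
    """Valid 1-D correlation; returns (outputs, number of multiplications)."""
    outputs, mults = [], 0
    for start in range(0, len(signal) - len(kernel) + 1, stride):
        outputs.append(sum(x * w for x, w in zip(signal[start:], kernel)))
        mults += len(kernel)
    return outputs, mults


def max_pool1d(values, pool_size):
    """Non-overlapping max-pooling; a trailing remainder is dropped."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


signal = [float((3 * i) % 7) for i in range(32)]  # toy 32-frame sequence
kernel = [1.0] * 20                               # toy filter spanning m = 20 frames

# Striding: filter positions are skipped, so their outputs are never computed.
strided, strided_mults = conv1d_valid(signal, kernel, stride=2)

# Pooling: every position is computed, then neighbours are reduced by max.
dense, dense_mults = conv1d_valid(signal, kernel, stride=1)
pooled = max_pool1d(dense, pool_size=2)

print("stride s=2:", len(strided), "outputs,", strided_mults, "multiplications")
print("pool   p=2:", len(pooled), "outputs,", dense_mults, "multiplications")
```

Both approaches shrink the time axis seen by later layers. Striding also saves multiplications in the convolution itself, while pooling keeps the full-resolution filter responses and only then sub-samples them.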

2. Results

  • For both clean and noisy speech, CNN performance improves as the pooling size is increased from p = 1 to p = 2, and seems to saturate after p = 3.

More importantly, the best performing CNN (cnn-trad-fpool3) shows improvements of over 41% relative compared to the DNN in clean and noisy conditions at the operating point of 1 FA/hr.
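For readers unfamiliar with "relative" improvements, here is a small sketch of the arithmetic. The two error rates below are hypothetical values picked only so that the result lands near 41%; they are not numbers from the paper, and the sketch assumes the compared metric is an error rate (lower is better) measured at the fixed 1 FA/hr operating point.

```python
def relative_improvement(baseline_rate, new_rate):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline_rate - new_rate) / baseline_rate


# HYPOTHETICAL error rates, for illustration only (not results from the paper).
dnn_error = 0.100
cnn_error = 0.059

rel = relative_improvement(dnn_error, cnn_error)
print(f"relative improvement: {rel:.0%}")
```

So a 41% relative improvement means the CNN's error rate is about 0.59 times the DNN's, not that it is 41 percentage points lower.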

  • System cnn-tpool2, which pools in time by p = 2, is the best performing system.

In addition, when predicting long keyword units, pooling in time gives a 6% relative improvement over cnn-trad-fpool3 in clean conditions, but performs similarly to cnn-trad-fpool3 in noisy conditions.


