# Brief Review — Convolutional Neural Networks for Small-footprint Keyword Spotting

**CNN for KWS**

Convolutional Neural Networks for Small-footprint Keyword Spotting, by Google, Inc.

CNN for KWS, 2015 Interspeech, Over 610 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text

==== My Other Paper Readings Are Also Over Here ====

**Convolutional Neural Network (CNN)** is proposed for the keyword spotting (KWS) task. **Two different applications** are proposed: **one** is to **limit the number of multiplications** of the KWS system, and **the other** is to **limit the number of parameters**. To the best of the authors' knowledge, this is the **first exploration of convolutional sub-sampling in time** with longer acoustic units.

# Outline

1. **CNN for KWS**
2. **Results**

# 1. CNN for KWS

**40-dimensional log-mel filterbank features** are computed every 25ms with a 10ms frame shift. Next, at every frame, **23 frames** to the left and **8 frames** to the right **are stacked**, and this is input into the neural network.
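The frame stacking above can be sketched in NumPy. This is an illustrative sketch, not the paper's code; the function name `stack_frames` and the random stand-in features are our own, while the 40-dim filterbank and the 23-left / 8-right context come from the text:

```python
import numpy as np

def stack_frames(feats, left=23, right=8):
    """Stack `left` past and `right` future frames around each frame.

    feats: (num_frames, 40) log-mel filterbank features.
    Returns one (left + 1 + right) x 40 window per valid centre frame.
    """
    windows = []
    for t in range(left, len(feats) - right):
        windows.append(feats[t - left : t + right + 1])
    return np.array(windows)

# 100 frames of 40-dim log-mel features (random stand-in values)
feats = np.random.randn(100, 40)
windows = stack_frames(feats)
print(windows.shape)  # each window is a 32x40 CNN input
```

Each stacked window is 23 + 1 + 8 = 32 frames of 40 features, matching the *t* × *f* = 32×40 input described next.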

Deep KWS is used as the baseline, modified so that a CNN is used instead of the DNN.

## 1.1. Model Architecture

- The **log-mel input** into the CNN is *t* × *f* = 32×40.

The first layer has a filter size in frequency of *r* = 9, with strides of *s* = 1 and *v* = 1 across both time and frequency. Next, non-overlapping max-pooling in frequency only is performed, with a pooling region of *q* = 3.

**A filter size in time** is chosen which spans 2/3 of the overall input size in time, i.e. *m* = 20. *n* is **the number of filters**.

The second convolutional filter has a filter size of *r* = 4 in frequency, and no max-pooling is performed.

To keep the number of parameters below 250K, cnn-trad-fpool3 is proposed, which has 2 convolutional layers, one linear low-rank layer, and one DNN layer.

cnn-one-fpool3 is proposed, where only 1 convolutional layer is used in the above table, compared with the architecture in Table 1.

cnn-one-fstride4 and cnn-one-fstride8 are proposed, which have a frequency filter of size *r* = 8 and stride the filter by *v* = 4 (50% overlap) or *v* = 8 (no overlap), respectively.
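The effect of the frequency stride on the multiply count can be illustrated with a small sketch (our own helper, assuming 'valid' convolution over the 40 filterbank channels and the *r* = 8 filter from the text):

```python
def freq_positions(f_in, r, v):
    """Number of frequency positions a filter of size r visits with stride v."""
    return (f_in - r) // v + 1

f_in, r = 40, 8   # 40 filterbank channels, frequency filter of size r = 8
for v in (1, 4, 8):
    print(v, freq_positions(f_in, r, v))
```

Since multiplications scale linearly with the number of filter positions, striding by *v* = 4 or *v* = 8 cuts the per-layer multiply count by roughly 33/9 ≈ 3.7× and 33/5 ≈ 6.6× relative to *v* = 1, at the cost of modeling overlap in frequency.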

By changing the time filter stride, these architectures are referred to as cnn-tstride2, cnn-tstride4, and cnn-tstride8.

An alternative to striding the filter in time is to pool in time.

**The pooling in time *p* is varied**, and these architectures are referred to as **cnn-tpool2 and cnn-tpool4**.
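Non-overlapping max-pooling in time can be sketched as follows. This is an illustrative NumPy implementation, not the paper's code, and the conv-output shape used in the example (13 time steps, 32 frequency bands, 64 filters) is an assumption:

```python
import numpy as np

def max_pool_time(x, p):
    """Non-overlapping max-pooling over the time axis (axis 0)."""
    t = (x.shape[0] // p) * p            # drop any remainder frames
    return x[:t].reshape(-1, p, *x.shape[1:]).max(axis=1)

# Assumed conv output: time x frequency x filters
x = np.random.randn(13, 32, 64)
print(max_pool_time(x, 2).shape)  # (6, 32, 64)
```

Pooling by *p* halves (for *p* = 2) the time resolution seen by the next layer, which reduces downstream multiplications much like time striding does, while the max operation keeps the strongest activation in each pooled region.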

# 2. Results

- For both clean and noisy speech, **CNN performance improves as the pooling size is increased from *p* = 1 to *p* = 2**, and seems to saturate after *p* = 3.

More importantly, the best-performing CNN (cnn-trad-fpool3) shows improvements of over 41% relative compared to the DNN in clean and noisy conditions at the operating point of 1 FA/hr.

- System **cnn-tpool2**, which pools in time by *p* = 2, is the **best-performing** system.

In addition, when predicting long keyword units, pooling in time gives a 6% relative improvement over cnn-trad-fpool3 in clean conditions, but has similar performance to cnn-trad-fpool3 in noisy conditions.