Brief Review — End-to-end Audio Classification with Small Datasets — Making It Work
CNN for Snore Sound Classification
CNN for Snore Sound Classification, by University of Augsburg and Imperial College London
2019 EUSIPCO, Over 10 Citations (Sik-Ho Tsang @ Medium)
Snore Sound Classification
2017 [InterSpeech 2017 Challenges: Addressee, Cold & Snoring] 2018 [MPSSC] [AlexNet & VGG-19 for Snore Sound Classification]
- MPSSC is a snore sound classification dataset. It consists of human snore sounds categorised into four classes, one of which has only a few training samples.
- In this paper, a convolutional neural network (CNN) is proposed for end-to-end snore sound classification; it is found to perform better than LSTM-based back-ends.
Outline
- MPSSC Dataset
- CNN for Snore Sound Classification
- Results
1. MPSSC Dataset
1.1. Dataset
- This dataset comprises 828 audio files containing single snore events from 219 subjects, collected at three different medical centres in Germany.
- Each event has been labeled as one of 4 snore types, corresponding to the location of the vibration in the upper airways: V (velum), O (oropharyngeal), T (tongue), E (epiglottis). The snore type is usually constant for all events from the same patient, except for one patient, for whom both ‘V’ and ‘E’ type snoring were observed.
- Subjects’ ages range between 24 and 78 years, with an average age of 49.8 years and no significant difference between snore types; most patients are male.
- The events have different durations ranging from 0.73 s to 2.75 s (mean: 1.51 s, standard dev.: 0.73 s).
1.2. Difficulties
- The MPSSC has two main difficulties:
Firstly, the number of instances is quite low and the classes are highly imbalanced (only 23 events of class T in Training and Development partition compared to 329 of class V).
Secondly, the level and spectrum of the background noise varies across single recordings.
2. CNN for Snore Sound Classification
2.1. Preprocessing
- All files from the MPSSC are provided with a sampling rate of 16 kHz. The waveforms have zero mean and their absolute maximum is normalised to 1.0 for each file separately.
- In order to deal with the different lengths, signals are continued periodically to match the length of the longest instance (2.75 s).
- All classes are balanced in the training partition by upsampling (duplicating) the instances of the minority classes.
- For each training epoch, the instances are re-ordered to appear in an alternating order (V, O, T, E, V, O, T, E, . . . ) so that each mini-batch is also balanced w. r. t. classes. This step proved to slightly increase the stability of training, because the cross-entropy loss weights each instance equally by default. (A minimal sketch of these preprocessing steps follows below.)
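As a rough illustration of these preprocessing steps, here is a minimal NumPy sketch. The function names (normalise, periodic_pad, balance_and_interleave) are my own and not from the paper; only the described operations (zero mean, max-abs scaling to 1.0, periodic continuation to 2.75 s, duplication-based upsampling with alternating class order) are taken from the text.

```python
# Minimal preprocessing sketch (NumPy only); names are illustrative, not the paper's.
import numpy as np

SR = 16000                      # MPSSC sampling rate (16 kHz)
TARGET_LEN = int(2.75 * SR)     # length of the longest snore event

def normalise(x):
    """Zero-mean the waveform and scale its absolute maximum to 1.0."""
    x = x - x.mean()
    return x / (np.abs(x).max() + 1e-9)

def periodic_pad(x, target_len=TARGET_LEN):
    """Continue the signal periodically until it matches the longest instance."""
    reps = int(np.ceil(target_len / len(x)))
    return np.tile(x, reps)[:target_len]

def balance_and_interleave(waves_by_class):
    """Upsample minority classes by duplication and emit instances in
    alternating class order (V, O, T, E, V, O, T, E, ...)."""
    n_max = max(len(v) for v in waves_by_class.values())
    upsampled = {c: [v[i % len(v)] for i in range(n_max)]
                 for c, v in waves_by_class.items()}
    ordered_x, ordered_y = [], []
    for i in range(n_max):
        for c in ("V", "O", "T", "E"):
            ordered_x.append(upsampled[c][i])
            ordered_y.append(c)
    return np.stack(ordered_x), ordered_y
```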
2.2. Proposed CNN
On the left side of Fig. 1, the feature extraction front-end architecture is shown.
- Starting from the raw uni-dimensional time signal, two blocks are used, consisting of two convolutional layers followed by a maximum-pooling and a Dropout layer each.
- Finally, a single convolutional layer followed by a maximum-pooling and a Dropout layer is attached.
- All convolutional layers are 1D; ReLU is used, but no batch normalization (BN) is applied.
- With the proposed front-end architecture and an input sampling rate of 16 kHz, the result is a feature sequence with a step size of 16 ms, where each feature frame takes into account a temporal context of approximately 32 ms (a sketch of the front-end is given below).
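A minimal Keras sketch of this front-end is given below, assuming the (M1, M2, M3) = (24, 48, 96) filter configuration and 50% Dropout that are selected later. The kernel and pooling sizes are not given in this review; the values below are purely illustrative and are only chosen so that the overall step size comes out at 16 ms (256 samples at 16 kHz).

```python
# Sketch of the convolutional front-end: two blocks of (Conv1D, Conv1D,
# MaxPooling1D, Dropout), then a single Conv1D with MaxPooling1D and Dropout.
# Kernel/pool sizes are assumptions, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_front_end(input_len=int(2.75 * 16000), dropout=0.5):
    inp = layers.Input(shape=(input_len, 1))        # raw 1D waveform
    x = inp
    # Two blocks: two Conv1D layers, then max-pooling and Dropout each.
    for filters, pool in [(24, 8), (48, 8)]:
        x = layers.Conv1D(filters, 8, padding="same", activation="relu")(x)
        x = layers.Conv1D(filters, 8, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool)(x)            # no batch normalisation anywhere
        x = layers.Dropout(dropout)(x)
    # Final single convolutional layer, max-pooling and Dropout.
    x = layers.Conv1D(96, 8, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(4)(x)                   # total downsampling 8*8*4 = 256 samples = 16 ms
    x = layers.Dropout(dropout)(x)
    return models.Model(inp, x, name="front_end")
```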
3 choices are tried for the back-end at the end of the network model (a sketch of all three follows this list):
- A bidirectional LSTM returning only the last output: The output activation is tanh. The LSTM layer is followed by a BN layer and finally a dense (i. e., fully-connected) layer with one neuron for each class. The final activation function is a sigmoid function as the results were more stable.
- A bidirectional LSTM returning a sequence as output: The output activation is also tanh and the layer is followed by BN. The sequence is processed by a time-distributed dense layer, i. e., a dense layer at each time-step, with one neuron per class (sigmoid activation). Finally, the sequence is subjected to an average pooling.
- Another convolutional layer: with a kernel size of 24, which is larger than those used in the front-end, to exploit a larger temporal context (approximately 400 ms at the 16 ms step size). The further layers are exactly the same as in model 2.
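Below is a minimal Keras sketch of these three back-end variants, matching the description above. The numbers of recurrent units and back-end filters are not stated in this review, so units=64 and filters=64 are assumed placeholders, and the ReLU in the convolutional back-end is likewise an assumption.

```python
# Three back-end variants attached to the front-end feature sequence `x`.
from tensorflow.keras import layers

def backend_lstm_last(x, n_classes=4, units=64):
    # (1) Bidirectional LSTM returning only the last output (tanh activation),
    #     followed by batch normalisation and a dense layer with sigmoid output.
    h = layers.Bidirectional(layers.LSTM(units))(x)
    h = layers.BatchNormalization()(h)
    return layers.Dense(n_classes, activation="sigmoid")(h)

def backend_lstm_sequence(x, n_classes=4, units=64):
    # (2) Bidirectional LSTM returning the full sequence, batch normalisation,
    #     a time-distributed dense layer per step, then average pooling.
    h = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
    h = layers.BatchNormalization()(h)
    h = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(h)
    return layers.GlobalAveragePooling1D()(h)

def backend_conv(x, n_classes=4, filters=64):
    # (3) A further Conv1D with kernel size 24 (~400 ms of context at a 16 ms
    #     step), followed by the same layers as variant (2).
    h = layers.Conv1D(filters, 24, padding="same", activation="relu")(x)
    h = layers.BatchNormalization()(h)
    h = layers.TimeDistributed(layers.Dense(n_classes, activation="sigmoid"))(h)
    return layers.GlobalAveragePooling1D()(h)

# Example use with the front-end sketch above (also an assumption):
# fe = conv_front_end(); model = models.Model(fe.input, backend_conv(fe.output))
```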
3. Results
- A mini-batch size of 20 is used.
- The network is trained for up to 150 epochs, stopping at the epoch where the performance on the respective validation partition reaches its maximum.
- The average unweighted average recall (UAR) on the Development/Training partitions in a 2-fold cross-validation (CV) setup is considered.
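The UAR is the standard challenge metric here: the recall of each of the four classes, averaged with equal weight regardless of class size (equivalent to sklearn's recall_score with average="macro"). A minimal NumPy sketch:

```python
# Unweighted average recall (UAR): mean of per-class recalls.
import numpy as np

def uar(y_true, y_pred, classes=("V", "O", "T", "E")):
    recalls = []
    for c in classes:
        mask = np.asarray(y_true) == c
        recalls.append(np.mean(np.asarray(y_pred)[mask] == c))
    return float(np.mean(recalls))
```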
3.1. Hyperparameter Tuning
- 3 different configurations for the numbers of filters, (M1, M2, M3) ∈ {(12, 24, 48), (24, 48, 96), (48, 96, 192)}, are evaluated, and Dropout is optimised over the grid {0%, 10%, …, 80%}.
- Each experiment is run 5 times, mean and standard deviations of the UAR are shown in Fig. 2.
To have a suitable trade-off between the number of parameters and the model accuracy, the feature-extraction front-end with (24, 48, 96) filters and a Dropout of 50% are selected.
3.2. CNN vs LSTM
It is evident that the convolutional back-end model achieves the best results: a UAR of 60.2% in the Training/Development CV, leading to a UAR of 67.0% on the Test set.
3.3. SOTA Comparisons
The proposed approach keeps up well with the state of the art. Surprisingly, the UAR on the Development set is higher than that of most other approaches.
Moreover, a fusion of the proposed Test predictions with the baseline predictions results in a UAR of 69.8%.
- On an Nvidia GTX Titan X GPU, training the proposed optimum architecture on one mini-batch (size 20) takes approximately 63 ms, summing up to a training time of 4:45 minutes for 150 epochs.