Brief Review — Automatic classification of snoring sounds from excitation locations based on prototypical network

Meta Learning Using Prototypical Network

Sik-Ho Tsang
5 min read · Oct 19, 2024

Automatic classification of snoring sounds from excitation locations based on prototypical network
Snore Sound Meta Learning, by South China University of Technology, Guangzhou
2022 Applied Acoustics (Sik-Ho Tsang @ Medium)

Snore Sound Classification
2017
[INTERSPEECH 2017 Challenges: Addressee, Cold & Snoring] 2018 [MPSSC] [AlexNet & VGG-19 for Snore Sound Classification] 2019 [CNN for Snore] 2020 [Snore-GAN]
==== My Healthcare and Medical Related Paper Readings ====
==== My Other Paper Readings Are Also Over Here ====

  • A meta-learning algorithm named prototypical network is proposed.
  • The network is a CNN model with 6 convolution layers and a complement cross-entropy (CCE) loss function.

Outline

  1. Snore Sound Meta Learning
  2. Results

1. Snore Sound Meta Learning

Snore Sound Meta Learning

1.1. Prototypical Network

  • The prototypical network is an effective meta-learning algorithm to solve the Few-Shot Learning (FSL) problem.
  • The data in the train set and test set are further divided into train-support, train-query, test-support, and test-query sets, respectively, to cope with the small dataset.
  • A straightforward way to construct episodes for the prototypical network is an N-way-K-shot strategy, where N is the number of classes to classify and K is the number of examples per class in the train-support and test-support sets.
  • The model is trained on the train-support and train-query sets, fine-tuned on the test-support set, and tested on the test-query set.

The model learns a general embedding space from datasets with various outputs, and then directly applies it to the new few-shot task without retraining.

  • N = 4, as there are 4 classes (V, O, T, E) in the MPSSC dataset.
  • Considering the limited number of type T sounds in the original partition (train: 8, development: 15, test: 16), K is a key hyperparameter restricted by the type T sounds; it is set to 14.
  • During the learning process, the data in the train, development, and test sets is further divided into a support set and a query set (a minimal episode-sampling sketch is given after this list).
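As a rough illustration of the N-way-K-shot episode construction described above, here is a minimal sketch in Python. The function name, the number of queries per class, and the (feature, label) input format are assumptions for illustration only; the paper itself only fixes N = 4 and K = 14.

```python
import random
from collections import defaultdict

def sample_episode(examples, n_way=4, k_shot=14, n_query=5):
    """Assemble one N-way-K-shot episode from (feature, label) pairs.

    n_query (queries per class) is an assumed hyperparameter;
    the paper only fixes N = 4 and K = 14 for the MPSSC classes.
    """
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append(x)

    # Classes with fewer than k_shot + n_query examples (e.g. type T before
    # the ED/BD augmentation of Section 1.5) cannot be sampled here.
    classes = random.sample(sorted(by_class), n_way)
    support, query = [], []
    for y in classes:
        pool = random.sample(by_class[y], k_shot + n_query)
        support += [(x, y) for x in pool[:k_shot]]   # K support examples per class
        query += [(x, y) for x in pool[k_shot:]]     # the rest form the query set
    return support, query
```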

1.2. Meta-Training Phase

Mel-spectrograms
  • Mel-spectrograms of size 3×256×256 are first extracted as features to represent the snoring sounds (a rough extraction sketch follows).
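A minimal sketch of this feature-extraction step, assuming a typical librosa pipeline; the sampling rate, number of Mel bands, and dB scaling are assumptions, and only the 3×256×256 output shape comes from the paper.

```python
import numpy as np
import librosa
import cv2

def snore_to_melspec(wav_path, sr=16000, n_mels=128, size=(256, 256)):
    """Convert a snore recording into a 3x256x256 Mel-spectrogram 'image'.

    The target shape follows the paper; all other parameters here are
    assumptions for illustration only.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)            # log-Mel in dB

    # Min-max normalise to [0, 1] and resize to the target resolution.
    mel_img = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    mel_img = cv2.resize(mel_img, size)

    # Replicate to 3 channels so a standard image CNN can consume it.
    return np.stack([mel_img] * 3, axis=0)                   # shape (3, 256, 256)
```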

During the meta-training phase, a CNN with learnable parameters Φ is applied to learn a lower-dimensional embedding vector fΦ(xi) for each example xi in the meta-training set.

The learned embedding vectors of the support set are then used to form the prototype of each class.

  • The prototype ck of class k is simply represented by the mean of the embedded support points belonging to its class:
  • For a query point x, the conditional probability for it belonging to class k is given as:
  • where the distance function d(·) is the Euclidean distance (all three expressions are written out below):
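For reference, these three expressions take the standard prototypical-network form, with Sk denoting the support set of class k:

```latex
c_k = \frac{1}{|S_k|} \sum_{(x_i,\, y_i) \in S_k} f_{\Phi}(x_i),
\qquad
p_{\Phi}(y = k \mid x) = \frac{\exp\left(-d\left(f_{\Phi}(x),\, c_k\right)\right)}{\sum_{k'} \exp\left(-d\left(f_{\Phi}(x),\, c_{k'}\right)\right)},
\qquad
d(z, z') = \lVert z - z' \rVert_2 .
```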

The model aims to obtain an optimal embedding space in which a query point has a high conditional probability for its correct class prototype.

  • The meta-training proceeds by minimizing the loss function (written out below):
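In the standard prototypical-network formulation, this loss is the negative log-probability of the correct class k for each query point x, averaged over the query set:

```latex
J(\Phi) = -\log p_{\Phi}(y = k \mid x)
```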

1.3. Meta-Testing Phase

  • During the testing process, the test set is randomly divided into a test-support set and a test-query set.

Examples in the test-support set are first used to fine-tune the model pre-trained in the meta-training step, constructing an optimal embedding space for the test classes by minimizing the distance between the test-support set and the pre-trained prototypes.

Each test-query point is then classified by computing the distance between its learned embedding vector and the prototypes in the embedding space.
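A minimal sketch of this nearest-prototype step, assuming embed is the (fine-tuned) embedding CNN and the tensors are already batched; the helper itself is hypothetical, not the authors' code.

```python
import torch

def classify_queries(embed, support_x, support_y, query_x, n_way=4):
    """Nearest-prototype classification for one test episode."""
    with torch.no_grad():
        s_emb = embed(support_x)                        # (N*K, D) support embeddings
        q_emb = embed(query_x)                          # (Q, D) query embeddings

        # Class prototypes: mean embedding of each class's support points.
        protos = torch.stack([s_emb[support_y == k].mean(dim=0)
                              for k in range(n_way)])   # (N, D)

        # Euclidean distance from every query to every prototype.
        dists = torch.cdist(q_emb, protos)              # (Q, N)
        return dists.argmin(dim=1)                      # predicted class per query
```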

1.4. CNN Model Architecture

3 CNN Model Architectures
  • In this paper, three kinds of common CNN structures widely used in FSL tasks [42] were discussed.
  • One is a 4-layer CNN, one is a 6-layer CNN, and one is a 6-layer CNN with pooling used in the first 2 convolution layers (a generic backbone sketch is given after this list).
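For orientation, here is a generic ConvN embedding backbone in the style common to FSL work. The 64-channel width, kernel size, and pooling placement are assumptions and not the paper's exact configuration.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, pool=True):
    """One conv-BN-ReLU(-maxpool) block, the usual FSL building unit."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

def make_embedding_net(n_layers=6, width=64, pool_layers=None):
    """ConvN embedding backbone; 64 channels per block is the common FSL
    default and only an assumption about this paper's models."""
    if pool_layers is None:
        pool_layers = [True] * n_layers      # restrict pooling by passing e.g. [True, True, False, ...]
    blocks, in_ch = [], 3                    # 3-channel Mel-spectrogram input
    for i in range(n_layers):
        blocks.append(conv_block(in_ch, width, pool=pool_layers[i]))
        in_ch = width
    blocks.append(nn.Flatten())
    return nn.Sequential(*blocks)
```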

1.5. Others

  • Enhanced data (ED): The training set was doubled using the enhancement method of [43], randomly recombining snoring sounds of the same class to obtain enhanced data (ED) (V: 336, O: 154, T: 16, E: 60).
  • Balanced data (BD): Image augmentation methods, such as adding random noise, are used to obtain balanced data (V: 168, O: 152, T: 126, E: 150).
  • Besides the common cross-entropy (CE) loss function, complement cross entropy (CCE) was also applied in this work to address the imbalanced class distribution of the train partition:
cross-entropy (CE) (Eq. from [40])
Complement Cross Entropy (CCE) (Eq. from [40])
  • In brief, CCE is calculated as the mean of Shannon’s entropies over the incorrect classes of all examples.

As in [40], the sum of the CE and CCE losses is used for training.
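Written out loosely (the exact balancing factor and sign convention follow [40]), the complement term is the Shannon entropy of the predicted distribution restricted to the incorrect classes, averaged over the N training examples, and it is added to the usual CE loss:

```latex
\mathrm{CCE} \propto -\frac{1}{N} \sum_{i=1}^{N} \sum_{j \ne g_i}
\frac{\hat{y}^{(i)}_j}{1 - \hat{y}^{(i)}_{g_i}}
\log \frac{\hat{y}^{(i)}_j}{1 - \hat{y}^{(i)}_{g_i}},
\qquad
\mathcal{L} = \mathrm{CE} + \mathrm{CCE},
```

where g_i is the ground-truth class of example i and ŷ^(i) is its softmax output.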

2. Results

UAR With Different Combination Settings

Under the same preprocessing method (ED with CCE), Conv6NP performed better than Conv4 and Conv6, with improvements of 7.63% (p < 0.05) and 17.88% (p < 0.05), respectively.

UAR with different train, val, test set combinations

The average UAR over all train, val, test set combinations is 70.53%, which is an improvement of 17.73% (p < 0.05) over the baseline of 55.8% obtained by [17].

SOTA Comparisons

The proposed approach obtains the highest UAR under the predefined data partition, 77.13%, and the highest average UAR over all permutations, 70.53%.


Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
