Brief Review — Improving snore detection under limited dataset through harmonic/percussive source separation and convolutional neural networks

CQT + HPSS + CNN + LSTM

Sik-Ho Tsang
6 min read · Oct 10, 2024

Improving snore detection under limited dataset through harmonic/percussive source separation and convolutional neural networks
CQT + HPSS + CNN + LSTM, by University of Jaen
2024 Applied Acoustics (Sik-Ho Tsang @ Medium)

Snore Sound Classification
2017 [INTERSPEECH 2017 Challenges: Addressee, Cold & Snoring] 2018 [MPSSC] [AlexNet & VGG-19 for Snore Sound Classification] 2019 [CNN for Snore] 2020 [Snore-GAN]
==== My Healthcare and Medical Related Paper Readings ====
==== My Other Paper Readings Are Also Over Here ====

  • A novel method is proposed to differentiate monaural snoring from non-snoring sounds by analyzing the harmonic content of the input sound using harmonic/percussive sound source separation (HPSS).
  • The harmonic spectrogram obtained from HPSS is fed into a convolutional neural network (CNN) for classification.
  • A related paper using CQT + CNN + LSTM was published in 2021 in Elsevier J. CMPB.

Outline

  1. Dataset Construction
  2. CQT + HPSS + CNN + LSTM
  3. Results

1. Dataset Construction

Snoring and Non-Snoring Sound Dataset
  • A 𝐷𝑇 database composed of snoring monaural sounds from the 𝐷𝑆 database and non-snoring monaural sounds from the 𝐷𝑁 database has been created.
  • Particularly, the snoring sounds dataset 𝐷𝑆 encompasses 2500 monaural snoring events from 50 participants [42] and 828 monaural snoring events from 219 patients (MPSSC) [14].
  • Consequently, 𝐷𝑆 is composed of 3328 snoring sound events belonging to 269 subjects.
  • The duration of each snoring sound has been set to 3.5 s. A duration less than 3.5 s has been padded with zeros. Each snoring sound event with a duration exceeding 3.5 s has been truncated.
  • The non-snoring sounds dataset 𝐷𝑁 was initially composed of 10000 monaural audio events categorized in four classes of non-snoring sounds:
  1. Clinical ambient sounds [54]: These sounds have been extracted from 75 audio files, partitioned into 3.5 s segments, producing a total of 2744 clinical noise events, of which only 2500 were randomly chosen for inclusion in the dataset.
  2. Household noises from the DESED dataset [58,59]: A subset of 2500 sound events recorded in a real (non-synthesized) home environment were randomly selected.
  3. Room sounds [42]: composed of 2500 events characterized by shallow breaths or ambient quietness commonly observed in a clinical or home room.
  4. Cough sounds [63]: A subset of 2500 sound events composed of cough.
  • Subsequently, 3328 non-snoring events were extracted from 𝐷𝑁 to mix with the snoring events, while the remaining 6672 non-snoring events were used to form the set of interfering non-snoring events 𝐷𝑁.
  • Specifically, each snoring sound from 𝐷𝑆 was combined with a non-snoring sound from 𝐷𝑁 at signal-to-noise ratios (SNRs) of −5 dB, 0 dB, and 5 dB, resulting in a total of 9984 noisy snoring events (a minimal mixing sketch is shown after this list).
  • A subject-independent split is used: 90% for training and 10% for testing.
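Below is a minimal sketch of how such mixtures could be produced, assuming NumPy and a simple power-based SNR scaling; the helper names (fix_length, mix_at_snr) are hypothetical and not from the paper.

```python
import numpy as np

SR = 48_000                  # sampling rate (48 kHz, as used for the CQT later)
CLIP_LEN = int(3.5 * SR)     # every event is fixed to 3.5 s

def fix_length(x, n=CLIP_LEN):
    """Zero-pad or truncate a mono signal to exactly n samples."""
    return np.pad(x, (0, n - len(x))) if len(x) < n else x[:n]

def mix_at_snr(snore, noise, snr_db):
    """Scale `noise` so the snore-to-noise power ratio equals `snr_db`, then mix."""
    snore, noise = fix_length(snore), fix_length(noise)
    p_s = np.mean(snore**2)
    p_n = np.mean(noise**2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10**(snr_db / 10)))
    return snore + gain * noise

# Each snoring event from DS is mixed with a non-snoring event from DN at
# SNRs of -5, 0 and 5 dB, giving 3 x 3328 = 9984 noisy snoring events:
# mixtures = [mix_at_snr(snore, noise, snr) for snr in (-5, 0, 5)]
```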

2. CQT + HPSS + CNN + LSTM

Proposed CQT + HPSS + CNN (Bottom Path)

2.1. Constant-Q Transform (CQT)

  • Instead of the STFT or the Mel spectrogram, the Constant-Q Transform (CQT) is used for feature extraction.

CQT is an extension of the STFT using a time-varying window in order to compute the logarithmic frequency spectrum so that the center frequencies of the frequency bins are geometrically spaced and their Q-factors are all equal.

Frequency resolution is better for low frequencies and time resolution is better for high frequencies.

  • The database 𝐷𝑇 was resampled to 48 kHz. A Hann window is used with a hop length of 512 samples, 84 frequency bins, and 12 bins per octave.
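A minimal sketch of this CQT front-end, assuming librosa (the toolkit actually used is not stated in this review):

```python
import numpy as np
import librosa

SR = 48_000  # audio resampled to 48 kHz

def cqt_spectrogram(y, sr=SR):
    """Constant-Q magnitude spectrogram: 84 bins, 12 bins per octave, hop 512, Hann window."""
    C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84,
                    bins_per_octave=12, window='hann')
    return np.abs(C)

# y, _ = librosa.load('snore_event.wav', sr=SR, mono=True)   # hypothetical file name
# X = cqt_spectrogram(y)                                     # shape: (84, n_frames)
```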

2.2. Harmonic/Percussive Sound Source Separation (HPSS)

HPSS applies median filtering across time to attenuate percussive events and accentuate the harmonic components.

On the other hand, HPSS uses median filtering across frequency to remove harmonic events and enhance the percussive components.

  • Therefore, the harmonic-filtered spectrogram 𝐻 and the percussive-filtered spectrogram 𝑃 can be obtained as 𝐻 = M(𝑋, 𝑙ℎ), median filtering along time, and 𝑃 = M(𝑋, 𝑙𝑝), median filtering along frequency,
  • where M is the median filtering operator and 𝑋 is the magnitude of the input time-frequency representation.
  • 𝑙ℎ = 13, while 𝑙𝑝 has been adjusted to achieve a filter bandwidth approximately equal to one-sixteenth of the center frequency 𝑓𝑘 to be filtered.

Next, a harmonic soft time-frequency mask 𝑀𝐻 is computed from the two previous filtered spectrograms through Wiener filtering: 𝑀𝐻 = 𝐻^𝑝 / (𝐻^𝑝 + 𝑃^𝑝),

  • where 𝑝 refers to the exponent that is applied to every individual time-frequency filtered element (𝑝 = 2).
  • The mask 𝑀𝐻 isolates the harmonic content since it extracts the relative energy ratio of the harmonic components with respect to the entire energy of the input magnitude time-frequency representation.

As a result, a harmonic-enhanced spectrogram 𝑋𝐻 is computed by element-wise multiplication of the harmonic mask 𝑀𝐻 and the input magnitude time-frequency representation 𝑋: 𝑋𝐻 = 𝑀𝐻 ⊙ 𝑋.
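The HPSS steps above can be sketched with SciPy median filtering as follows. This is a simplification: a fixed percussive filter length lp is used here, whereas the paper adapts 𝑙𝑝 per CQT bin to roughly one-sixteenth of its center frequency, so the chosen value is an assumption; 𝑙ℎ = 13 and 𝑝 = 2 follow the text.

```python
import numpy as np
from scipy.ndimage import median_filter

def harmonic_enhance(X, lh=13, lp=17, p=2):
    """Harmonic-enhanced spectrogram X_H from a magnitude spectrogram X
    (rows = frequency bins, columns = time frames)."""
    H = median_filter(X, size=(1, lh))       # median filtering along time -> harmonic part
    P = median_filter(X, size=(lp, 1))       # median filtering along frequency -> percussive part
    M_H = H**p / (H**p + P**p + 1e-12)       # Wiener-type soft harmonic mask (p = 2)
    return M_H * X                           # element-wise masking of the input magnitudes

# X_H = harmonic_enhance(np.abs(C))          # C: CQT from the previous step
```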

CQT vs CQT + HPSS

The harmonic-enhanced spectrogram can be clearly observed to accentuate and improve the harmonic features of the snoring sound during two time intervals, from 0.3 to 1 second and from 1.5 to 3 seconds.

2.3. CNN + LSTM

  • The baseline model for the neural network architecture is based on the approach described in [15], which is composed of three CNN layers and one LSTM layer.
CNN
  • ImageNet-pretrained VGG19, MobileNet, and ResNet50 are also evaluated, wherein a trainable dense Softmax layer consisting of two units was appended after the non-trainable layers.
  • The binary cross-entropy loss function was employed.
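A minimal Keras sketch of such a 3-CNN + 1-LSTM baseline is shown below. The filter widths, LSTM size, pooling scheme, frame count, and the sigmoid output are assumptions, since the review does not report the exact configuration of [15].

```python
from tensorflow.keras import layers, models

N_BINS, N_FRAMES = 84, 329   # CQT bins; ~3.5 s at 48 kHz with hop 512 (frame count assumed)

def build_cnn_lstm(n_bins=N_BINS, n_frames=N_FRAMES):
    """Sketch of a 3-conv + 1-LSTM binary snore classifier (layer sizes assumed)."""
    inp = layers.Input(shape=(n_bins, n_frames, 1))   # harmonic-enhanced CQT spectrogram
    x = inp
    for filters in (16, 32, 64):                      # three CNN blocks (assumed widths)
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Permute((2, 1, 3))(x)                  # -> (time, freq, channels)
    x = layers.Reshape((x.shape[1], -1))(x)           # -> (time, features) sequence
    x = layers.LSTM(64)(x)                            # single LSTM layer (assumed units)
    out = layers.Dense(1, activation='sigmoid')(x)    # snore vs. non-snore
    model = models.Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# model = build_cnn_lstm()
# model.summary()
```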

3. Results

Two different scenarios are evaluated: 1) using a large dataset of snoring and interfering sounds, and 2) using a reduced version of the dataset under a limited data learning scenario.

Scenario 1

The proposed harmonic feature achieves average accuracy values of 96.5%, 93.3%, and 90.6% for the baseline model, ResNet50, and VGG19, respectively.

Scenario 2

The harmonic feature of the proposed method consistently outperforms the STFT, Mel, and CQT features for all neural network architectures evaluated in this learning scenario.

Varying Training Dataset Size

The proposed harmonic feature still achieves the best performance when evaluated with the (a) baseline model and (b) VGG19, achieving an average accuracy improvement of 1.2% and 0.9% over the CQT, respectively.
