Brief Review — Automatic snoring detection using a hybrid 1D–2D convolutional neural network

Raw Audio + 1D-CNN & VG + 2D-CNN

Sik-Ho Tsang
4 min read · Nov 23, 2024

Automatic snoring detection using a hybrid 1D–2D convolutional neural network (Raw Audio + 1D-CNN & VG + 2D-CNN), by Hangzhou Dianzi University
2023 Nature Sci Rep (Sik-Ho Tsang @ Medium)

Snore Sound Classification
2017 … 2020
[Snore-GAN] 2021 [ZCR + MFCC + PCA + SVM] [DWT + LDOP + RFINCA + kNN]

  • Non-contact data acquisition equipment is designed to record the nocturnal sleep respiratory audio of subjects in their private bedrooms.
  • A hybrid convolutional neural network (CNN) model is proposed for automatic snore detection. It consists of a one-dimensional (1D) CNN that processes the original signal and a two-dimensional (2D) CNN that processes images mapped by the visibility graph method.

Outline

  1. Dataset
  2. 1D-CNN and 2D-CNN
  3. Results

1. Dataset

Supported by the Affiliated Hospital of Hangzhou Normal University (Zhejiang, China), 88 individuals between 12 and 81 years old (23 females and 65 males) were recorded from March 2019 to December 2019.

  • A portable PSG and high-fidelity sound acquisition equipment were used to record respiratory sounds during overnight sleep in the home environment, which a medical professional used to diagnose OSAHS in the subjects.

In total, 5441 sound episodes were chosen, comprising 3384 snoring segments and 2057 non-snoring ones. Snoring episodes last approximately 1000 ms on average, and non-snoring episodes about 3000 ms.

2. 1D-CNN and 2D-CNN

2.1. Preprocessing

Framing: the audio signal is split into fixed-length fragments using a sliding window.

  • The time window is set to 0.3 s.
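
The framing step can be sketched as below. This is a minimal illustration, not the authors' code: the function name `frame_signal`, the 16 kHz sampling rate, and the non-overlapping default hop are assumptions. The review does not state the sampling rate or the hop size, though 0.3 s at 16 kHz gives 4800 samples, which would be consistent with the 4800 × 4800 image size mentioned in Section 2.2.

```python
def frame_signal(samples, sample_rate, window_s=0.3, hop_s=None):
    """Split a 1-D sequence of samples into fixed-length frames.

    window_s is the 0.3 s window from the text; hop_s is an assumption
    (defaults to the window length, i.e. no overlap).
    """
    win = int(round(window_s * sample_rate))
    hop = win if hop_s is None else int(round(hop_s * sample_rate))
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    frames = []
    # Slide the window; a trailing partial frame is dropped.
    for start in range(0, len(samples) - win + 1, hop):
        frames.append(samples[start:start + win])
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
audio = [0.0] * 16000
frames = frame_signal(audio, sample_rate=16000)
print(len(frames), len(frames[0]))
```

Passing a smaller `hop_s` (for example 0.15) produces overlapping frames; whether the paper overlaps its windows is not stated here.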

2.2. Visibility Graph (VG)

  • Research has shown that the mapped visibility graph inherits several properties of the series in its structure: a periodic series is converted into a regular graph, a random series into a random graph, and a fractal series into a scale-free graph.

The criterion for image mapping by VG is as follows: any two data points (ta, xa) and (tb, xb) in the time series {ti, xi} (i = 1, 2, …) are visible to each other, and the pixel at the corresponding position (ta, tb) of the mapped image is set to 1, if every intermediate point (tc, xc) with ta < tc < tb satisfies:

$$x_c < x_b + (x_a - x_b)\,\frac{t_b - t_c}{t_b - t_a}$$

  • The mapped image is a binary matrix whose element is 1 if the two corresponding nodes are mutually visible, and 0 otherwise.
  • Each sound clip was mapped into an image with a resolution of 4800 × 4800. Before being fed into the CNN model, these images were resized to 256 × 256 to reduce computation.
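
The mapping from a frame to a binary image can be sketched as follows. This is an illustrative implementation of the standard natural visibility criterion, not the authors' code. The function name `visibility_image` is hypothetical, and samples are assumed to be uniformly spaced so that t_i = i. It uses the fact that point b is visible from a exactly when the slope from a to b exceeds the slope from a to every point in between.

```python
def visibility_image(x):
    """Return an n x n binary matrix; entry [a][b] is 1 if samples a and b
    are mutually visible under the natural visibility criterion.

    Assumes uniformly spaced samples (t_i = i).
    """
    n = len(x)
    img = [[0] * n for _ in range(n)]
    for a in range(n - 1):
        max_slope = float("-inf")  # steepest slope from a to any earlier point
        for b in range(a + 1, n):
            slope = (x[b] - x[a]) / (b - a)
            if slope > max_slope:
                # No intermediate point rises above the line a -> b.
                img[a][b] = img[b][a] = 1
                max_slope = slope
    return img


for row in visibility_image([1, 3, 2, 4]):
    print(row)
```

In this toy series the peak at index 1 blocks the view from index 0 to indices 2 and 3, so only neighbouring pairs and the pair (1, 3) are connected. A 4800-sample frame yields about 23 million entries, so a vectorized implementation would likely be needed in practice; the resize to 256 × 256 is not shown.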

2.3. 1D-CNN and 2D-CNN

The 1D CNN processes the raw audio signal, the 2D CNN analyzes the mapped images, and two fully connected layers act as the classifier.

1D-CNN

There are 4 trainable convolutional layers, interlaced with batch normalization layers and max pooling layers in this architecture.

  • Finally, the output of the last batch normalization layer is flattened by an average pooling layer and concatenated with the features extracted by the 2D CNN.

2D-CNN

The Inception module, introduced in Inception-v1 (GoogLeNet), is used to build the 2D-CNN model.

  • Finally, two fully connected layers classify the concatenated features as snore or non-snore sounds.
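
The fusion step of the two branches can be illustrated with plain Python lists. This is only a sketch of the idea: it assumes the "average pooling layer" is a global average over the time axis of each channel, and the names `global_average_pool` and `fuse` are hypothetical. Layer sizes and the actual convolutional branches are not given in this review and are not modelled here.

```python
def global_average_pool(feature_map):
    """Collapse each channel (a list of values over time) to its mean.

    Assumption: the 1D branch is flattened by averaging over time.
    """
    return [sum(channel) / len(channel) for channel in feature_map]


def fuse(features_1d, features_2d):
    """Concatenate the two branch outputs into one feature vector,
    which would then be passed to the fully connected classifier."""
    return list(features_1d) + list(features_2d)


# Toy values: 2 channels x 3 time steps from the 1D branch,
# and a 1-element feature vector from the 2D branch.
branch_1d = global_average_pool([[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]])
fused = fuse(branch_1d, [0.5])
print(fused)
```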

3. Results

The CNN model converges gradually as the number of iterations increases and stabilizes after about 70 iterations. To avoid over-fitting, training was halted at 90 iterations.

Confusion Matrix
Performance
  • 5-fold cross validation is used.
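
A 5-fold split over the 5441 episodes could look like the sketch below. It assumes a random split at the episode level with a fixed seed; the review does not say whether the paper splits by episode or by subject, nor whether folds are stratified. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: items are shuffled once and dealt round-robin into folds.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for fold, (train, test) in enumerate(k_fold_indices(5441), start=1):
    print(fold, len(train), len(test))
```

Each episode appears in exactly one test fold, so every episode is used for evaluation once and for training four times.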

The proposed model achieved accuracy ranging from 86.4 to 91.2%, sensitivity between 88.1 and 91.6%, specificity between 83.0 and 91.8%, and AUC between 0.908 and 0.973.
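
For reference, accuracy, sensitivity and specificity follow directly from the confusion matrix of each fold. The sketch below uses the standard definitions with snoring as the positive class; the function name `binary_metrics` and the counts in the example are made up for illustration and are not results from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn: snore episodes classified correctly/incorrectly.
    tn/fp: non-snore episodes classified correctly/incorrectly.
    """
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity


# Illustrative counts only, not taken from the paper.
acc, sens, spec = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```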

Comparisons
  • The three comparison models performed worse when trained on the proposed dataset than when trained on their own datasets, as the proposed dataset is more diverse and has a lower signal-to-noise ratio.

The proposed method achieved better performance in terms of accuracy, sensitivity, specificity and AUC.
