Brief Review — A lightweight hybrid deep learning system for cardiac valvular disease classification

Augmented Sound Dataset + CNN-LSTM

Sik-Ho Tsang
Nov 20, 2023
Overall Framework

A lightweight hybrid deep learning system for cardiac valvular disease classification, by Yarmouk University
2022 Nature Sci. Rep., Over 20 Citations (Sik-Ho Tsang @ Medium)

Heart Sound Classification
2022 [CirCor] 2023 [2LSTM+3FC, 3CONV+2FC] [NRC-Net]

  • A combined CNN and LSTM model is proposed for 5-class phonocardiogram (PCG) signal classification, which utilizes either augmented or non-augmented datasets.


  1. Datasets, Preprocessing & Data Augmentation
  2. Proposed CNN-LSTM & FFT-CNN-LSTM
  3. Results

1. Datasets, Preprocessing & Data Augmentation

1.1. Datasets

GitHub Dataset & PhysioNet/CinC Challenge 2016
  • The model was trained on the publicly available open heart sounds GitHub Dataset, which contains 1000 recordings in 5 classes, with 200 recordings per class.
  • The PhysioNet/CinC Challenge 2016 dataset was used as a second dataset to further evaluate the proposed model. It contains normal and abnormal classes only.
  • Some examples from the 2 datasets are shown above.

1.2. Preprocessing

Frequency Distribution
  • Fourier transform of PCG signals was clipped to contain only 350 Hz from the 4000 Hz spectrum.
  • Each PCG record in the first dataset is downsampled by a factor of 8, and each PCG record in the second dataset is downsampled by a factor of 2.
  • Therefore, the highest frequency content is 500 Hz in all heart conditions, as shown above.
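
The sampling-rate arithmetic behind this step can be sketched as below. The original sampling rates of 8000 Hz (GitHub Dataset) and 2000 Hz (PhysioNet/CinC Challenge 2016) are assumptions, since they are not stated in this review, and the block averaging is only a crude stand-in for a proper anti-aliasing filter.

```python
def nyquist_after_downsampling(fs_hz, factor):
    """Highest representable frequency after reducing the rate by `factor`."""
    return fs_hz / factor / 2.0


def downsample(signal, factor):
    """Average each block of `factor` samples (crude low-pass), one output per block."""
    n_blocks = len(signal) // factor
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


# Assumed original sampling rates (not stated in this review).
print(nyquist_after_downsampling(8000, 8))  # 500.0
print(nyquist_after_downsampling(2000, 2))  # 500.0
print(len(downsample([0.0] * 8000, 8)))     # 1000 samples per second of audio
```

Under these assumptions both datasets end up at 1000 Hz, so the highest frequency either can represent is 500 Hz.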

1.3. Data Augmentation

GitHub Dataset After Data Augmentation

Similar to images, there are several techniques to augment audio signals, and these techniques are usually applied to the raw audio signals.

  1. Time stretch: randomly slow down or speed up the sound.
  2. Time shift: shift audio to the left or the right by a random amount.
  3. Add noise: add some random values to the sound.
  4. Control volume: randomly increase or decrease the volume of the audio.
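
The four techniques above can be sketched on a raw signal stored as a plain Python list. This is a minimal illustration, not the paper's implementation: all parameter ranges are hypothetical, the shift is circular, and the stretch is a simple resampling that also changes pitch.

```python
import random


def time_stretch(x, rate):
    """Resample by linear interpolation; rate > 1 speeds up (shorter output)."""
    n_out = max(1, int(len(x) / rate))
    out = []
    for i in range(n_out):
        pos = min(i * rate, len(x) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append(x[lo] * (1 - frac) + x[hi] * frac)
    return out


def time_shift(x, shift):
    """Circularly shift right by `shift` samples (negative values shift left)."""
    if not x:
        return list(x)
    k = shift % len(x)
    return x[-k:] + x[:-k] if k else list(x)


def add_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def change_volume(x, gain):
    """Scale the amplitude by `gain`."""
    return [v * gain for v in x]


def augment(x, rng):
    """Apply all four augmentations with hypothetical parameter ranges."""
    x = time_stretch(x, rng.uniform(0.9, 1.1))
    max_shift = len(x) // 10
    x = time_shift(x, rng.randint(-max_shift, max_shift))
    x = add_noise(x, 0.005, rng)
    return change_volume(x, rng.uniform(0.8, 1.2))


augmented = augment([0.0] * 100, random.Random(0))
print(len(augmented))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing runs on augmented and non-augmented data.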




2. Proposed CNN-LSTM & FFT-CNN-LSTM

In brief, deep feature extraction and selection from the PCG signals are handled by CNN blocks, particularly the 1D convolutional layers, the batch normalization layers, the ReLU layers, and the max-pooling layers.

Utilizing the LSTM component produces a richer and more concentrated model than pure CNN models, resulting in higher performance with fewer parameters.
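
To get a feel for how small such a model can be, the trainable parameters of each layer type can be counted with the standard formulas. The layer sizes below are hypothetical and chosen only for illustration; they are not the paper's configuration. The LSTM formula assumes one bias vector per gate (some frameworks, such as PyTorch, keep two).

```python
def conv1d_params(in_ch, out_ch, kernel):
    """Kernel weights plus one bias per output channel."""
    return (in_ch * kernel + 1) * out_ch


def batchnorm_params(ch):
    """Trainable scale and shift per channel."""
    return 2 * ch


def lstm_params(input_dim, hidden):
    """Four gates, each with input weights, recurrent weights, and one bias."""
    return 4 * (hidden * (input_dim + hidden) + hidden)


def dense_params(in_dim, out_dim):
    """Weights plus one bias per output unit."""
    return (in_dim + 1) * out_dim


# Hypothetical layer sizes, for illustration only.
total = (
    conv1d_params(1, 16, 5) + batchnorm_params(16)
    + conv1d_params(16, 32, 5) + batchnorm_params(32)
    + lstm_params(32, 64)
    + dense_params(64, 5)  # 5 output classes
)
print(total)  # 27941
```

ReLU and max-pooling layers add no trainable parameters, so they do not appear in the sum.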


  • When the FFT of the signal is used as input, the model becomes an FFT-CNN-LSTM model.
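
A frequency-domain input of this kind can be sketched as a magnitude spectrum truncated to the low-frequency bins. The naive DFT below is used only to keep the example self-contained; a library FFT would be used in practice. The 350 Hz cut-off comes from Section 1.2, while the 1000 Hz sampling rate, the frame length, and the use of magnitudes as features are assumptions.

```python
import cmath
import math


def magnitude_spectrum(x):
    """Naive O(n^2) DFT magnitudes for bins 0..n//2."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def clip_spectrum(mags, fs_hz, n, max_hz):
    """Keep only bins whose frequency k * fs / n is at most max_hz."""
    return [m for k, m in enumerate(mags) if k * fs_hz / n <= max_hz]


fs = 1000  # Hz, assumed rate after downsampling
n = 200    # short hypothetical frame so the naive DFT stays fast
tone = [math.sin(2 * math.pi * 100 * t / fs) for t in range(n)]  # 100 Hz test tone

features = clip_spectrum(magnitude_spectrum(tone), fs, n, 350)
peak_hz = max(range(len(features)), key=features.__getitem__) * fs / n
print(len(features), peak_hz)  # 71 100.0
```

With a 5 Hz bin spacing, clipping at 350 Hz keeps 71 of the 101 bins, and the test tone shows up at the expected 100 Hz bin.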

3. Results

3.1. Non-Augmented Data vs Augmented Data

  • 10-fold cross-validation is used.
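
For reference, a 10-fold split over the 1000 recordings of the GitHub Dataset can be sketched as below. This is a plain shuffled split; whether the paper stratifies the folds by class is not covered in this review, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


splits = list(k_fold_indices(1000, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 900 100
```

The reported accuracy would then be the average over the 10 held-out folds.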

For the non-augmented data, the accuracy was 98.5%.
For the augmented data, the accuracy was 99.87%.
For the binary dataset (PhysioNet/CinC Challenge 2016), the accuracy was 93.77%.

  • (Please read the paper directly for more experimental results.)

3.2. SOTA Comparisons

SOTA Comparisons on GitHub Dataset

The proposed architecture outperforms all compared models on all important performance metrics. Its accuracy is 99.87%, which is 0.27 percentage points higher than that of the second-best model, built by Shuvo et al. in 2021.

SOTA Comparisons on PhysioNet/CinC Challenge 2016

The new system outperformed the previous state-of-the-art models on all performance metrics. The obtained accuracy is about 6.45 percentage points higher than the 87.31% accuracy reported by Alkhodari et al. in 2021.

3.3. Time Measurement

Time Measurement

The results show that the model is lightweight enough to be deployed on embedded systems.
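
A per-recording latency of this kind can be measured with a small timing harness such as the one below. It is only a sketch: `dummy_predict` is a hypothetical stand-in for a trained model's inference call, and the numbers it prints say nothing about the paper's model.

```python
import time


def mean_latency_ms(predict, inputs, repeats=5):
    """Average wall-clock time per input, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            predict(x)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(inputs))


def dummy_predict(x):
    # Hypothetical stand-in for a trained model's inference call.
    return sum(v * v for v in x)


recordings = [[0.0] * 1000 for _ in range(20)]
print(f"{mean_latency_ms(dummy_predict, recordings):.4f} ms per recording")
```

Repeating the loop several times and averaging smooths out one-off delays, which matters on small embedded boards.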


