Brief Review — Heart sound classification based on log Mel-frequency spectral coefficients features and convolutional neural networks


Sik-Ho Tsang
Dec 7, 2023

Heart sound classification based on log Mel-frequency spectral coefficients features and convolutional neural networks, by Yunnan University, Fuwai Yunnan Cardiovascular Hospital, Kunming Medical University
2021 J. BSPC, Over 30 Citations (Sik-Ho Tsang @ Medium)

Heart Sound Classification
2022 [CirCor Dataset] [CNN-LSTM] [DsaNet] [Modified Xception] [Improved MFCC+Modified ResNet] [Learnable Features + VGGNet/EfficientNet] 2023 [2LSTM+3FC, 3CONV+2FC] [NRC-Net]

Overall Framework
  • First, the heart sound signals are denoised using a wavelet-based algorithm. Then, an improved duration-dependent hidden Markov model (DHMM) is used to segment each heart sound signal into cardiac cycles.
  • Next, a dynamic frame length method is used to extract log Mel-frequency spectral coefficient (MFSC) features from each cardiac cycle.
  • Afterward, a convolutional neural network (CNN) classifies the MFSC features. Finally, a majority vote algorithm combines these predictions into the final prediction.
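The review does not spell out the dynamic frame length method. One plausible reading, which is an assumption and not confirmed by the source, is that the frame length is chosen per cardiac cycle so that every cycle, whatever its duration, is split into the same number of frames M with 50% overlap. With frame length w and hop w/2, M frames span w + (M − 1)·w/2 samples, so a cycle of L samples gives w = 2L/(M + 1). A minimal sketch under that assumption (`dynamic_frame_length` and `frame_cycle` are hypothetical names):

```python
import numpy as np

def dynamic_frame_length(cycle_len, num_frames):
    """Hypothetical: frame length w such that num_frames frames with 50% overlap fit one cycle.

    From L = w + (M - 1) * w / 2  =>  w = 2L / (M + 1), rounded down to an even
    number so that the hop w / 2 is an integer.
    """
    w = (2 * cycle_len) // (num_frames + 1)
    w -= w % 2
    if w < 2:
        raise ValueError("cycle too short for the requested number of frames")
    return w

def frame_cycle(cycle, num_frames):
    """Split one cardiac cycle into num_frames frames with 50% overlap (assumed scheme)."""
    cycle = np.asarray(cycle, dtype=float)
    w = dynamic_frame_length(len(cycle), num_frames)
    hop = w // 2
    starts = hop * np.arange(num_frames)
    # The last frame ends at hop * (M - 1) + w = w * (M + 1) / 2 <= L, so every slice is full length.
    return np.stack([cycle[s:s + w] for s in starts])
```

At 5000 Hz with M = 20, a 0.6 s cycle (3000 samples) gets 284-sample frames and a 1.0 s cycle (5000 samples) gets 476-sample frames, yet both produce 20 frames, which is what would let every cycle map to a feature map of the same size.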


  1. Dataset, Preprocessing & Segmentation
  2. Feature Extraction, CNN & Majority Vote
  3. Results

1. Dataset, Preprocessing & Segmentation

  • The dataset was collected by the authors. A total of 1800 heart sound cases are used, covering four classes: patent ductus arteriosus (PDA), ventricular septal defect (VSD), atrial septal defect (ASD), and normal (N).
  • The sampling frequency is 5000 Hz.
  • A 5-level wavelet decomposition with the db2 wavelet is used for denoising.
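The review only states that a 5-level db2 decomposition is used. The sketch below shows what such denoising typically looks like; the thresholding rule (soft thresholding of all detail coefficients with Donoho's universal threshold, noise level estimated from the finest detail band) is an assumption, not the paper's documented choice. To stay dependency-free it implements a periodized db2 DWT directly in NumPy instead of calling a wavelet library such as PyWavelets.

```python
import numpy as np

# db2 (4-tap Daubechies) orthonormal low-pass decomposition filter.
_SQ3 = np.sqrt(3.0)
DB2_LO = np.array([1 + _SQ3, 3 + _SQ3, 3 - _SQ3, 1 - _SQ3]) / (4 * np.sqrt(2.0))
# Quadrature-mirror high-pass filter: g[n] = (-1)^n * h[L - 1 - n].
DB2_HI = np.array([(-1) ** n * DB2_LO[3 - n] for n in range(4)])

def _windows(n_out, n_in):
    """Index matrix for periodized filtering followed by downsampling by 2."""
    return (2 * np.arange(n_out)[:, None] + np.arange(4)[None, :]) % n_in

def dwt_step(x):
    """One DWT level: returns (approximation, detail), each half the input length."""
    idx = _windows(len(x) // 2, len(x))
    return x[idx] @ DB2_LO, x[idx] @ DB2_HI

def idwt_step(a, d):
    """Inverse of dwt_step (transpose of the orthogonal analysis operator)."""
    n = 2 * len(a)
    x = np.zeros(n)
    np.add.at(x, _windows(len(a), n), np.outer(a, DB2_LO) + np.outer(d, DB2_HI))
    return x

def wavedec(x, level):
    """Multi-level decomposition; details are returned finest first."""
    a, details = np.asarray(x, dtype=float), []
    for _ in range(level):
        a, d = dwt_step(a)
        details.append(d)
    return a, details

def waverec(a, details):
    """Reconstruct from wavedec output, starting from the coarsest detail."""
    for d in reversed(details):
        a = idwt_step(a, d)
    return a

def wavelet_denoise(signal, level=5):
    """Assumed scheme: soft-threshold every detail band with the universal threshold."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    x = np.pad(x, (0, (-n) % 2 ** level), mode="edge")  # length must divide by 2**level
    if len(x) < 2 ** (level + 1):
        raise ValueError("signal too short for this decomposition level")
    a, details = wavedec(x, level)
    sigma = np.median(np.abs(details[0])) / 0.6745      # robust noise estimate
    thr = sigma * np.sqrt(2 * np.log(len(x)))           # universal threshold
    details = [np.sign(d) * np.maximum(np.abs(d) - thr, 0.0) for d in details]
    return waverec(a, details)[:n]
```

Because the periodized transform is orthogonal, reconstruction without thresholding returns the input exactly; only the shrinkage of detail coefficients changes the signal.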
Some Segmentation Examples
  • An improved duration-dependent hidden Markov model (improved DHMM) is used to segment the signal into cardiac cycles.
  • (Please read the paper directly for more details of segmentations.)

2. Feature Extraction, CNN & Majority Vote

2.1. Feature Extraction

Some MFSC Examples
  • Framing: The signal is divided into short frames, with a 50% overlap between consecutive frames to ensure a smooth transition. The total number of frames after framing is M.
  • MFSC: Mel-Frequency Spectral Coefficients (MFSC) are a variant of MFCC that omits the final Discrete Cosine Transform (DCT) step, i.e., they are the log Mel filterbank energies of each frame.
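As a rough illustration of the MFSC computation described above, the sketch takes a set of already-framed samples (shape M × frame length) and returns an M × n_filters log Mel matrix. The Hamming window, the HTK-style Mel formula, the FFT size (next power of two), and the default of 32 filters are all assumptions; the review does not give these settings. The sampling rate defaults to the 5000 Hz stated for the dataset.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs, f_min=0.0, f_max=None):
    """Triangular filters equally spaced on the Mel scale, evaluated at the rFFT bin frequencies."""
    f_max = fs / 2 if f_max is None else f_max
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, center, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (center - lo)
        falling = (hi - freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mfsc(frames, fs=5000, n_filters=32, eps=1e-10):
    """Log Mel filterbank energies per frame (MFSC); assumed parameters, see lead-in."""
    frames = np.asarray(frames, dtype=float)
    w = frames.shape[1]
    n_fft = 1 << (w - 1).bit_length()                  # next power of two >= w
    spectrum = np.fft.rfft(frames * np.hamming(w), n=n_fft, axis=1)
    power = np.abs(spectrum) ** 2 / n_fft              # (M, n_fft // 2 + 1)
    energies = power @ mel_filterbank(n_filters, n_fft, fs).T
    # Stop at the log: applying a DCT here would turn MFSC into MFCC.
    return np.log(energies + eps)                      # (M, n_filters)
```

Skipping the DCT keeps neighbouring Mel bands correlated and spatially ordered, which is one common argument for feeding MFSC rather than MFCC into a 2-D CNN.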

2.2. CNN

CNN Network Architecture
  • Each heart sound recording lasts 20 s, so 20 to 33 MFSC feature maps (roughly one per cardiac cycle) are obtained for each volunteer after segmentation.
  • The CNN consists of 4 convolutional layers, 3 max-pooling layers, and 2 fully connected (FC) layers, followed by a softmax layer. ReLU is used as the activation function.
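The layer counts above can be made concrete with an untrained NumPy forward pass. Everything beyond the counts is hypothetical: the 3×3 kernels, the channel widths (8, 16, 32, 32), the 64 hidden units, the single-channel 32×32 input, and placing the three pooling layers after the first three convolutions. The sketch only shows how the stack fits together, not the paper's configuration or weights.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, w, b):
    """'Same'-padded 2-D cross-correlation. x: (C_in, H, W); w: (C_out, C_in, k, k)."""
    p = w.shape[-1] // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    win = sliding_window_view(xp, w.shape[-2:], axis=(1, 2))    # (C_in, H, W, k, k)
    return np.einsum("chwij,ocij->ohw", win, w) + b[:, None, None]

def max_pool2(x):
    """2x2 max pooling with stride 2 (odd trailing rows/cols are dropped)."""
    c, h, w = x.shape
    x = x[:, : h // 2 * 2, : w // 2 * 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(num_classes, in_hw=(32, 32), channels=(8, 16, 32, 32), hidden=64, seed=0):
    """Random He-style initialisation for the hypothetical 4-conv / 2-FC network."""
    rng = np.random.default_rng(seed)
    params, c_in = [], 1
    for c_out in channels:
        params.append((rng.normal(0, np.sqrt(2 / (9 * c_in)), (c_out, c_in, 3, 3)), np.zeros(c_out)))
        c_in = c_out
    flat = c_in * (in_hw[0] // 8) * (in_hw[1] // 8)            # after three 2x2 pools
    params.append((rng.normal(0, np.sqrt(2 / flat), (hidden, flat)), np.zeros(hidden)))
    params.append((rng.normal(0, np.sqrt(1 / hidden), (num_classes, hidden)), np.zeros(num_classes)))
    return params

def forward(x, params):
    """x: (1, H, W) MFSC map -> class probabilities."""
    convs, (w1, b1), (w2, b2) = params[:4], params[4], params[5]
    for i, (w, b) in enumerate(convs):
        x = relu(conv2d_same(x, w, b))
        if i < 3:                                               # assumed: pool after conv1..conv3
            x = max_pool2(x)
    h = relu(w1 @ x.ravel() + b1)                               # FC1 + ReLU
    return softmax(w2 @ h + b2)                                 # FC2 + softmax
```

With `init_params(4)` the output is a 4-way distribution for the PDA/VSD/ASD/N task; `init_params(2)` gives the 2-class variant.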

2.3. Majority Voting Algorithm

  • The heart sound signal is quasi-periodic, so the differences between cardiac cycles are minimal.
  • However, non-pathological noise (such as breath sounds and friction sounds) is inevitably introduced during collection, and it can cause the classifier to misclassify individual cycles.
  • A majority voting algorithm is therefore used to determine the final classification result for each individual. The process is:
  1. Count the length of the sequence of classification results.
  2. Traverse the sequence and count the number of occurrences of each element.
  3. Sort the elements by occurrence frequency using bubble sort.
  4. Take the most frequent element as the final classification result.
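The four steps above translate almost line by line into Python. This sketch mirrors them, including the bubble sort, although `collections.Counter(predictions).most_common(1)` would be the idiomatic shortcut. The review does not say how ties are broken; here (as an assumption) ties go to the label that appears first, because the bubble sort only swaps on a strictly smaller count and is therefore stable.

```python
def majority_vote(predictions):
    """Combine per-cycle labels into one label for the recording."""
    # Step 1: length of the sequence of classification results.
    n = len(predictions)
    if n == 0:
        raise ValueError("no predictions to vote on")
    # Step 2: traverse the sequence and count occurrences (dicts keep first-seen order).
    counts = {}
    for label in predictions:
        counts[label] = counts.get(label, 0) + 1
    # Step 3: bubble sort by count, descending; strict '<' keeps equal counts in order.
    items = list(counts.items())
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if items[j][1] < items[j + 1][1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    # Step 4: the most frequent label is the final result.
    return items[0][0]
```

For example, `majority_vote(["N", "VSD", "N", "ASD", "N"])` returns `"N"`, so a single cycle corrupted by breath or friction noise does not flip the decision.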

3. Results

3.1. Dataset Split

Dataset Split
  • After segmentation, many more samples are obtained, since each recording is split into multiple cardiac cycles.

3.2. 2-Class Results

2-Class Results

3.3. 4-Class Results

4-Class Results

3.4. SOTA Comparison on 2-Class Problem

SOTA Comparison on 2-Class Problem

The proposed algorithm outperforms the compared algorithms in sensitivity, specificity, and accuracy.
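For reference, the three metrics have standard definitions in terms of confusion-matrix counts: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + TN + FP + FN). A small helper, assuming that the abnormal class is labelled `1` and treated as positive (the review does not state the encoding):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Sensitivity, specificity and accuracy for a 2-class problem."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # recall on abnormal cases
        "specificity": tn / (tn + fp) if tn + fp else nan,  # recall on normal cases
        "accuracy": (tp + tn) / len(pairs),
    }
```

Reporting sensitivity and specificity alongside accuracy matters when the normal and abnormal classes are imbalanced, since accuracy alone can hide a poor detection rate on the minority class.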


