Brief Review — CardioXNet: A Novel Lightweight Deep Learning Framework for Cardiovascular Disease Classification Using Heart Sound Recordings

CardioXNet, End-to-End Learning Framework for Heart Sound Classification

Sik-Ho Tsang
5 min read · Nov 13, 2023

CardioXNet: A Novel Lightweight Deep Learning Framework for Cardiovascular Disease Classification Using Heart Sound Recordings, by Bangladesh University of Engineering and Technology, King Saud University, Taiz University
2021 IEEE Access, Over 80 Citations (Sik-Ho Tsang @ Medium)

Heart Sound Classification
2013 … 2020 [1D-CNN] [WaveNet] [Power Features+KNN] 2023 [2LSTM+3FC, 3CONV+2FC] [NRC-Net]

  • A novel lightweight end-to-end CRNN architecture, namely CardioXNet, is proposed for automatic detection of 5 classes of cardiac auscultation, namely normal, aortic stenosis, mitral stenosis, mitral regurgitation and mitral valve prolapse, using raw PCG signals.
  • It consists of 2 learning phases: representation learning and sequence residual learning.


  1. PCG Datasets
  2. CardioXNet
  3. Results

1. PCG Datasets

1.1. GitHub PCG Dataset [18]

Waveform for each of the disease classes.
  • The PCG recordings used in [18] serve as the primary dataset.
  • It contains a total of 1000 PCG recordings in .wav format in 5 different classes, i.e., Normal (N), Aortic stenosis (AS), Mitral regurgitation (MR), Mitral stenosis (MS), Mitral valve prolapse (MVP).
  • Each class has 200 recordings of roughly 3s each. All the recordings are sampled at 8 kHz. Since the shortest recording in this dataset is 1.125s long, all the recordings are truncated to their first 1.125s.

1.2. PhysioNet/CinC Challenge 2016 dataset [39]

  • For further validation, [39] is used as a secondary dataset.
  • It contains a total of 3240 PCG recordings within 6 separately labeled datasets. 7 different research groups collected the PCG signals, in either clinical or non-clinical settings.
  • The recordings are originally sampled at 2 kHz and have varying duration (5s-120s).
  • Nevertheless, this dataset does not contain disease-based annotation and contains only 2 classes (normal, abnormal), where the normal annotated 2575 recordings refer to healthy subjects with no valvular defects and the remaining 665 abnormal recordings indicate pathological cases like arrhythmia, coronary heart disease, valvular stenosis, mitral regurgitation etc.
  • For synchronization with [18], only the first 1.125s of each PCG recording in the PhysioNet dataset is used as well.

1.3. Preprocessing

  • All the PCG signals are resampled to 2 kHz, which preserves the important heart sound frequency components while lowering the computational cost.
  • The signals are also amplitude normalized.
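The truncation, resampling and normalization steps above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' pipeline: the function names are hypothetical, the block-averaging downsampler is a crude stand-in for a proper anti-aliased resampler, and peak normalization is an assumption, since this review does not state which amplitude normalization the paper uses.

```python
import math


def truncate(signal, fs_hz, duration_s=1.125):
    """Keep only the first `duration_s` seconds of a recording."""
    n_keep = int(round(fs_hz * duration_s))
    return signal[:n_keep]


def decimate_by_averaging(signal, factor):
    """Downsample by averaging each block of `factor` samples.

    Assumption: a crude stand-in for a proper anti-aliased resampler.
    """
    n_blocks = len(signal) // factor
    return [
        sum(signal[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


def peak_normalize(signal):
    """Scale the signal so that its largest absolute value is 1.

    Assumption: the exact normalization used in the paper is not stated here.
    """
    peak = max(abs(x) for x in signal)
    if peak == 0:
        return list(signal)
    return [x / peak for x in signal]


def preprocess(signal, fs_in_hz=8000, fs_out_hz=2000, duration_s=1.125):
    """Truncate, downsample to `fs_out_hz`, then amplitude-normalize."""
    if fs_in_hz % fs_out_hz != 0:
        raise ValueError("this sketch only handles integer downsampling factors")
    clipped = truncate(signal, fs_in_hz, duration_s)
    downsampled = decimate_by_averaging(clipped, fs_in_hz // fs_out_hz)
    return peak_normalize(downsampled)


# Synthetic stand-in for a 3 s recording sampled at 8 kHz (not real PCG data).
fake_recording = [0.3 * math.sin(2 * math.pi * 50 * n / 8000) for n in range(24000)]
processed = preprocess(fake_recording)
print(len(processed))  # 1.125 s at 2 kHz gives 2250 samples
```

With these settings, every recording ends up as a fixed-length vector of 2250 samples, which is what lets recordings from both datasets share one network input size.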

2. CardioXNet

CardioXNet Architecture Overview

2.1. Representation Learning

It consists of 3 parallel CNN pathways namely, Frequency Feature Extractor (FFE), Pattern Extractor (PE) and Adaptive Feature Enhancer (AFE) [43], [48].

2.1.1. FFE & PE

FFE and PE
  • FFE consists of 4 1D convolutional layers and 2 max-pooling layers with the primary filter size of sampling frequency (Fs)×4 and stride size set to Fs/2 for the 1D convolutional (conv1) layer to capture the frequency components.
  • Similar to FFE, PE also consists of 4 1D convolutional layers and 2 max-pooling layers. However, it uses fine-grained convolution, with filter size and stride set to Fs/2 and Fs/16, respectively.
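To get a feel for how differently the two pathways look at the same signal, here is a rough sketch of the filter and stride sizes and the resulting output lengths. It assumes Fs = 2000 Hz (the resampling rate from Section 1.3), a 2250-sample input (1.125s at 2 kHz), and standard "same"/"valid" padding arithmetic. The padding mode and the helper names are my assumptions, not details confirmed by the paper.

```python
import math


def first_layer_hyperparams(fs_hz):
    """Filter size and stride of each pathway, as described in the text."""
    return {
        "FFE": {"kernel": fs_hz * 4, "stride": fs_hz // 2},
        "PE": {"kernel": fs_hz // 2, "stride": fs_hz // 16},
    }


def conv1d_output_length(n_in, kernel, stride, padding="same"):
    """Output length of a 1D convolution for the two common padding modes."""
    if padding == "same":
        return math.ceil(n_in / stride)
    if padding == "valid":
        if kernel > n_in:
            return 0  # the filter does not fit into the input at all
        return (n_in - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


FS_HZ = 2000      # assumption: Fs equals the resampling rate
N_SAMPLES = 2250  # 1.125 s at 2 kHz

params = first_layer_hyperparams(FS_HZ)
for name, p in params.items():
    n_out = conv1d_output_length(N_SAMPLES, p["kernel"], p["stride"])
    print(name, p, "-> output length with 'same' padding:", n_out)
```

Under these assumptions, the FFE filter (8000 samples) is longer than the whole input, so it only makes sense with padding and yields a handful of coarse outputs, while the PE filter slides in much smaller steps and keeps far more temporal detail. The actual values in the paper may differ.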

2.1.2. AFE

  • The input sequences are reshaped into a 2D tensor and fed to AFE. AFE consists of 2D convolutional layers, batch normalization, max-pooling layers and squeeze-expansion layers, inspired by the Fire module of SqueezeNet architecture.

Outputs from 3 CNN paths are concatenated together and forwarded to the sequence residual learning part.

2.2. Sequence Residual Learning

  • Sequence residual learning is trained to extract the temporal information from the sequence of extracted features in the representation learning part.
  • Two layers of bi-LSTMs have been employed to learn temporal information which enables the encoding of both past and future information.
  • A skip connection (as in ResNet) is employed to combine the temporal information with the features previously extracted by the CNNs. (The authors write “addition” here. Yet, judging from the next sentence and Figure 1, it should be “concatenation”.)
  • The concatenated feature vector is fed into a prediction layer with probability nodes, calculated by the softmax function.
  • ReLU is used for all convolutions. Dropout is also used.
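The difference between an additive and a concatenating skip connection, followed by a softmax prediction layer, can be illustrated with plain lists. This is a toy sketch: the feature values, vector sizes and function names are made up, and it does not reproduce the actual layer shapes of CardioXNet.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def skip_add(cnn_features, lstm_features):
    """ResNet-style skip: element-wise sum, both inputs must have equal length."""
    if len(cnn_features) != len(lstm_features):
        raise ValueError("addition needs feature vectors of equal length")
    return [c + l for c, l in zip(cnn_features, lstm_features)]


def skip_concat(cnn_features, lstm_features):
    """Concatenating skip: the two feature vectors are placed side by side."""
    return list(cnn_features) + list(lstm_features)


# Made-up feature vectors standing in for the CNN and bi-LSTM outputs.
cnn_features = [0.2, -1.0, 0.5, 0.7]
lstm_features = [1.0, 0.0, -0.5, 0.3]

added = skip_add(cnn_features, lstm_features)      # same length as the inputs
merged = skip_concat(cnn_features, lstm_features)  # lengths add up

# Made-up logits for the 5 classes (N, AS, MR, MS, MVP).
probs = softmax([2.0, 0.5, 0.1, -1.0, 0.0])
print(len(added), len(merged), round(sum(probs), 6))
```

Addition keeps the feature dimension unchanged and requires matching shapes, whereas concatenation doubles it here and lets the prediction layer weight the CNN and LSTM features separately.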

3. Results

  • The dataset was divided into training, validation and testing using a ratio of 70%:10%:20%.

3.1. GitHub PCG Dataset [18]

10-fold cross-validation
Precision, Recall and F1-score

The proposed model achieved near-perfect validation accuracy on the given dataset under 10-fold cross-validation, showing very high precision and recall scores for all the classes.

Confusion matrix

3.2. PhysioNet/CinC 2016 Challenge Dataset [39]

Confusion matrix
  • The confusion matrices of the best performing models for both the GitHub PCG dataset and the PhysioNet secondary dataset are shown in Figures 8 and 9.

From Figure 9, the proposed framework performs quite well on the PhysioNet dataset.

3.3. PhysioNet-GitHub Merged Dataset

Confusion matrix
  • The generalization potential of the proposed CardioXNet has been tested using a merged dataset of PhysioNet and GitHub PCG recordings for training, validation and testing.
  • Overall, the merged dataset consists of 4240 (3240 + 1000) PCG recordings, of which 2775 (2575 + 200) are normal and 1465 (665 + 800) are abnormal. 2968 recordings are available for training, while 424 and 848 recordings are used for validation and testing, respectively.
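The merged-dataset counts can be checked with a few lines of arithmetic. The sketch below uses the per-dataset class counts from Section 1 and the 70%:10%:20% ratio from the start of Section 3. The helper name is hypothetical, and mapping 424 to validation and 848 to testing follows from that ratio.

```python
# Class counts from Section 1: PhysioNet has 2575 normal / 665 abnormal;
# the GitHub dataset has 200 normal and 4 x 200 = 800 diseased recordings.
physionet = {"normal": 2575, "abnormal": 665}
github = {"normal": 200, "abnormal": 800}

merged = {label: physionet[label] + github[label] for label in physionet}
total = sum(merged.values())


def split_counts(n_total, ratios=(70, 10, 20)):
    """Train/validation/test sizes for percentage ratios (integer arithmetic)."""
    n_train = n_total * ratios[0] // 100
    n_val = n_total * ratios[1] // 100
    n_test = n_total - n_train - n_val  # the remainder goes to the test set
    return n_train, n_val, n_test


print(merged)               # {'normal': 2775, 'abnormal': 1465}
print(total)                # 4240
print(split_counts(total))  # (2968, 424, 848)
```

Note that the merged set is imbalanced, with roughly 65% of the recordings labeled normal.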

Based on Figure 10, CardioXNet showed an overall accuracy of 88.09%, with 88.08% precision, 87.98% recall and an 88.03% F1 score.
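As a quick sanity check, the reported F1 score is consistent with the harmonic mean of the reported precision and recall. The sketch below verifies this and also shows how such metrics can be derived from a confusion matrix. The 2×2 matrix is a made-up toy example, not the one in Figure 10, and macro-averaging over classes is my assumption about how the paper aggregates the scores.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_metrics(confusion):
    """Accuracy and macro-averaged precision/recall/F1.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    precisions, recalls = [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        recalls.append(tp / actual_k if actual_k else 0.0)
    precision = sum(precisions) / n
    recall = sum(recalls) / n
    return accuracy, precision, recall, f1_score(precision, recall)


# Check the reported numbers against each other (values in percent).
reported_f1 = f1_score(88.08, 87.98)
print(round(reported_f1, 2))  # 88.03

# Toy confusion matrix (made up, NOT taken from the paper).
toy = [[90, 10],
       [20, 80]]
accuracy, precision, recall, f1 = macro_metrics(toy)
```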

SOTA Comparisons

The proposed CardioXNet architecture outperforms or performs on par with the previous works, achieving an accuracy of 99.6%.

3.4. Computational Efficiency

The proposed lightweight CNN model has an extremely low end-to-end time of 54.60(±0.06) ms.

  • The proposed model has 0.67 M trainable parameters, 26 M FLOPS and a smaller memory requirement of only 7.96 MB.
  • The authors also mention limitations at the end, e.g., the work could be further improved with multiple larger PCG datasets, since the amount of PCG data for HVD is limited.


