Brief Review — Learning Image-based Representations for Heart Sound Classification


Sik-Ho Tsang
3 min read · Jun 22, 2024

Learning Image-based Representations for Heart Sound Classification
VGGNet+SVM, by University of Augsburg, Technische Universität München, Imperial College London
2018 ICDH, Over 70 Citations (Sik-Ho Tsang @ Medium)

Phonocardiogram (PCG)/Heart Sound Classification
2013 …
2023 … [CTENN] [Bispectrum + ViT] 2024 [MWRS-BFSC + CNN2D]

  • Scalogram images are extracted from the heart sound signal and fed into an end-to-end CNN to learn a representation.
  • This representation is then fed into fully connected layers or an SVM for heart sound classification.


  1. Scalogram+VGGNet+SVM
  2. Results

1. Scalogram+VGGNet+SVM

1.1. Baseline

  • A baseline is established based on the INTERSPEECH COMputational PARalinguistics challengE (ComParE) audio feature set [29].
  • The ComParE feature set is a 6373-dimensional representation, which is fed into an SVM for classification.

1.1. Scalogram

Scalogram Examples
  • The scalogram images are generated using the Morse wavelet transformation [17] at a 2 kHz sampling frequency.
  • They are scaled to 224 × 224 so that they can be fed into VGG-16.
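
The two steps above can be sketched with NumPy as follows. This is only an illustrative sketch, not the authors' pipeline: the Morse parameters (beta = 20, gamma = 3), the 20 to 800 Hz frequency range, the 64 frequency rows, and the nearest-neighbour resize are all assumptions, and the function names `morse_scalogram` and `resize_nearest` are hypothetical.

```python
import numpy as np

def morse_scalogram(x, fs=2000.0, n_freqs=64, f_min=20.0, f_max=800.0,
                    beta=20.0, gamma=3.0):
    """Magnitude of a generalized Morse wavelet transform of x.

    beta, gamma, the frequency range and n_freqs are assumed values,
    not taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    omega = 2.0 * np.pi * np.fft.fftfreq(n)       # rad/sample
    pos = omega > 0                               # analytic wavelet: positive frequencies only
    spectrum = np.fft.fft(x)
    omega_peak = (beta / gamma) ** (1.0 / gamma)  # peak frequency of the mother wavelet
    # Log of the normalising constant, chosen so the wavelet peaks at 2.
    log_norm = np.log(2.0) + (beta / gamma) * (1.0 + np.log(gamma / beta))
    freqs = np.geomspace(f_max, f_min, n_freqs)   # row 0 = highest frequency
    scales = omega_peak * fs / (2.0 * np.pi * freqs)
    mag = np.empty((n_freqs, n))
    for i, s in enumerate(scales):
        w = s * omega[pos]
        psi = np.zeros(n)
        psi[pos] = np.exp(log_norm + beta * np.log(w) - w ** gamma)
        mag[i] = np.abs(np.fft.ifft(spectrum * psi))
    return freqs, mag

def resize_nearest(img, size=224):
    """Nearest-neighbour resize to size x size (a stand-in for image scaling)."""
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[np.ix_(rows, cols)]

# Usage: a 4-second, 100 Hz test tone sampled at 2 kHz.
fs = 2000
t = np.arange(4 * fs) / fs
tone = np.sin(2 * np.pi * 100.0 * t)
freqs, mag = morse_scalogram(tone, fs=fs)
image = resize_nearest(mag)
peak_hz = freqs[np.argmax(mag.mean(axis=1))]
```

For the test tone, the row with the most energy lies close to 100 Hz, which is a quick sanity check that the frequency axis is mapped correctly before the image is passed to a CNN.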

1.2. VGG-16 as Classifier

  • Learning Classifier of VGG-16: The parameters of the convolutional layers and fc6 are frozen, while the parameters of the final two fully connected layers and the soft-max layer are updated using scalogram images of the heart sound data.
  • Learning VGG: The last fully connected layer is replaced with a new one that has 2 neurons, followed by a soft-max layer, for the 2-class classification task. All VGG-16 parameters are adapted to the heart sound data.
  • Cross entropy is used as the loss function.
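
To make the two training strategies concrete, here is a small framework-free sketch. The layer names follow the usual VGG-16 naming (13 convolutional layers plus fc6, fc7, fc8); the strategy keys `learning_classifier` and `learning_vgg` and the helper names are hypothetical, and a real setup would apply this plan inside a deep learning framework.

```python
import math

# Standard VGG-16 weight layers: 13 convolutional + 3 fully connected.
VGG16_LAYERS = [
    "conv1_1", "conv1_2", "conv2_1", "conv2_2",
    "conv3_1", "conv3_2", "conv3_3",
    "conv4_1", "conv4_2", "conv4_3",
    "conv5_1", "conv5_2", "conv5_3",
    "fc6", "fc7", "fc8",
]

def trainable_layers(strategy):
    """Return the layers whose weights are updated under each strategy."""
    if strategy == "learning_classifier":
        # Convolutional layers and fc6 stay frozen; only fc7 and fc8 are updated.
        return [name for name in VGG16_LAYERS if name in ("fc7", "fc8")]
    if strategy == "learning_vgg":
        # fc8 is replaced by a 2-neuron layer and every layer is updated.
        return list(VGG16_LAYERS)
    raise ValueError(f"unknown strategy: {strategy}")

def softmax(logits):
    """Numerically stable soft-max over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy loss for one sample; target is the true class index."""
    return -math.log(softmax(logits)[target])

# Usage: two output neurons, index 0 = normal, index 1 = abnormal (assumed order).
loss = cross_entropy([0.3, 1.2], target=1)
```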

1.3. VGG-16 + SVM

  • The activations of the first fully connected layer, fc6, of VGG-16 are used as features; this layer has 4096 attributes.
  • This 4096-D feature vector is fed into a linear SVM. The VGG-16 can be either pre-trained or learnt, as described in 1.2 above.

1.4. Late-Fusion Strategy

  • As the PCG recordings in the PhysioNet/CinC Challenge are of varying lengths, the recordings are segmented into non-overlapping chunks of 4 seconds.
  • A late-fusion strategy is therefore employed to produce a single label (normal/abnormal) per recording.
  • The label of a PCG recording is chosen according to the highest probability, max{p_i}, obtained over its chunks.
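
The chunking and late-fusion steps above can be sketched in plain Python. This is one reading of the rule, not the authors' code: the 2 kHz sampling rate, dropping an incomplete trailing chunk, the (normal, abnormal) probability order, and the function names are all assumptions.

```python
LABELS = ("normal", "abnormal")  # assumed order of the per-chunk probabilities

def split_into_chunks(signal, fs=2000, chunk_seconds=4):
    """Split a recording into non-overlapping chunks of chunk_seconds.

    Assumption: an incomplete trailing chunk is dropped.
    """
    size = fs * chunk_seconds
    n_full = len(signal) // size
    return [signal[i * size:(i + 1) * size] for i in range(n_full)]

def late_fusion(chunk_probs):
    """Pick the recording label from the single most confident chunk.

    chunk_probs holds one [p_normal, p_abnormal] pair per chunk.
    """
    if not chunk_probs:
        raise ValueError("at least one chunk is required")
    best_prob, best_label = -1.0, None
    for probs in chunk_probs:
        for label, p in zip(LABELS, probs):
            if p > best_prob:
                best_prob, best_label = p, label
    return best_label

# Usage: a 9-second recording yields two full 4-second chunks.
recording = [0.0] * (9 * 2000)
chunks = split_into_chunks(recording)
# Hypothetical classifier outputs for three chunks of another recording.
label = late_fusion([[0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])
```

In the usage example, the second chunk's abnormal probability of 0.8 is the highest value over all chunks, so the whole recording is labelled abnormal.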

2. Results

2.1. PhysioNet/CinC Challenge 2016: Databases

PhysioNet/CinC Challenge 2016: Databases
  • The above approaches are evaluated on the database of PhysioNet/CinC Challenge 2016.
  • As the test set labels for this data are not publicly available, the training set of the database is used and split into new training, development, and test sets.
  • There are 3240 heart sound recordings in total.
  • The dataset consists of 6 sub-databases from different research groups:
  1. MIT heart sounds database: 409 PCG and ECG recordings sampled at 44.1 kHz with 16-bit quantisation.
  2. AAD heart sounds database: 544 recordings, recorded at a 4 kHz sample rate with 16-bit quantisation.
  3. AUTH heart sounds database: 45 recordings.
  4. UHA heart sounds database: 39 recordings.
  5. DLUT heart sounds database: 338 recordings.
  6. SUA heart sounds database: 81 recordings.

2.2. Performance Comparison

Performance Comparison

All CNN-based approaches improve on the baseline in MAcc on the test set.

  • Although this consistency is not seen on the development set, the MAccs achieved on the test set indicate that the deep representation features extracted from scalogram images are stronger and more robust than conventional audio features for heart sound classification.
  • In general, SVM classification of features extracted from either the pre-trained or the learnt VGG topologies performs better than the CNN classifiers.

Finally, the strongest performance, 56.2% MAcc, is obtained on the test set using the method ‘learnt VGG+SVM’. This MAcc is a significant relative improvement of 19.8% over the baseline classifier.
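
As a rough illustration of how these numbers relate, the sketch below computes MAcc and a relative improvement. It assumes MAcc is the mean of sensitivity and specificity, as in the PhysioNet/CinC Challenge 2016 scoring; the review itself does not define it. The baseline value computed at the end is only implied by the two quoted figures, not quoted from the paper.

```python
def mean_accuracy(y_true, y_pred, positive="abnormal"):
    """MAcc = (sensitivity + specificity) / 2 (assumed definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def relative_improvement(new, baseline):
    """Relative gain of new over baseline, as a fraction."""
    return (new - baseline) / baseline

# Toy example: one of two abnormal recordings is missed.
toy_macc = mean_accuracy(
    ["abnormal", "abnormal", "normal", "normal"],
    ["abnormal", "normal", "normal", "normal"],
)

# Baseline MAcc implied by 56.2% being a 19.8% relative improvement.
implied_baseline = 56.2 / (1 + 0.198)
```

The implied baseline is about 46.9% MAcc, so the absolute gain is roughly 9 percentage points.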


