Brief Review — Multimodal CNN Fusion Architecture With Multi-features for Heart Sound Classification

Multimodal CNN

Sik-Ho Tsang
3 min read · Mar 10, 2024

Multimodal CNN Fusion Architecture With Multi-features for Heart Sound Classification (Multimodal CNN), by Concordia University
2021 ISCAS, Over 10 Citations (Sik-Ho Tsang @ Medium)

Heart Sound Classification
2023 [2LSTM+3FC, 3CONV+2FC] [NRC-Net] [Log-MelSpectrum + Modified VGGNet] [CNN+BiGRU] [CWT+MFCC+DWT+CNN+MLP] [LSTM U-Net (LU-Net)]

  • Instead of relying on features from a single domain, both general frequency features and Mel-domain features are extracted from the raw heart sound.
  • The branches of the multimodal CNN fusion architecture are trained individually, each on the feature maps produced by one feature extraction method.
  • These feature maps are then merged so that the classifier can exploit the diverse extracted features.


  1. Multimodal CNN
  2. Results

1. Multimodal CNN

Multimodal CNN

1.1. Feature Extraction

  • Features are extracted from two domains: MFCCs and Mel features from the Mel domain, and chroma, spectral contrast and tonnetz (CST) as general frequency features from the frequency domain.

1.1.1. Mel Domain

  • The dimension of the MFCC features is set to 40 in this work.
  • Mel features and MFCC features are extracted using the same procedure, except that the MFCCs require a final DCT step. The dimension of the Mel features is 128.
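To make the relationship between the two Mel-domain features concrete, here is a minimal, dependency-free sketch for a single frame. It is not the authors' code. The Mel filterbank energies are made-up toy values, the helper names `dct_ii` and `mel_and_mfcc` are hypothetical, and taking the logarithm before the DCT follows the usual MFCC recipe rather than anything stated in the paper.

```python
import math


def dct_ii(x, n_out):
    """Unnormalised DCT-II of x, keeping only the first n_out coefficients."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        for k in range(n_out)
    ]


def mel_and_mfcc(mel_energies, n_mfcc=40):
    """Return (log-Mel vector, MFCC vector) for one frame.

    mel_energies: non-negative Mel filterbank energies, assumed to be given.
    The two outputs share every step except the final DCT.
    """
    log_mel = [math.log(e + 1e-10) for e in mel_energies]  # small offset avoids log(0)
    mfcc = dct_ii(log_mel, n_mfcc)
    return log_mel, mfcc


# Toy frame: 128 made-up Mel band energies, for illustration only.
frame = [1.0 + 0.5 * math.sin(0.1 * b) for b in range(128)]
log_mel, mfcc = mel_and_mfcc(frame)
print(len(log_mel), len(mfcc))  # 128 40
```

In practice, an audio library would compute the filterbank energies from the raw heart sound; only the shared-pipeline idea and the 128 and 40 dimensions come from the paper.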

1.1.2. General Frequency Features

  • These features contain three parts.
  • The chroma-based feature represents the spectral energy contained in each of the twelve standard equal-tempered scale pitch classes [18], [19]. Besides, it is relatively immune to background noise.
  • The spectral contrast feature can be obtained from a spectrogram that reflects time-varying or rhythmic information of signals.
  • Tonal centroid features, known as tonnetz, provide distinctive features for audio signals [21]. Such characteristics are useful in recognizing changes in the harmonic content of audio signals.
  • The dimensions of chroma, spectral contrast and tonnetz (CST) are 12, 7 and 6, respectively.
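As a rough illustration of the chroma idea, the sketch below folds spectral energy into the twelve equal-tempered pitch classes and then checks the CST dimension. It is a simplified stand-in, not the extraction used in the paper. The function names, the 440 Hz tuning reference and the toy frequency/energy pairs are all assumptions.

```python
import math


def pitch_class(freq_hz, ref_a4=440.0):
    """Map a frequency to one of 12 equal-tempered pitch classes (0 = C, 9 = A)."""
    midi = 69 + 12 * math.log2(freq_hz / ref_a4)  # MIDI note number, A4 = 69
    return int(round(midi)) % 12


def chroma_vector(freqs_hz, energies):
    """Accumulate the energy of each frequency bin into its pitch class."""
    chroma = [0.0] * 12
    for f, e in zip(freqs_hz, energies):
        if f > 0:  # skip the DC bin, which has no pitch
            chroma[pitch_class(f)] += e
    return chroma


# Toy spectrum: A4 and A5 fall into the same pitch class, C4 into another.
chroma = chroma_vector([440.0, 880.0, 261.63], [1.0, 0.5, 2.0])

# Chroma (12) + spectral contrast (7) + tonnetz (6) gives the CST dimension.
cst_dim = len(chroma) + 7 + 6
print(pitch_class(440.0), pitch_class(880.0), cst_dim)  # 9 9 25
```

Because octaves collapse onto the same class, the chroma vector keeps a fixed length of 12 regardless of the frequency resolution of the spectrogram.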

1.2. Multimodal CNN

  • The automatic feature learning block is composed of multiple CNN bases.
  • Each 1D CNN base has a 1D convolution layer with kernel size 3 and stride 1, followed by global average pooling.
  • When the input features are MFCCs, Mel and CST, the output feature vectors from the corresponding CNN bases have dimensions 40, 128 and 25, respectively.
  • In the fusion step, the feature vectors generated by the CNN bases are concatenated into a single feature vector.
  • Fully connected layers and softmax are responsible for learning the discriminative features.
  • The classifier contains two dense layers, where the first layer has 40 neurons with ReLU activation, while the final output layer contains 5 neurons with a softmax activation.
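The bullets above can be sketched end to end in plain Python. This is only a shape-level illustration with random weights and random stand-in features, not the authors' implementation. The number of frames, the choice of one filter per input feature dimension (picked so the branch outputs match the reported 40, 128 and 25), the absence of padding and of an activation inside the CNN bases, and all helper names are assumptions.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    """Random matrix used as a stand-in for learned weights or input features."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def conv1d_gap(frames, kernels):
    """1D convolution over time (kernel 3, stride 1, no padding) + global average pooling.

    frames:  T frames, each a list of C_in feature values.
    kernels: one 3 x C_in weight matrix per filter.
    Returns one pooled value per filter.
    """
    positions = len(frames) - 2
    channels = len(frames[0])
    pooled = []
    for w in kernels:
        total = 0.0
        for t in range(positions):
            total += sum(w[k][c] * frames[t + k][c]
                         for k in range(3) for c in range(channels))
        pooled.append(total / positions)
    return pooled


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


dims = {"mfcc": 40, "mel": 128, "cst": 25}
n_frames = 8  # hypothetical number of time frames

branches = {}
for name, d in dims.items():
    frames = rand_matrix(n_frames, d)                # stand-in for real features
    kernels = [rand_matrix(3, d) for _ in range(d)]  # d filters: an assumption
    branches[name] = conv1d_gap(frames, kernels)

# Fusion: concatenate the three branch outputs into one vector.
fused = branches["mfcc"] + branches["mel"] + branches["cst"]

# Classifier: dense layer with 40 ReLU units, then a 5-way softmax output.
hidden = [max(v, 0.0) for v in dense(fused, rand_matrix(40, len(fused)), [0.0] * 40)]
probs = softmax(dense(hidden, rand_matrix(5, 40), [0.0] * 5))
print(len(fused), len(hidden), len(probs))  # 193 40 5
```

The fused vector has 40 + 128 + 25 = 193 entries, which is what the first dense layer sees in this sketch. A real implementation would use a deep learning framework and learn the weights from labelled heart sounds.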

2. Results

The trimodal system achieves an overall classification accuracy of 98.5%, outperforming the unimodal and bimodal systems. A similar trend is observed for the F1 score and the other reported score.

SOTA Comparisons
  • The unimodal system using only MFCC features achieves an accuracy of 88.1%, which is already higher than that of the DNN-based approach [2].

The trimodal system achieves the best results.


