Brief Review — Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms

Sample-Level DCNN

Sik-Ho Tsang
Feb 17, 2024

Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms
Sample-Level DCNN (LeeNet), by Korea Advanced Institute of Science and Technology (KAIST)
2017 SMC, Over 220 Citations (Sik-Ho Tsang @ Medium)

Sound Classification / Audio Tagging / Sound Event Detection (SED)
2015 [ESC-50, ESC-10, ESC-US] 2017 [AudioSet / Audio Set] [M3, M5, M11, M18, M34-res (DaiNet)] 2021 [Audio Spectrogram Transformer (AST)]
==== My Other Paper Readings Are Also Over Here ====

  • Sample-level deep convolutional neural network is proposed for music auto-tagging, which learns representations directly from very small grains of waveforms (e.g. 2 or 3 samples), going beyond typical frame-level input representations.
  • It is named LeeNet by PANNs, based on the surname of the first author.


  1. Sample-Level DCNN
  2. Results

1. Sample-Level DCNN

Frame-level approach using mel-spectrogram (left), frame-level approach using raw waveforms (middle) and proposed sample-level approach using raw waveforms (right).

1.1. Frame-level approach using mel-spectrogram (left)

  • This is the most common CNN model used in music auto-tagging.

The 2D time-frequency representation is used as input.
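To make this input concrete, here is a minimal NumPy sketch (window and hop sizes are illustrative, not the paper's settings) that frames a waveform and takes per-frame magnitude spectra, yielding the kind of 2D time-frequency matrix a frame-level CNN consumes:

```python
import numpy as np

# Hypothetical settings: 1-second mono clip at 22,050 Hz,
# framed with a 512-sample window and a 256-sample hop.
sr, win, hop = 22050, 512, 256
y = np.random.randn(sr).astype(np.float32)  # stand-in waveform

# Slice the waveform into overlapping frames.
n_frames = 1 + (len(y) - win) // hop
frames = np.stack([y[i * hop : i * hop + win] for i in range(n_frames)])

# Magnitude spectrum per frame -> a 2D time-frequency input,
# analogous to the spectrogram fed to the frame-level CNN.
spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
print(spec.shape)  # (n_frames, win // 2 + 1)
```

Applying a mel filter bank to `spec` would then give the mel-spectrogram proper.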

1.2. Frame-level approach using raw waveforms (middle)

  • A strided convolution layer is added beneath the bottom layer of the frame-level mel-spectrogram model.
  • The strided convolution layer is expected to learn a filter-bank representation.

In this model, once the raw waveforms pass through the first strided convolution layer, the output feature map has the same dimensions as the mel-spectrogram.
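The dimension claim can be illustrated with a toy NumPy sketch (filter count and lengths are illustrative assumptions): a strided convolution whose filter length equals its stride, so each output row corresponds to one non-overlapping frame, like one column of a spectrogram.

```python
import numpy as np

# Toy strided 1-D convolution (not the paper's code): when the stride
# equals the filter length, the output is one vector of filter responses
# per non-overlapping frame -- a learned filter-bank representation.
def strided_conv1d(x, filters, stride):
    # x: (n_samples,), filters: (n_filters, filter_len)
    n_out = (len(x) - filters.shape[1]) // stride + 1
    frames = np.stack([x[i * stride : i * stride + filters.shape[1]]
                       for i in range(n_out)])  # (n_out, filter_len)
    return frames @ filters.T                   # (n_out, n_filters)

x = np.random.randn(59049).astype(np.float32)      # raw waveform
w = np.random.randn(128, 729).astype(np.float32)   # 128 filters, length 729
out = strided_conv1d(x, w, stride=729)
print(out.shape)  # (81, 128): 81 frames x 128 "bands"
```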

1.3. Proposed sample-level approach using raw waveforms (right)

  • Simply adding a strided convolution layer is not sufficient, because the first layer still has to learn filters over relatively long frames of raw samples.

To improve this, multiple layers are added beneath the frame-level layers so that the first convolution layer handles a much smaller number of samples.

1.4. Model Architecture

Based on the above concept, m^n-DCNN models are proposed, where m refers to the filter length and pooling length of the intermediate convolution-layer modules and n refers to the number of the modules (i.e. the depth).

  • In the above table, m is 3 and n is 9, i.e. the 3⁹ model, which takes 59,049 samples (19,683 frames) as input.
  • Sigmoid is used at the output layer, and binary cross-entropy loss is used.
  • For every convolution layer, batch normalization and ReLU are used.
  • Dropout of 0.5 is used at the output of the last convolution layer.
  • No input normalization is performed.
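The multi-label objective above can be sketched in a few lines of PyTorch (toy tensors for 4 clips and 50 tags, not the authors' training code):

```python
import torch
import torch.nn as nn

# Sketch of the training objective: sigmoid outputs per tag,
# scored against binary tag labels with binary cross-entropy.
criterion = nn.BCELoss()
probs = torch.sigmoid(torch.randn(4, 50))          # predicted tag probabilities
targets = torch.randint(0, 2, (4, 50)).float()     # toy 0/1 tag labels
loss = criterion(probs, targets)
print(loss.item())
```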
Model Variants

Different values of m and n form different model variants.
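Putting the pieces together, here is a rough PyTorch sketch of a 3⁹-style model. The channel width (128 throughout) and the 1×1 output convolution are simplifying assumptions loosely following the paper's table, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch of a 3^9 m^n-DCNN (channel widths are assumptions).
class SampleLevelDCNN(nn.Module):
    def __init__(self, n_tags=50, ch=128):
        super().__init__()
        # First strided conv: filter length 3, stride 3, on raw samples.
        layers = [nn.Conv1d(1, ch, 3, stride=3), nn.BatchNorm1d(ch), nn.ReLU()]
        for _ in range(9):  # nine conv(3) + max-pool(3) modules
            layers += [nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch),
                       nn.ReLU(), nn.MaxPool1d(3)]
        self.features = nn.Sequential(*layers)
        self.dropout = nn.Dropout(0.5)          # dropout before the output
        self.out = nn.Conv1d(ch, n_tags, 1)     # 1x1 conv as output layer

    def forward(self, x):                       # x: (batch, 1, 59049)
        h = self.dropout(self.features(x))      # -> (batch, ch, 1)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, n_tags)

model = SampleLevelDCNN()
y = model(torch.randn(2, 1, 59049))
print(y.shape)  # torch.Size([2, 50])
```

Note how the 59,049-sample input collapses to a single time step after the nine pooling stages, matching the arithmetic of the m^n design.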

2. Results

Comparison of three CNN models with different window (filter length) and hop (stride) sizes
  • Two datasets are used for evaluation: the MagnaTagATune dataset (MTAT) [18] and the Million Song Dataset (MSD) annotated with Last.fm tags [19]. Only the 50 most frequently used tags are kept in both datasets.
  • AUC is measured.
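For reference, the per-tag ROC AUC can be computed with the rank-based (Mann–Whitney) formula; a toy NumPy sketch with made-up scores and labels, ignoring tied scores:

```python
import numpy as np

# Rank-based ROC AUC: fraction of (positive, negative) pairs where the
# positive example is scored higher (ties not handled in this sketch).
def auc(scores, labels):
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Perfect ranking of two positives above two negatives -> AUC = 1.0
print(auc(np.array([0.9, 0.8, 0.3, 0.1]), np.array([1, 1, 0, 0])))  # 1.0
```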

The proposed sample-level raw waveform model achieves results comparable to the frame-level mel-spectrogram model.

Comparison of the proposed models to prior state-of-the-art approaches

The proposed sample-level architecture is highly effective compared to SOTA approaches.


