Brief Review — Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms

Sample-Level DCNN

Sik-Ho Tsang
Feb 17, 2024

Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms
Sample-Level DCNN (LeeNet), by Korea Advanced Institute of Science and Technology (KAIST)
2017 SMC, Over 220 Citations (Sik-Ho Tsang @ Medium)

Sound Classification / Audio Tagging / Sound Event Detection (SED)
2015 [ESC-50, ESC-10, ESC-US] 2017 [AudioSet / Audio Set] [M3, M5, M11, M18, M34-res (DaiNet)] 2021 [Audio Spectrogram Transformer (AST)]
==== My Other Paper Readings Are Also Over Here ====

  • Sample-level deep convolutional neural network is proposed for music auto-tagging, which learns representations directly from very small grains of waveforms (e.g. 2 or 3 samples), going beyond typical frame-level input representations.
  • It is named LeeNet by PANNs, based on the surname of the first author.


  1. Sample-Level DCNN
  2. Results

1. Sample-Level DCNN

Frame-level approach using mel-spectrogram (left), frame-level approach using raw waveforms (middle) and proposed sample-level approach using raw waveforms (right).

1.1. Frame-level approach using mel-spectrogram (left)

  • This is the most common CNN model used in music auto-tagging.

The 2D time-frequency representation is used as input.
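To make this input concrete, here is a minimal NumPy sketch (window and hop sizes are illustrative, not the paper's settings) that frames a waveform and takes per-frame magnitude spectra, yielding the kind of 2D time-frequency matrix a frame-level CNN consumes:

```python
import numpy as np

# Hypothetical settings: 1-second mono clip at 22,050 Hz,
# framed with a 512-sample window and a 256-sample hop.
sr, win, hop = 22050, 512, 256
y = np.random.randn(sr).astype(np.float32)  # stand-in waveform

# Slice the waveform into overlapping frames.
n_frames = 1 + (len(y) - win) // hop
frames = np.stack([y[i * hop : i * hop + win] for i in range(n_frames)])

# Magnitude spectrum per frame -> a 2D time-frequency input,
# analogous to the spectrogram fed to the frame-level CNN.
spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
print(spec.shape)  # (n_frames, win // 2 + 1)
```

Applying a mel filter bank to `spec` would then give the mel-spectrogram proper.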

1.2. Frame-level approach using raw waveforms (middle)

  • A strided convolution layer is added beneath the bottom layer of the frame-level mel-spectrogram model.
  • The strided convolution layer is expected to learn a filter-bank representation.

In this model, once the raw waveforms pass through the first strided convolution layer, the output feature map has the same dimensions as the mel-spectrogram.
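The dimension claim can be illustrated with a toy NumPy sketch (filter count and lengths are illustrative assumptions): a strided convolution whose filter length equals its stride, so each output row corresponds to one non-overlapping frame, like one column of a spectrogram.

```python
import numpy as np

# Toy strided 1-D convolution (not the paper's code): when the stride
# equals the filter length, the output is one vector of filter responses
# per non-overlapping frame -- a learned filter-bank representation.
def strided_conv1d(x, filters, stride):
    # x: (n_samples,), filters: (n_filters, filter_len)
    n_out = (len(x) - filters.shape[1]) // stride + 1
    frames = np.stack([x[i * stride : i * stride + filters.shape[1]]
                       for i in range(n_out)])  # (n_out, filter_len)
    return frames @ filters.T                   # (n_out, n_filters)

x = np.random.randn(59049).astype(np.float32)      # raw waveform
w = np.random.randn(128, 729).astype(np.float32)   # 128 filters, length 729
out = strided_conv1d(x, w, stride=729)
print(out.shape)  # (81, 128): 81 frames x 128 "bands"
```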

1.3. Proposed sample-level approach using raw waveforms (right)

  • Simply adding a strided convolution layer is not sufficient, because the first layer still has to learn filters over relatively long frames of raw samples.

To improve this, multiple layers are added beneath the frame-level layers so that the first convolution layer handles a much smaller number of samples.

1.4. Model Architecture

Based on the above concept, m^n-DCNN models are proposed, where m refers to the filter length and pooling length of the intermediate convolution-layer modules and n refers to the number of the modules (i.e. the depth).

  • In the above table, m is 3 and n is 9, i.e. the 3⁹ model, which takes 59,049 samples (19,683 frames) as input.
  • Sigmoid is used at the output layer, and binary cross-entropy loss is used.
  • For every convolution layer, batch normalization and ReLU are used.
  • Dropout of 0.5 is used at the output of the last convolution layer.
  • No input normalization is performed.
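The multi-label objective above can be sketched in a few lines of PyTorch (toy tensors for 4 clips and 50 tags, not the authors' training code):

```python
import torch
import torch.nn as nn

# Sketch of the training objective: sigmoid outputs per tag,
# scored against binary tag labels with binary cross-entropy.
criterion = nn.BCELoss()
probs = torch.sigmoid(torch.randn(4, 50))          # predicted tag probabilities
targets = torch.randint(0, 2, (4, 50)).float()     # toy 0/1 tag labels
loss = criterion(probs, targets)
print(loss.item())
```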
Model Variants

Different values of m and n form different model variants.
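Putting the pieces together, here is a rough PyTorch sketch of a 3⁹-style model. The channel width (128 throughout) and the 1×1 output convolution are simplifying assumptions loosely following the paper's table, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch of a 3^9 m^n-DCNN (channel widths are assumptions).
class SampleLevelDCNN(nn.Module):
    def __init__(self, n_tags=50, ch=128):
        super().__init__()
        # First strided conv: filter length 3, stride 3, on raw samples.
        layers = [nn.Conv1d(1, ch, 3, stride=3), nn.BatchNorm1d(ch), nn.ReLU()]
        for _ in range(9):  # nine conv(3) + max-pool(3) modules
            layers += [nn.Conv1d(ch, ch, 3, padding=1), nn.BatchNorm1d(ch),
                       nn.ReLU(), nn.MaxPool1d(3)]
        self.features = nn.Sequential(*layers)
        self.dropout = nn.Dropout(0.5)          # dropout before the output
        self.out = nn.Conv1d(ch, n_tags, 1)     # 1x1 conv as output layer

    def forward(self, x):                       # x: (batch, 1, 59049)
        h = self.dropout(self.features(x))      # -> (batch, ch, 1)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, n_tags)

model = SampleLevelDCNN()
y = model(torch.randn(2, 1, 59049))
print(y.shape)  # torch.Size([2, 50])
```

Note how the 59,049-sample input collapses to a single time step after the nine pooling stages, matching the arithmetic of the m^n design.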

2. Results

Comparison of three CNN models with different window (filter length) and hop (stride) sizes
  • Two datasets are used for evaluation: the MagnaTagATune dataset (MTAT) [18] and the Million Song Dataset (MSD) annotated with Last.fm tags [19]. Only the 50 most frequently used tags are kept in both datasets.
  • AUC is measured.
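For reference, the per-tag ROC AUC can be computed with the rank-based (Mann–Whitney) formula; a toy NumPy sketch with made-up scores and labels, ignoring tied scores:

```python
import numpy as np

# Rank-based ROC AUC: fraction of (positive, negative) pairs where the
# positive example is scored higher (ties not handled in this sketch).
def auc(scores, labels):
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Perfect ranking of two positives above two negatives -> AUC = 1.0
print(auc(np.array([0.9, 0.8, 0.3, 0.1]), np.array([1, 1, 0, 0])))  # 1.0
```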

The proposed sample-level raw waveform model achieves results comparable to the frame-level mel-spectrogram model.

Comparison of the proposed models to prior state-of-the-art approaches

The proposed sample-level architecture is highly effective compared to SOTA approaches.


