Brief Review — Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms
Sample-Level DCNN
Sample-Level DCNN (LeeNet), by Korea Advanced Institute of Science and Technology (KAIST)
2017 SMC, Over 220 Citations (Sik-Ho Tsang @ Medium)
Sound Classification / Audio Tagging / Sound Event Detection (SED)
2015 [ESC-50, ESC-10, ESC-US] 2017 [AudioSet / Audio Set] [M3, M5, M11, M18, M34-res (DaiNet)] 2021 [Audio Spectrogram Transformer (AST)]
==== My Other Paper Readings Are Also Over Here ====
- Sample-level deep convolutional neural network is proposed for music auto-tagging, which learns representations from very small grains of waveforms (e.g. 2 or 3 samples) beyond typical frame-level input representations.
- It was later named LeeNet in the PANNs paper, after the surname of the first author.
Outline
- Sample-Level DCNN
- Results
1. Sample-Level DCNN
1.1. Frame-level approach using mel-spectrogram (left)
- This is the most common CNN model used in music auto-tagging.
- The 2D time-frequency representation is used as input, as sketched below.
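As a rough illustration of this frame-level input, a log mel-spectrogram can be computed as follows. This is a minimal sketch: the sampling rate, FFT size, hop length, and number of mel bins are assumed values for illustration, not necessarily the paper's exact settings.

```python
# Minimal sketch: a 2D time-frequency input for a frame-level CNN.
# n_fft, hop_length, and n_mels below are assumed values, not the paper's settings.
import librosa

y, sr = librosa.load("clip.mp3", sr=22050)          # mono waveform, hypothetical file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=128
)
log_mel = librosa.power_to_db(mel)                   # (n_mels, n_frames) 2D input
print(log_mel.shape)                                 # e.g. (128, T), fed to the CNN
```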
1.2. Frame-level approach using raw waveforms (middle)
- A strided convolution layer is added beneath the bottom layer of the frame-level mel-spectrogram model.
- The strided convolution layer is expected to learn a filter-bank representation.
- In this model, once the raw waveforms pass through the first strided convolution layer, the output feature map has the same dimensions as the mel-spectrogram (see the sketch below).
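This front end can be sketched with a single strided 1-D convolution acting as a learned filter bank. It is a hedged example: the filter length, stride, and channel count below are assumptions chosen so that the output roughly mirrors a 128-bin spectrogram, not the paper's exact configuration.

```python
# Sketch of the frame-level raw-waveform front end: one strided convolution
# playing the role of a learned filter bank. Filter length 512, stride 256,
# and 128 output channels are illustrative assumptions.
import torch
import torch.nn as nn

frontend = nn.Conv1d(in_channels=1, out_channels=128, kernel_size=512, stride=256)

waveform = torch.randn(1, 1, 59049)      # (batch, channel, samples), dummy audio
features = frontend(waveform)            # (1, 128, n_frames): spectrogram-like map
print(features.shape)                    # each "frame" covers 512 raw samples
```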
1.3. Proposed sample-level approach using raw waveforms (right)
- Simply adding a strided convolution layer is not sufficient to overcome the problems, since the first layer still has to take in a frame-sized chunk of samples at once.
- To improve this, multiple convolution layers are stacked beneath the frame level so that the first convolution layer only handles a very small number of samples (e.g. 2 or 3), as in the quick calculation below.
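A quick back-of-the-envelope calculation (assuming the 3⁹ configuration with 59049 input samples described in the next subsection) shows how small the first layer's view is and how each module then shrinks the time axis by another factor of 3:

```python
# Sanity check of the hierarchy, assuming m = 3, n = 9 (the 3^9 model):
# the first convolution sees only 3 raw samples at a time, and each module
# reduces the number of frames by another factor of 3.
samples = 59049                      # 3^10 input samples
frames = samples // 3                # after the first conv (filter length 3, stride 3)
print(f"after first conv: {frames} frames")   # 19683

for module in range(1, 10):          # 9 intermediate conv + max-pool modules
    frames //= 3                     # each max-pool of length 3 reduces frames by 3x
    print(f"after module {module}: {frames} frames")
# The last module leaves a single frame that summarizes the whole excerpt.
```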
1.4. Model Architecture
- Based on the above concept, mⁿ-DCNN models are proposed, where m refers to the filter length and pooling length of the intermediate convolution layer modules and n refers to the number of these modules (i.e. the depth).
- In the above table, m is 3 and n is 9: the 3⁹ model takes 59049 samples as input, which become 19683 frames after the first strided convolution layer.
- Sigmoid is used at the output layer, and binary cross-entropy loss is used for training.
- For every convolution layer, batch normalization and ReLU are used.
- Dropout of 0.5 is used at the output of the last convolution layer.
- No input normalization is performed.
- Different values of m and n form different model variants (a minimal sketch of the 3⁹ model follows).
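Putting the description above together, here is a minimal PyTorch sketch of a 3⁹-style model. It is a hedged sketch, not the authors' implementation: the channel widths and the output head are assumptions for illustration (the paper's table lists the exact filter counts), but the structure follows the bullets above: a strided first convolution, nine conv/max-pool modules with filter and pooling length 3, batch normalization and ReLU after every convolution, dropout of 0.5 at the end, and a sigmoid output trained with binary cross-entropy.

```python
# Hedged sketch of a 3^9 sample-level DCNN (m = 3, n = 9). Channel widths and the
# output head are illustrative assumptions; see the paper for the exact configuration.
import torch
import torch.nn as nn

def sample_block(in_ch, out_ch):
    # One intermediate module: conv (length 3, stride 1, same padding),
    # batch norm, ReLU, then max-pooling of length 3.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(3),
    )

class SampleLevelDCNN(nn.Module):
    def __init__(self, n_tags=50):
        super().__init__()
        # First layer: strided convolution directly on raw samples (length 3, stride 3).
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=3, stride=3),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )
        widths = [128, 128, 128, 256, 256, 256, 256, 256, 256, 512]  # assumed widths
        self.blocks = nn.Sequential(
            *[sample_block(widths[i], widths[i + 1]) for i in range(9)]
        )
        self.head = nn.Sequential(
            nn.Dropout(0.5),                 # dropout on the last convolution output
            nn.Linear(widths[-1], n_tags),
            nn.Sigmoid(),                    # one probability per tag
        )

    def forward(self, x):                    # x: (batch, 1, 59049) raw samples
        h = self.blocks(self.frontend(x))    # -> (batch, 512, 1)
        return self.head(h.squeeze(-1))      # -> (batch, n_tags)

model = SampleLevelDCNN()
y = model(torch.randn(2, 1, 59049))          # dummy batch of 59049-sample excerpts
loss = nn.BCELoss()(y, torch.randint(0, 2, (2, 50)).float())  # binary cross-entropy
```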
2. Results
- Two datasets are used for evaluation: MagnaTagATune dataset (MTAT) [18] and Million Song Dataset (MSD) annotated with the Last.FM tags [19]. Only the most frequently labeled 50 tags are used in both datasets.
- AUC (area under the ROC curve) is used as the evaluation metric; a small evaluation sketch follows after this list.
- The proposed sample-level raw waveform model achieves results comparable to the frame-level mel-spectrogram model.
- The proposed sample-level architecture also compares favorably with state-of-the-art approaches on both datasets.
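For the metric above, evaluation can be sketched with scikit-learn's roc_auc_score. The arrays and the tag-wise averaging below are illustrative, not the authors' exact evaluation code.

```python
# Sketch of the evaluation: area under the ROC curve, averaged over the 50 tags.
# y_true and y_score are hypothetical arrays of shape (n_clips, n_tags).
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.random.randint(0, 2, size=(1000, 50))     # ground-truth tag labels
y_score = np.random.rand(1000, 50)                     # sigmoid outputs of the model

tag_auc = roc_auc_score(y_true, y_score, average=None) # per-tag AUC
print(f"mean tag-wise AUC: {tag_auc.mean():.4f}")
```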