Brief Review — PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
Pretrained audio neural networks (PANNs) trained on the Large-Scale AudioSet Dataset
PANNs, by University of Surrey, ByteDance AI Lab, Qingdao University of Science and Technology
2020 TASLP, Over 880 Citations (Sik-Ho Tsang @ Medium)
Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text
- An architecture called Wavegram-Logmel-CNN is proposed, which takes both the log mel spectrogram and the waveform as input features.
- Pretrained audio neural networks (PANNs) are pretrained on the large-scale AudioSet dataset.
Outline
- PANNs
- Results
1. PANNs
1.1. CNNs, ResNets, MobileNets
- For conventional 2D CNNs of the kind used in image classification, CNN6, CNN10, and CNN14 are tried.
- For ResNets, ResNet22, ResNet38, and ResNet54 are designed.
- For MobileNets, MobileNetV1 and MobileNetV2 are used.
- Besides, one-dimensional CNNs that operate directly on the waveform are also tried, such as DaiNet and LeeNet.
1.2. Proposed Wavegram-CNN & Wavegram-Logmel-CNN
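- Wavegram is a feature similar to the log mel spectrogram, but it is learnt by a neural network from the waveform rather than computed by a fixed transform.
- Wavegram-CNN (Left Branch): The Wavegram is learnt from the waveform by convolutional layers, with CNN14 used as the backbone.
- Right Branch: The waveform is transformed into a log mel spectrogram, which is then processed by convolutional layers.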
1.3. Data Balancing & Data Augmentation
- AudioSet is highly unbalanced. For example, there are over 900,000 audio clips belonging to the categories “Speech” and “Music”. On the other hand, there are only tens of audio clips belonging to the category “Toothbrush”.
- A balanced sampling strategy is applied to train PANNs. That is, audio clips are approximately equally sampled from all sound classes to constitute a minibatch.
- mixup and SpecAugment are used as data augmentation techniques.
- mixup augments a dataset by interpolating both the inputs and the targets of two audio clips from the dataset.
- SpecAugment operates on the log mel spectrogram of an audio clip using frequency masking and time masking.
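The three ideas above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the PANNs implementation: the function names, the single-label toy dataset (AudioSet itself is multi-label), the mask widths, and the use of one mask per axis are all assumptions made for the example. In the original mixup formulation the weight `lam` is drawn from a Beta distribution; here it is simply passed in.

```python
import random


def balanced_batch(clips_by_class, batch_size, rng):
    """Pick a class uniformly at random, then a clip from that class.

    Rare classes are therefore drawn about as often as common ones.
    """
    classes = sorted(clips_by_class)
    batch = []
    for _ in range(batch_size):
        label = rng.choice(classes)
        batch.append((rng.choice(clips_by_class[label]), label))
    return batch


def mixup(x1, y1, x2, y2, lam):
    """Interpolate inputs and targets of two clips with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def spec_augment(spec, rng, max_time_width=2, max_freq_width=2):
    """Zero out one random band of time frames and one band of mel bins.

    spec is a list of time frames, each a list of mel-bin values.
    The input is left untouched; a masked copy is returned.
    """
    n_time, n_freq = len(spec), len(spec[0])
    out = [frame[:] for frame in spec]
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for t in range(n_time):
        for f in range(n_freq):
            in_time_mask = t_start <= t < t_start + t_width
            in_freq_mask = f_start <= f < f_start + f_width
            if in_time_mask or in_freq_mask:
                out[t][f] = 0.0
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy dataset: one huge class and one tiny class.
    toy = {"Speech": list(range(1000)), "Toothbrush": [0, 1, 2]}
    batch = balanced_batch(toy, 8, rng)
    print([label for _, label in batch])
    print(mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1], lam=0.25))
    print(spec_augment([[1.0] * 5 for _ in range(4)], rng))
```

With class-uniform sampling, “Toothbrush” appears in roughly half of the draws even though it has only three clips, which is the effect the balanced sampling strategy is after.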
1.4. Transfer Learning
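- D_AudioSet is the AudioSet dataset, and x_0, y_0 are the training input and target, respectively. Three strategies are compared on a new task:
- (a) Train from scratch: the model is trained on the new task without any pretraining.
- (b) Feature extractor: a PANN pretrained on AudioSet is frozen, and a classifier is added on top of it for the new task.
- (c) Fine-tuning: the whole pretrained model is fine-tuned on the new task.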
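The difference between using a frozen feature extractor and fine-tuning can be illustrated with a toy model in plain Python. This is a sketch under loud assumptions: the two-unit `backbone` merely stands in for a pretrained PANN, the binary classifier head and all names (`backbone`, `predict`, `train_step`, `freeze_backbone`) are hypothetical, and nothing here is taken from the authors' code. Training from scratch would correspond to starting from randomly initialised backbone weights and updating everything.

```python
import math


def backbone(x, weights):
    """Stand-in for a pretrained network: one tanh unit per weight row."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def predict(x, params):
    """Return the sigmoid output of the head and the backbone features."""
    feats = backbone(x, params["backbone"])
    logit = sum(h * f for h, f in zip(params["head"], feats))
    return 1.0 / (1.0 + math.exp(-logit)), feats


def train_step(x, y, params, lr, freeze_backbone):
    """One SGD step on binary cross-entropy.

    The head (new classifier) is always updated. The backbone is updated
    only when freeze_backbone is False, i.e. when fine-tuning.
    """
    p, feats = predict(x, params)
    err = p - y  # gradient of the loss with respect to the logit
    head = params["head"]
    if not freeze_backbone:
        for j, row in enumerate(params["backbone"]):
            # Backpropagate through the head weight and the tanh unit.
            grad = err * head[j] * (1.0 - feats[j] ** 2)
            for i in range(len(row)):
                row[i] -= lr * grad * x[i]
    for j in range(len(head)):
        head[j] -= lr * err * feats[j]


if __name__ == "__main__":
    # Hypothetical "pretrained" weights for the toy backbone.
    params = {"backbone": [[0.5, -0.3], [0.2, 0.8]], "head": [0.1, -0.2]}
    train_step([1.0, 2.0], 1.0, params, lr=0.1, freeze_backbone=True)
    print(params)  # only the head has moved
```

Freezing is cheaper and keeps the pretrained features intact, while fine-tuning lets those features adapt to the new task.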
2. Results
2.1. CNN14 Performance
- The CNN14 system achieves an mAP of 0.431, outperforming the best previous system.
- CNN14 is used as a backbone to build Wavegram-Logmel-CNN for fair comparison with the CNN14 system.
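For readers unfamiliar with the metric, mAP (mean average precision) can be sketched as follows. This uses a common non-interpolated definition of average precision; the function names are hypothetical, and the handling of tied scores and of classes without positive clips (scored as 0.0 here) is an assumption that may differ from the evaluation code used in the paper.

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive clip."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a class with no positive clips contributes 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """Average the per-class AP over all classes.

    score_matrix[n][k] is the predicted score of clip n for class k,
    label_matrix[n][k] is 1 if clip n contains class k, else 0.
    """
    n_classes = len(score_matrix[0])
    aps = []
    for k in range(n_classes):
        class_scores = [row[k] for row in score_matrix]
        class_labels = [row[k] for row in label_matrix]
        aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / n_classes


if __name__ == "__main__":
    # Toy example with 4 clips and 2 classes (made-up numbers).
    scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
    labels = [[1, 0], [0, 0], [1, 1], [0, 1]]
    print(mean_average_precision(scores, labels))
```

Because every class contributes equally to the mean, rare classes such as “Toothbrush” count as much as “Speech” in the reported figure.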
2.2. Wavegram-Logmel-CNN Performance
- The proposed Wavegram-Logmel-CNN system achieves a state-of-the-art mAP of 0.439, the highest among all PANNs, and outperforms the one-dimensional LeeNet and DaiNet systems.
2.3. Transfer Learning to ESC-50
- The proposed fine-tuned system achieves an accuracy of 0.947 on ESC-50, outperforming the previous state-of-the-art system by a large margin.
- (5 more datasets are also evaluated. Please read the paper for more details.)