Brief Review — ESC: Dataset for Environmental Sound Classification
ESC-50, ESC-10, ESC-US Datasets for Audio Tagging
ESC: Dataset for Environmental Sound Classification
ESC-50, ESC-10, ESC-US, by Warsaw University of Technology
2015 ACM MM, Over 1200 Citations (Sik-Ho Tsang @ Medium)
Sound Classification, Audio Tagging
==== My Other Paper Readings Are Also Over Here ====
- ESC-50 dataset is proposed, which is an annotated collection of 2000 short clips comprising 50 classes of various common sound events. There are also ESC-10 and ESC-US proposed.
- The paper also provides an evaluation of human accuracy in classifying environmental sounds and compares it to the performance of selected baseline classifiers using features derived from mel-frequency cepstral coefficients (MFCC) and zero-crossing rate.
Outline
- ESC-50, ESC-10, ESC-US Datasets
- Results
1. ESC-50, ESC-10, ESC-US Datasets
- ESC means Environmental Sound Classification.
1.1. ESC-50
The ESC-50 dataset consists of 2000 labeled environmental recordings equally balanced between 50 classes (40 clips per class).
- For convenience, they are grouped in 5 loosely defined major categories (10 classes per category): animal sounds, natural soundscapes and water sounds, human (non-speech) sounds, interior/domestic sounds, exterior/urban noises.
- The dataset provides an exposure to a variety of sound sources — some very common (laughter, cat meowing, dog barking), some quite distinct (glass breaking, brushing teeth) and then some where the differences are more nuanced (helicopter and airplane noise).
One of the possible deficiencies of this dataset is the limited number of clips available per class. This is related to the high cost of manual annotation and extraction, and the decision to maintain strictly balanced classes.
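For reference, a minimal sketch of checking this class balance, assuming the public GitHub distribution of ESC-50, which ships a metadata file at meta/esc50.csv with filename and category columns (the local path below is an assumption):

```python
import pandas as pd

# Load the ESC-50 metadata shipped with the public distribution.
meta = pd.read_csv("ESC-50-master/meta/esc50.csv")  # path is an assumption

# Count clips per labeled class.
counts = meta["category"].value_counts()
print(len(counts))      # expected: 50 classes
print(counts.unique())  # expected: [40] -> 40 clips per class
```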
1.2. ESC-10
- The ESC-10 is a selection of 10 classes from the bigger dataset, representing three general groups of sounds:
- Transient/percussive sounds, sometimes with very meaningful temporal patterns (sneezing, dog barking, clock ticking); sound events with strong harmonic content (crying baby, crowing rooster); more or less structured noise/soundscapes (rain, sea waves, fire crackling, helicopter, chainsaw).
- This subset should provide an easier problem to start with.
1.3. ESC-US
- The limited number of instances available in the labeled part of the dataset makes it rather inadequate for more complex knowledge discovery approaches like learning representations from data.
- To mitigate this issue, ESC-US provides an additional set of 250,000 unlabeled recordings.
- Although the ESC-US dataset is not hand-annotated and should be treated as unlabeled, it does include metadata (tags/sound descriptions).
- It should be more fitting for procedures involving unsupervised pretraining and generative models. Apart from clustering and manifold learning experiments, it could also be used in weakly supervised learning regimes (classification with labels partially missing or not specific enough).
2. Results
2.1. Human Evaluation
The human auditory system has little problem recognizing a plethora of sound stimuli, even in very noisy conditions. The real question was: how easy is it?
- The average accuracy achieved was 95.7% for the ESC-10 dataset and 81.3% for ESC-50.
- Recall for individual classes varied greatly between types of sound events — from 34.1% for washing machine noise to almost 100% for crying babies and barking dogs.
Based on these experiments, one can expect that trained and attentive listeners could score flawlessly on the smaller dataset and most probably achieve accuracy levels reaching 90% on the main dataset, with some room for error when classifying more ambiguous mechanical noises and soundscapes.
2.2. Baseline Classifiers
2.2.1. Features
Two types of features were extracted from each clip: zero-crossing rate and mel-frequency cepstral coefficients (MFCC).
- The former is a very simple, yet useful feature, whereas the latter are ubiquitous in speech processing and analyzing harmonic content.
- Discarding the 0-th coefficient, the first 12 MFCCs and the zero-crossing rate were summarized for each clip with their mean and standard deviation across frames, as sketched below.
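A rough sketch of this per-clip feature summary using librosa; the frame and hop parameters are library defaults, not necessarily the paper's exact settings:

```python
import librosa
import numpy as np

def clip_features(path):
    y, sr = librosa.load(path, sr=None)
    # 13 MFCCs per frame; drop the 0-th coefficient, keep the first 12.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)[1:]
    # Zero-crossing rate per frame.
    zcr = librosa.feature.zero_crossing_rate(y)
    frames = np.vstack([mfcc, zcr])  # shape: (13, n_frames)
    # Summarize each feature with its mean and standard deviation across frames.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])
```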
2.2.2. Classifiers
Three types of classifiers were evaluated: k-nearest neighbors (k-NN), a random forest ensemble, and a support vector machine (SVM) with a linear kernel.
- A 5-fold cross-validation regime is used (see the sketch below).
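A hedged sketch of this baseline setup with scikit-learn; the hyperparameters are assumptions (library defaults plus a linear kernel), and a plain 5-fold split is used here rather than the paper's exact folds:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate_baselines(X, y):
    """X: per-clip feature matrix (e.g. from clip_features), y: class labels."""
    models = {
        "k-NN": KNeighborsClassifier(),
        "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
        "linear SVM": SVC(kernel="linear"),
    }
    for name, model in models.items():
        # 5-fold cross-validated accuracy for each baseline classifier.
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```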
2.2.3. Machine Learning Results
On the ESC-10 dataset, average classification accuracy ranged from 66.7% for the k-NN classifier to 72.7% for the random forest ensemble, with the SVM in between (67.5%).
On the ESC-50 dataset, there was less variability between folds when validating the models, but a more pronounced advantage for the random forest ensemble (44.3%) over the SVM (39.6%) and k-NN (32.2%).
- The SVM classifier performed better for animal sounds than the random forest ensemble.