Brief Review — Audio Set: An Ontology and Human-Labeled Dataset for Audio Events

AudioSet Dataset, Over 1.7M Segments, 485 Audio Event Categories

Oct 30, 2023

Audio Set: An Ontology and Human-Labeled Dataset for Audio Events
AudioSet, by Google, Inc.,
2017 ICASSP, Over 2700 Citations (Sik-Ho Tsang @ Medium)
Sound Classification / Audio Tagging / Sound Event Detection (SED)
2015 [ESC-50, ESC-10, ESC-US] 2021 [Audio Spectrogram Transformer (AST)]

In computer vision, there is a very famous dataset called ImageNet, comprising over a million images. Yet in audio there was no comparable dataset at that time. This paper describes the creation of AudioSet, a large-scale dataset of manually annotated audio events.
Using a carefully structured hierarchical ontology of 632 audio classes, data is collected from human labelers to probe the presence of specific audio classes in 10-second segments of YouTube videos.

Outline

1. AudioSet Dataset
2. Benchmarking

1. AudioSet Dataset

1.1. Ontology

The ontology is released as a JSON file containing the following fields for each category:

ID: The Knowledge Graph Machine ID (MID) best matching the sound or source, used as the primary identifier for the class. In Knowledge Graph, many MIDs refer to specific objects (e.g., /m/0gy1t2s, “Bicycle bell” or /m/02y_763, “Sliding door”).
Display name: A brief one- or two-word name for the sound, sometimes with a small number of comma-separated alternatives (e.g., “Burst, pop”), and sometimes with parenthesized disambiguation (e.g., “Fill (with liquid)”).
Description: A longer description, typically one or two sentences, used to explain the meaning and limits of the class. Many descriptions are based on Wikipedia and WordNet.
Examples: As an alternative to the textual descriptions, at least one example of the sound is also collected (except for “Abstract” classes, described below). At present, all examples are provided as URLs indicating short excerpts from public YouTube videos.
Children: The hierarchy is encoded by including within each node the MIDs of all the immediate children of that category.
Restrictions: Of the 632 categories, 56 are “blacklisted”, meaning they are not exposed to labelers because they have turned out to be obscure (e.g., “Alto saxophone”) or confusing (e.g., “Sounds of things”). Another 22 nodes are marked “Abstract” (e.g., “Onomatopoeia”), meaning that they exist purely as intermediate nodes to help structure the ontology.

The above figure shows a JSON example.
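To make the field list concrete, here is a minimal sketch of loading such a file and walking the hierarchy through the children field. The key names (`id`, `name`, `child_ids`, `restrictions`, `examples`), the parent node, and the placeholder descriptions are assumptions made for illustration; only the “Bicycle bell” MID comes from the text above, and the released JSON should be checked for its actual keys.

```python
import json

# Hypothetical two-node excerpt. The key names and the parent node are
# illustrative assumptions; only the "Bicycle bell" MID comes from the text.
ONTOLOGY_JSON = """
[
  {
    "id": "/m/hypothetical_parent",
    "name": "Hypothetical parent",
    "description": "Placeholder intermediate node.",
    "examples": [],
    "child_ids": ["/m/0gy1t2s"],
    "restrictions": ["abstract"]
  },
  {
    "id": "/m/0gy1t2s",
    "name": "Bicycle bell",
    "description": "Placeholder description.",
    "examples": [],
    "child_ids": [],
    "restrictions": []
  }
]
"""


def load_ontology(text):
    """Index ontology nodes by their MID."""
    return {node["id"]: node for node in json.loads(text)}


def descendants(ontology, mid):
    """Return the MIDs of all nodes below `mid`, depth first."""
    found = []
    for child in ontology[mid]["child_ids"]:
        found.append(child)
        found.extend(descendants(ontology, child))
    return found


def rateable_names(ontology):
    """Display names of nodes that carry no restriction."""
    return [n["name"] for n in ontology.values() if not n["restrictions"]]


ontology = load_ontology(ONTOLOGY_JSON)
print(descendants(ontology, "/m/hypothetical_parent"))
print(rateable_names(ontology))
```

Storing only the immediate children in each node keeps the file flat, and the full hierarchy can be recovered with a short recursive walk like `descendants`.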

1.2. Rating (Labeling)

Human raters were presented with 10-second segments including both the video and audio components, but did not have access to the title or other meta-information.

For each segment, raters were asked to independently rate the presence of one or more labels. The possible ratings were “present”, “not present” and “unsure”.

Each segment was rated by three raters, and a majority vote was used.
The raters were unanimous in 76.2% of votes. The “unsure” rating was rare, representing only 0.5% of responses, so 2:1 majority votes account for 23.6% of the decisions.
Spot checking was also performed. This exposed some commonly misinterpreted labels, which were then removed from the ontology. Due to the scale of the data and since majority agreement was very high, no other corrective actions were taken.
When selecting video segments for rating, about half of the audio events corresponded to labels already predicted by an internal video-level automatic annotation system. (Please read the paper for more details.)
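The three-rater majority vote described above can be sketched as follows. The function names and the choice to return `None` when no rating wins a strict majority are assumptions for illustration; the paper's own aggregation code is not shown in this review.

```python
from collections import Counter

RATINGS = ("present", "not present", "unsure")


def majority_label(ratings):
    """Return the rating chosen by a strict majority of raters, else None."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes * 2 > len(ratings) else None


def is_unanimous(ratings):
    """True when every rater gave the same rating."""
    return len(set(ratings)) == 1


votes = ["present", "present", "not present"]  # a 2:1 decision
print(majority_label(votes), is_unanimous(votes))
```

With three raters and three possible ratings, a 1:1:1 split is possible in principle, which is why the sketch handles the no-majority case explicitly.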

1.3. Dataset

The released dataset constitutes a subset of the collected material.

Maximally-balanced train and test subsets are provided (from disjoint videos), chosen to provide at least 50 positive examples (in both subsets) for as many classes as possible. Even so, very common labels such as “Music” ended up with more than 5000 labels. The resulting dataset includes 1,789,621 segments (4,971 hours), comprising at least 100 instances for 485 audio event categories.
The unbalanced train set contains 1,771,873 segments, and the evaluation set contains 17,748. Because single segments can have multiple labels (on average 2.7 labels per segment), the overall count of labels is not uniform, and is distributed as shown in Fig. 3 above.
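The figures quoted above can be cross-checked with a little arithmetic. This sketch assumes every segment is a full 10 seconds long; the label total is only a rough estimate, because the 2.7 labels-per-segment average is itself rounded.

```python
UNBALANCED_TRAIN_SEGMENTS = 1_771_873
EVALUATION_SEGMENTS = 17_748
SEGMENT_SECONDS = 10          # assumption: every segment is a full 10 s
AVG_LABELS_PER_SEGMENT = 2.7  # rounded average quoted in the text

total_segments = UNBALANCED_TRAIN_SEGMENTS + EVALUATION_SEGMENTS
total_hours = total_segments * SEGMENT_SECONDS / 3600
approx_labels = total_segments * AVG_LABELS_PER_SEGMENT

print(total_segments)              # matches the 1,789,621 segments quoted
print(round(total_hours))          # matches the 4,971 hours quoted
print(round(approx_labels / 1e6, 1), "million labels (rough estimate)")
```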

2. Benchmarking

Using the embedding layer representation of a deep-network classifier trained on a large set of generic video topic labels [22], the training portion of the AudioSet YouTube Corpus is used to train a shallow fully-connected neural network classifier for the 485 categories in the released segments.

The test partition is used for evaluation by applying the classifier to 1-second frames taken from each segment, averaging the scores, and then, for each category, ranking all segments by their scores.

This system gave a balanced mean Average Precision across the 485 categories of 0.314, and an average AUC of 0.959 (corresponding to a d-prime class separation of 2.452).
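The evaluation procedure above (average the per-frame scores of each segment, rank the segments for a category, then score the ranking) can be sketched for a single category as follows. The toy scores are made up, and the Average Precision used here is the common "mean of precision at each positive" definition, which is an assumption; the review does not show the paper's exact evaluation code. The balanced mean AP would then be the unweighted mean of this value over all categories.

```python
from statistics import mean


def segment_score(frame_scores):
    """Average the classifier scores of the 1-second frames in a segment."""
    return mean(frame_scores)


def average_precision(scored_segments):
    """AP for one category from (score, is_positive) pairs."""
    ranked = sorted(scored_segments, key=lambda s: s[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive
    return mean(precisions) if precisions else 0.0


# Toy data (made up): per-frame scores and ground truth for four segments.
segments = [
    ([0.9, 0.8], True),
    ([0.2, 0.4], False),
    ([0.6, 0.6], True),
    ([0.7, 0.7], False),
]
scored = [(segment_score(frames), label) for frames, label in segments]
ap = average_precision(scored)
print(ap)
```

In this toy ranking the positives land at ranks 1 and 3, so the AP is the mean of 1/1 and 2/3.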

The category with the best AP was “Music” with AP / AUC / d-prime of 0.896 / 0.951 / 2.34 (reflecting its high prior); the worst AP was for “Rattle” with 0.020 / 0.796 / 1.168.
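The d-prime values quoted alongside the AUCs are consistent with the standard equal-variance Gaussian relation d' = sqrt(2) * Phi^-1(AUC), where Phi^-1 is the inverse standard normal CDF. The review does not state which formula the paper used, so treat the sketch below as an assumption that happens to reproduce the reported per-class numbers closely.

```python
from math import sqrt
from statistics import NormalDist


def d_prime_from_auc(auc):
    """d-prime implied by an AUC under the equal-variance Gaussian model."""
    return sqrt(2) * NormalDist().inv_cdf(auc)


# AUC values quoted in the text above.
for name, auc in [("Music", 0.951), ("Rattle", 0.796), ("average", 0.959)]:
    print(name, round(d_prime_from_auc(auc), 3))
```

For “Music” this gives roughly 2.34, and for “Rattle” a value within rounding error of 1.168. The value computed from the average AUC comes out slightly above 2.452, which would be expected if the paper averaged per-class d-primes rather than converting the average AUC.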