Brief Review — The INTERSPEECH 2017 Computational Paralinguistics Challenge: Addressee, Cold & Snoring

INTERSPEECH 2017 Challenges: Addressee, Cold & Snoring

Sik-Ho Tsang
5 min read · Aug 18, 2024
Snore Sound: VOTE (Image from paper: Automatic classification of excitation location of snoring sounds)

The INTERSPEECH 2017 Computational Paralinguistics Challenge: Addressee, Cold & Snoring
INTERSPEECH 2017 Challenges: Addressee, Cold & Snoring, by Imperial College London, University of Passau, FAU Erlangen-Nuremberg, Duke University, University of Wuppertal, Technische Universität München, Max Planck Institute for Psycholinguistics, Purdue University, University of Manitoba, University of California, Alfried Krupp Krankenhaus, Carl-Thiem-Klinikum
2017 INTERSPEECH, Over 180 Citations (Sik-Ho Tsang @ Medium)

==== My Healthcare and Medical Related Paper Readings ====
==== My Other Paper Readings Are Also Over Here ====

  • The INTERSPEECH 2017 Computational Paralinguistics Challenge addresses three different problems for the first time in a research competition.
  • In the Addressee sub-challenge, it has to be determined whether speech produced by an adult is directed towards another adult or towards a child.
  • In the Cold sub-challenge, speech under cold has to be told apart from ‘healthy’ speech.
  • In the Snoring sub-challenge, 4 different types of snoring have to be classified.
  • In this paper, end-to-end learning with convolutional and recurrent neural networks is also developed.

Outline

  1. Datasets: Addressee (A), Cold (C), Snoring (S)
  2. Approaches: e2e, COMPARE, BoAW
  3. Results

1. Datasets: Addressee (A), Cold (C), Snoring (S)

Datasets

1.1. Addressee (A)

  • HOMEBANK CHILD/ADULT ADDRESSEE CORPUS (HB-CHAAC) is used.
  • The task is to differentiate between speech produced by an adult that is directed to a child (child-directed speech, CDS) or directed to another adult (adult-directed speech, ADS).
  • HB-CHAAC consists of a set of conversations (see below) selected from a much larger corpus of real-world child language recordings known as HomeBank [12] (homebank.talkbank.org).
Distribution of recordings and child age
  • Ages were sampled as uniformly as possible between 2 and 24 months and across the 4 contributing laboratory datasets, with each child sampled only once.

In total, 1,220 conversations, comprising 2,523 minutes of recordings, were selected. Three trained research assistants judged whether each clip was directed to a child (CDS) or an adult (ADS), using both acoustic-phonetic information and context.

  • All CDS and ADS clips were additionally labelled by the research assistants as to whether the speaker was male or female.

1.2. Cold (C)

  • UPPER RESPIRATORY TRACT INFECTION CORPUS (URTIC) consists of recordings of 630 subjects, made in quiet rooms with a microphone/headset/hardware setup.
  • The subjects were asked to read out short stories. Besides scripted speech, spontaneous narrative speech was recorded. The whole session lasted from 15 minutes to 2 hours. The available recordings were split into 28,652 chunks with a duration between 3 s and 10 s.
  • Labels are derived from a one-item measure based on the German version of the Wisconsin Upper Respiratory Symptom Survey (WURSS-24), which assesses the symptoms of common cold. The global illness severity item (on a scale of 0 = not sick to 7 = severely sick) was binarised using a threshold at 6.

The data is split into speaker-independent partitions of 210 speakers each. In each of the training and development partitions, 37 participants had a cold and 173 did not. The total duration is approximately 45 hours.
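To make the two preprocessing steps above concrete, here is a minimal sketch with dummy data: binarising the WURSS-24 severity item at the stated threshold, and building a speaker-independent split with scikit-learn's GroupShuffleSplit. All variable names and sizes are illustrative, not from the challenge tooling.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical per-speaker WURSS-24 global severity scores (0..7);
# a threshold at 6 yields the binary cold/no-cold label.
severity = np.array([0, 2, 6, 7, 1])
has_cold = (severity >= 6).astype(int)                  # -> [0, 0, 1, 1, 0]

# Hypothetical chunk-level data: features, speaker IDs, chunk labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                          # 100 chunks, 10 dummy features
speaker = rng.integers(0, 20, size=100)                 # 20 speakers
y = has_cold[speaker % 5]                               # label follows the speaker

# Speaker-independent split: every chunk of a given speaker lands in
# exactly one partition, so classifiers cannot exploit speaker identity.
gss = GroupShuffleSplit(n_splits=1, test_size=1/3, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=speaker))
assert set(speaker[train_idx]).isdisjoint(speaker[test_idx])
```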

1.3. Snoring (S)

  • MUNICH-PASSAU SNORE SOUND CORPUS (MPSSC) is used.
  • Snoring is generated by vibrating soft tissue in the upper airways (UA) during inspiration in sleep.
  • The basic material for the corpus is uncut recordings from Drug Induced Sleep Endoscopy (DISE) examinations at 3 medical centres, recorded between 2006 and 2015.
  • During a DISE procedure, a flexible nasopharyngoscope is introduced into the UA while the patient is in a state of artificial sleep, so that vibration mechanisms and locations can be observed while video and audio signals are recorded. Since the procedure is time-consuming, it is desirable to develop alternative methods for classifying snore sounds, e.g., based on acoustic features.
  • More than 30 hours of DISE recordings have been automatically screened for audio events.
  • The extracted events were manually screened: non-snore events and events disturbed by non-static background noise were discarded. The remaining snore events were classified by ear, nose, and throat experts. Only events with a clearly identifiable, single site of vibration and without obstructive disposition were included in the database.
  • Four classes are defined based on the VOTE scheme:
  1. V — Velum (palate), including soft palate, uvula, and lateral velopharyngeal walls.
  2. O — Oropharyngeal lateral walls, including palatine tonsils.
  3. T — Tongue, including tongue base and airway posterior to the tongue base.
  4. E — Epiglottis.

The resulting database contains audio samples of 828 snore events from 219 subjects. The number of events per class in the database is strongly unbalanced, with 84% of samples from the classes V and O, 11% E-events, and 5% T-snores.

2. Approaches: e2e, COMPARE, BoAW

2.1. e2e: CNN + LSTM

  • Similar to [20], a convolutional neural network (CNN) is used to extract features from the raw time representation, and a subsequent M-layer recurrent network (LSTM, with M = 2 or 3) performs the final classification.
  • The raw waveform is split into chunks of 40 ms each, which was found to be a good compromise (see the sketch below).
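A minimal PyTorch sketch of such an e2e model, assuming 16 kHz audio: a 1-D CNN turns each 40 ms chunk into a feature vector, and an M-layer LSTM classifies the chunk sequence. The filter sizes, hidden size, and pooling are illustrative guesses, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EndToEnd(nn.Module):
    """1-D CNN over raw 40 ms chunks + M-layer LSTM over the chunk sequence."""
    def __init__(self, n_classes, sample_rate=16000, m_layers=2):
        super().__init__()
        self.chunk = int(0.040 * sample_rate)           # 640 samples at 16 kHz
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 40, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(40, 40, kernel_size=20, stride=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                    # one 40-d vector per chunk
        )
        self.rnn = nn.LSTM(input_size=40, hidden_size=64,
                           num_layers=m_layers, batch_first=True)
        self.out = nn.Linear(64, n_classes)

    def forward(self, wave):                            # wave: (batch, n_samples)
        b = wave.size(0)
        n = wave.size(1) // self.chunk                  # whole 40 ms chunks only
        x = wave[:, :n * self.chunk].reshape(b * n, 1, self.chunk)
        feats = self.cnn(x).squeeze(-1).reshape(b, n, -1)
        _, (h, _) = self.rnn(feats)                     # h: (m_layers, batch, 64)
        return self.out(h[-1])                          # logits from last LSTM layer

model = EndToEnd(n_classes=2)                           # e.g. Cold: cold vs. healthy
logits = model(torch.randn(4, 16000))                   # 4 one-second waveforms -> (4, 2)
```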

2.2. COMPARE Acoustic Feature Set

  • The official baseline feature set is the same as has been used in the four previous editions of the INTERSPEECH COMPARE challenges.
  • This feature set contains 6,373 static features resulting from the computation of various functionals over low-level descriptor (LLD) contours.
  • A linear SVM trained with SMO is used as the classifier (see the sketch below).
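A minimal sketch of this pipeline: the opensmile Python package reproduces the 6,373-dimensional ComParE_2016 functionals, and scikit-learn's LinearSVC stands in for the Weka SMO trainer of the official baseline. File names, labels, and the C value are illustrative.

```python
import opensmile                                        # pip install opensmile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# openSMILE's ComParE_2016 set yields the 6,373 functionals over the
# 65 LLDs and their deltas used as the challenge baseline features.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
train_wavs = ["train_0001.wav", "train_0002.wav"]       # hypothetical paths
dev_wavs = ["dev_0001.wav"]
y_train = [0, 1]                                        # hypothetical labels

X_train = smile.process_files(train_wavs)               # one row per file
X_dev = smile.process_files(dev_wavs)

# Linear SVM as a stand-in for Weka's SMO; ComParE baselines tune the
# complexity parameter C over small values.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1e-4))
clf.fit(X_train, y_train)
pred = clf.predict(X_dev)
```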

2.3. Bag-of-Audio-Words (BoAW)

  • One codebook is learnt for the 65 LLDs from the COMPARE feature set and one for the 65 deltas of these LLDs.
  • As for the COMPARE features, an SVM with SMO is used as the classifier (a sketch of the BoAW representation follows).
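The official BoAW baseline is computed with the openXBOW toolkit; the sketch below captures the core idea in numpy/scikit-learn, with k-means as a stand-in for openXBOW's codebook generation. Codebook size, weighting, and the dummy LLD data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_llds = [rng.normal(size=(200, 65)) for _ in range(10)]  # dummy: 10 utterances, 65 LLDs/frame

# Learn one codebook over all training LLD frames; the baseline learns a
# second codebook for the 65 delta coefficients and concatenates both
# histograms before classification.
codebook = KMeans(n_clusters=500, n_init=4, random_state=0).fit(np.vstack(train_llds))

def boaw(frames, codebook):
    """Quantise each frame to its nearest codeword and count occurrences:
    one fixed-length bag-of-audio-words histogram per utterance."""
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return np.log1p(hist)                               # log term-frequency compression

X_train = np.stack([boaw(f, codebook) for f in train_llds])
# X_train then feeds the same SVM classifier as in Section 2.2.
```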

2.4. Fused Model

  • The above 3 models are fused in different combinations for better performance (a simple late-fusion sketch follows).
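Leaving the paper's exact fusion recipe aside, the simplest late-fusion scheme is a majority vote over the systems' predicted labels; a sketch with hypothetical predictions:

```python
import numpy as np

def majority_vote(preds):
    """preds: (n_systems, n_instances) integer labels. Per instance, return
    the most frequent label; ties resolve to the smallest label index."""
    n_classes = preds.max() + 1
    return np.array([np.bincount(col, minlength=n_classes).argmax()
                     for col in preds.T])

pred_e2e     = np.array([0, 1, 1, 0])                   # hypothetical predictions
pred_compare = np.array([0, 1, 0, 0])
pred_boaw    = np.array([1, 1, 1, 0])
fused = majority_vote(np.vstack([pred_e2e, pred_compare, pred_boaw]))
print(fused)                                            # [0 1 1 0]
```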

3. Results

For the Addressee (A) sub-challenge, the baseline is UAR = 70.2%.
For the Cold (C) sub-challenge, it is UAR = 71.0%.
For the Snoring (S) sub-challenge, it is UAR = 58.5%.
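
UAR, the challenge metric, is the unweighted average of per-class recalls, so each class counts equally however rare it is; this matters for the heavily imbalanced Cold and Snoring tasks. In scikit-learn it is macro-averaged recall:

```python
from sklearn.metrics import recall_score

# Hypothetical Snoring-style predictions (classes: 0=V, 1=O, 2=T, 3=E).
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 1, 0, 3]
# Per-class recalls: V = 3/4, O = 2/2, T = 0/1, E = 1/1
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")                               # UAR = 0.688
```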

  • All but two of the 60 classifications (20 for each of the three tasks) show the expected gain on Test compared to Dev, due to the increased training set (Train plus Dev).
  • The marked difference observed for Snoring might also be due to the unbalanced distribution between classes.
