Brief Review — WaveNet: A Generative Model for Raw Audio

WaveNet Generates Speech & Music

Sik-Ho Tsang
5 min read · Jul 1, 2024
A second of generated speech (Image from Google DeepMind)

WaveNet: A Generative Model for Raw Audio
WaveNet, by Google DeepMind
2016 SSW, Over 1600 Citations (Sik-Ho Tsang @ Medium)

Acoustic Model / Automatic Speech Recognition (ASR) / Speech-to-Text Modeling
2019 [SpecAugment] [Cnv Cxt Tsf] 2020 [FAIRSEQ S2T] [PANNs] [Conformer]
==== My Other Paper Readings Are Also Over Here ====

  • WaveNet is proposed, which can be efficiently trained on data with tens of thousands of samples per second of audio.
  • It can be applied to text-to-speech, music generation and speech recognition.
  • The corresponding audio samples can be listened to on the official website: https://deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/
  • (I read this paper as it was introduced by a colleague last week when we were discussing ASR.)

Outline

  1. Preliminaries
  2. WaveNet
  3. Results

1. Preliminaries

  • The joint probability of a waveform x = {x1, …, xT} is factorised as a product of conditional probabilities (see the factorisation below).
  • Normally, this conditional probability distribution is modelled by a stack of causal convolutional layers.

However, the receptive field remains small even when many layers are stacked.
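For reference, the autoregressive factorisation mentioned in the first bullet above is:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}),$$

i.e. each audio sample $x_t$ is conditioned on the samples at all previous timesteps.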

2. WaveNet

2.1. Dilated Convolution

Dilated Convolution (Image from Google DeepMind)
  • Dilated convolutions, which skip input values with a certain step so that the filter covers an area larger than its length, have previously been used in DeepLab and DilatedNet.
  • The above figure depicts dilated causal convolutions for dilations 1, 2, 4, and 8.

In this paper, the dilation is doubled for every layer up to a limit and the pattern is then repeated, e.g. 1, 2, 4, …, 512, 1, 2, 4, …, 512.
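A minimal PyTorch sketch of such a stack of causal dilated convolutions (not the authors' implementation; the channel width of 32 and the three repeats of dilations 1–512 are illustrative assumptions, and gating/nonlinearities are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution with left-only (causal) padding and a dilation factor."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):               # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))     # left padding keeps the layer causal
        return self.conv(x)

# Dilations doubled per layer up to a limit, then the pattern is repeated,
# e.g. 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)] * 3
stack = nn.Sequential(*[CausalDilatedConv1d(32, dilation=d) for d in dilations])

# Receptive field of a stack of kernel-size-2 layers: 1 + sum of the dilations.
receptive_field = 1 + sum(dilations)
print(receptive_field)  # 3070 samples for this example configuration
```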

2.2. Softmax Distribution

  • Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values.

To make this more tractable, a μ-law companding transformation (ITU-T, 1988) is first applied to the data, which is then quantized to 256 possible values.
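As given in the paper, with μ = 255 and −1 < x_t < 1, the transformation is:

$$f(x_t) = \operatorname{sign}(x_t)\,\frac{\ln\!\left(1 + \mu\,|x_t|\right)}{\ln\!\left(1 + \mu\right)}$$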

  • This non-linear quantization produces a significantly better reconstruction than a simple linear quantization scheme.
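A small NumPy sketch of this companding and quantisation (the mapping to integer bins 0–255 and the rounding are my own illustrative choices, not taken from the paper's code):

```python
import numpy as np

def mu_law_encode(audio, mu=255, bins=256):
    """Quantise waveform samples in [-1, 1] to `bins` integer levels via mu-law."""
    audio = np.clip(audio, -1.0, 1.0)
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Map [-1, 1] to integer bins 0 .. bins-1
    return ((compressed + 1.0) / 2.0 * (bins - 1)).round().astype(np.int64)

def mu_law_decode(indices, mu=255, bins=256):
    """Invert the quantisation back to approximate waveform values."""
    compressed = indices.astype(np.float64) / (bins - 1) * 2.0 - 1.0
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu

x = np.sin(np.linspace(0, 2 * np.pi, 16000))  # one second of a test tone at 16 kHz
q = mu_law_encode(x)
print(q.min(), q.max())                       # stays within 0 .. 255
```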

2.3. Gated Activation Units

Gated activation units are used, which were found to work better than ReLU.
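As written in the paper, with $*$ denoting convolution, $\odot$ element-wise multiplication, $\sigma(\cdot)$ the sigmoid, $k$ the layer index, and $f$ and $g$ the filter and gate:

$$\mathbf{z} = \tanh\!\left(W_{f,k} * \mathbf{x}\right) \odot \sigma\!\left(W_{g,k} * \mathbf{x}\right)$$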

2.4. Residual and Skip Connections

Overview of the residual block and the entire architecture.

Both residual (as in ResNet) and parameterised skip connections are used throughout the network to speed up convergence and to enable the training of much deeper models.
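A minimal PyTorch sketch of one residual block, assuming the gated dilated convolution above followed by 1×1 convolutions for the residual and skip paths (layer sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveNetResidualBlock(nn.Module):
    """One residual block: gated causal dilated conv, then 1x1 convs for the
    residual and skip paths (a sketch; channel sizes are illustrative)."""
    def __init__(self, channels, skip_channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.residual_1x1 = nn.Conv1d(channels, channels, 1)
        self.skip_1x1 = nn.Conv1d(channels, skip_channels, 1)

    def forward(self, x):                        # x: (batch, channels, time)
        padded = F.pad(x, (self.pad, 0))          # causal (left-only) padding
        z = torch.tanh(self.filter_conv(padded)) * torch.sigmoid(self.gate_conv(padded))
        skip = self.skip_1x1(z)                   # skip outputs are summed across all blocks
        residual = self.residual_1x1(z) + x       # residual connection back to the block input
        return residual, skip
```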

2.5. Conditional WaveNets

Given an additional input h, WaveNets can model the conditional distribution p(x|h) of the audio given this input.

  • By conditioning the model on other input variables, we can guide WaveNet’s generation to produce audio with the required characteristics. For example, in a multi-speaker setting we can choose the speaker by feeding the speaker identity to the model as an extra input. Similarly, for TTS we need to feed information about the text as an extra input.

Global conditioning is characterised by a single latent representation h that influences the output distribution across all timesteps, e.g. a speaker embedding in a TTS model.

For local conditioning we have a second timeseries ht, possibly with a lower sampling frequency than the audio signal, e.g. linguistic features in a TTS model.
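For reference, the conditioned gated activation from the paper takes the following form for global conditioning, where $V_{f,k}^{T}\mathbf{h}$ is broadcast over the time dimension:

$$\mathbf{z} = \tanh\!\left(W_{f,k} * \mathbf{x} + V_{f,k}^{T}\mathbf{h}\right) \odot \sigma\!\left(W_{g,k} * \mathbf{x} + V_{g,k}^{T}\mathbf{h}\right)$$

For local conditioning, h is first upsampled to a new time series y = f(h) (e.g. with a transposed convolution) at the audio resolution, and the conditioning terms become $V_{f,k} * \mathbf{y}$ and $V_{g,k} * \mathbf{y}$, i.e. 1×1 convolutions over the upsampled signal.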

3. Results

3.1. Multi-Speaker Speech Generation

  • The English multi-speaker corpus from the CSTR Voice Cloning Toolkit (VCTK) is used. The dataset consists of 44 hours of data from 109 different speakers.

It generates non-existent but human-language-like words in a smooth way with realistic-sounding intonations. This is similar to generative models of language or images, where samples look realistic at first glance but are clearly unnatural upon closer inspection.

The lack of long range coherence is partly due to the limited size of the model’s receptive field (about 300 milliseconds), which means it can only remember the last 2–3 phonemes it produced.

3.2. Text-To-Speech (TTS)

Text-To-Speech (TTS) MOS

The single-speaker speech databases from Google’s North American English and Mandarin Chinese TTS systems are used.

  • The North American English dataset contains 24.6 hours of speech data, and the Mandarin Chinese dataset contains 34.8 hours.
  • The receptive field size of the WaveNets was 240 milliseconds.
  • In the MOS tests, after listening to each stimulus, the subjects were asked to rate the naturalness of the stimulus on a five-point Likert scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent).

WaveNets achieved MOS naturalness scores above 4.0 on the five-point scale, which were significantly better than those of the baseline systems.

Subjective paired comparison test results

It can be seen from the results that WaveNet outperformed the baseline statistical parametric and concatenative speech synthesizers in both languages.

WaveNet conditioned on linguistic features could synthesize speech samples with natural segmental quality, but it sometimes had unnatural prosody, stressing the wrong words in a sentence.

3.3. Music

  • Two music datasets are used: the MagnaTagATune dataset, which consists of about 200 hours of music audio in which each 29-second clip is annotated with tags from a set of 188, and the YouTube piano dataset, which consists of about 60 hours of solo piano music obtained from YouTube videos.

The samples were often harmonic and aesthetically pleasing, even when produced by unconditional models. (No objective evaluation results are reported.)

  • Of particular interest are conditional music models, which generate music given a set of tags. After cleaning up the tag set by merging similar tags and removing those with too few associated clips, the authors found this approach to work reasonably well.

3.4. Speech Recognition

  • Traditionally, speech recognition research has largely focused on using log mel-filterbank energies or mel-frequency cepstral coefficients (MFCCs), but the field has recently been moving towards raw audio.
  • WaveNets are tested on the TIMIT Dataset.
  • For this task, a mean-pooling layer is added after the dilated convolutions.

WaveNet is trained with two loss terms, one to predict the next sample and one to classify the frame. With the two losses, the model generalised better than with a single loss, and it achieved 18.8 PER (phone error rate) on the test set, which was, at the time, the best score obtained from a model trained directly on raw audio on TIMIT.
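A minimal PyTorch sketch of how such a two-headed objective could be wired up (the frame length of 160 samples, channel width, and the 61-phone label set are my own illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveNetASRHead(nn.Module):
    """Two-headed setup: one head predicts the next quantised sample, the
    other classifies each frame after mean pooling (sizes are illustrative)."""
    def __init__(self, channels=128, num_classes=61, quant_levels=256, frame_len=160):
        super().__init__()
        self.frame_len = frame_len                        # e.g. 10 ms at 16 kHz
        self.next_sample = nn.Conv1d(channels, quant_levels, 1)
        self.frame_classifier = nn.Conv1d(channels, num_classes, 1)

    def forward(self, features, target_samples, target_frames):
        # features: (batch, channels, time) activations from the dilated stack
        sample_logits = self.next_sample(features)
        loss_ar = F.cross_entropy(sample_logits, target_samples)

        # Non-overlapping mean pooling down to frame rate, then classify each frame.
        pooled = F.avg_pool1d(features, self.frame_len)
        frame_logits = self.frame_classifier(pooled)
        loss_cls = F.cross_entropy(frame_logits, target_frames)

        return loss_ar + loss_cls                          # joint training objective
```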

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.