Brief Review — Very Deep Convolutional Neural Networks for Raw Waveforms

M3, M5, M11, M18, M34-res are Designed

Sik-Ho Tsang
4 min read · Feb 16, 2024

Very Deep Convolutional Neural Networks for Raw Waveforms, by Carnegie Mellon University, Stanford University, Bosch
2017 ICASSP, Over 440 Citations (Sik-Ho Tsang @ Medium)

Sound Classification / Audio Tagging / Sound Event Detection (SED)
2015 [ESC-50, ESC-10, ESC-US] 2017 [AudioSet / Audio Set] 2021 [Audio Spectrogram Transformer (AST)]

  • Very deep convolutional neural networks (CNNs) are designed that directly use time-domain waveforms as inputs.
  • The proposed CNNs, with up to 34 weight layers, are efficient to optimize over very long sequences (e.g., a vector of size 32000), which is necessary for processing acoustic waveforms.
  • This is achieved through batch normalization, residual learning, and a careful design of down-sampling in the initial layers.
  • The model is referred to as DaiNet in PANNs, after the surname of the first author.


  1. M3, M5, M11, M18, M34-res
  2. Results

1. M3, M5, M11, M18, M34-res


1.1. Deep Architecture

To build very deep networks, a very small receptive field of 3 is used for all 1D convolutional layers except the first.

  • Furthermore, the temporal resolution is aggressively reduced in the first two layers by 16x with large convolutional and max pooling strides to limit the computation cost in the rest of the network.
  • After the first two layers, the reduction of resolution is complemented by a doubling in the number of feature maps.
  • ReLU is used.
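
The 16x reduction can be checked with a little arithmetic. The sketch below assumes a stride of 4 for both the first convolution and the first max pooling, and "same"-style padding so that each output length is ceil(n / stride); the function names are made up for illustration, and the exact strides and padding should be checked against the paper.

```python
import math


def same_padded_len(n: int, stride: int) -> int:
    """Output length of a strided layer, ASSUMING 'same'-style padding."""
    return math.ceil(n / stride)


def front_end_lengths(n_samples: int = 32000,
                      conv_stride: int = 4,
                      pool_stride: int = 4):
    """Temporal length after the first conv layer and after the first max pooling.

    The strides of 4 are an assumption chosen so that 4 * 4 = 16x.
    """
    after_conv = same_padded_len(n_samples, conv_stride)
    after_pool = same_padded_len(after_conv, pool_stride)
    return after_conv, after_pool


after_conv, after_pool = front_end_lengths()
print(after_conv, after_pool, 32000 // after_pool)  # 8000 2000 16
```

Under these assumptions, every later layer only sees 2000 time steps instead of 32000, which is where the computational saving comes from.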

1.2. Fully Convolutional Network (FCN)

  • Most deep convolutional networks for classification use two or more high-dimensional fully connected (FC) layers, leading to a very high number of parameters.

Instead of FC layers, a single global average pooling layer is used, which reduces each feature map into one float by averaging the activation across the temporal dimension.
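
As a minimal sketch of what global average pooling does, the toy example below averages each feature map over time using plain Python lists. The activation values are made up, and `global_average_pool` is a hypothetical helper name, not something from the paper.

```python
def global_average_pool(feature_maps):
    """Average each channel over time: (channels x time) -> (channels,)."""
    return [sum(channel) / len(channel) for channel in feature_maps]


# Toy activations: 3 feature maps with 4 time steps each (made-up values).
activations = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 0.0, 2.0, 2.0],
    [5.0, 5.0, 5.0, 5.0],
]

pooled = global_average_pool(activations)
print(pooled)  # [2.5, 1.0, 5.0]
```

The pooled vector has one value per feature map regardless of the clip length, and the pooling itself adds no trainable parameters.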

1.3. First Layer Receptive Field

The first-layer receptive field is chosen to cover a 10-millisecond duration, which is similar to the window size used in many MFCC computations.
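
To see how 10 milliseconds maps to a kernel size, here is a small sketch that converts a duration into a number of samples. It uses the 8 kHz sampling rate mentioned in the dataset section of this post; `window_in_samples` is a hypothetical helper name.

```python
def window_in_samples(duration_ms: float, sample_rate_hz: int) -> int:
    """Number of samples spanned by a window of the given duration."""
    return round(duration_ms * sample_rate_hz / 1000)


print(window_in_samples(10, 8000))  # 80
```

This matches the receptive field (RF) of 80 quoted for M11 and M18 in the ablation studies.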

1.4. Batch Normalization (BN)

  • BN is applied on the output of each convolutional layer before applying ReLU non-linearity.
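
The ordering (convolution, then BN, then ReLU) can be sketched for a single channel as follows. This is a simplified illustration, assuming training-mode batch statistics and no learnable scale or shift (i.e. gamma = 1, beta = 0); the input values are made up.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize one channel to roughly zero mean and unit variance.

    Simplification: no learnable scale/shift and no running statistics.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


# Made-up outputs of a convolutional layer for one channel.
conv_output = [-2.0, 0.0, 1.0, 5.0]

# BN is applied first, ReLU second.
activated = relu(batch_norm(conv_output))
print(activated)
```

Because the normalization happens before the non-linearity, ReLU sees inputs centred around zero rather than the raw convolution outputs.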

1.5. Residual Learning

Residual learning is achieved through a skip connection in the residual block (“res-block”). Residual learning is applied in M34-res.
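
A skip connection simply adds the block input back onto the block output. The toy sketch below assumes a block of the form conv, ReLU, conv, followed by the addition and a final ReLU, with kernel size 3 and zero padding. BN is left out for brevity, so this is not the exact res-block of the paper, and all names are hypothetical.

```python
def conv1d_same(x, kernel):
    """1D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel length (e.g., 3).
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def res_block(x, kernel_a, kernel_b):
    """y = ReLU(F(x) + x), where F is conv -> ReLU -> conv (BN omitted)."""
    residual = conv1d_same(relu(conv1d_same(x, kernel_a)), kernel_b)
    return relu([r + s for r, s in zip(residual, x)])


x = [1.0, 2.0, 3.0]
zero = [0.0, 0.0, 0.0]

# With all-zero kernels, F(x) = 0 and the block passes x through unchanged.
print(res_block(x, zero, zero))  # [1.0, 2.0, 3.0]
```

The all-zero case shows why skip connections help very deep networks: a block can fall back to the identity mapping instead of degrading the signal.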

2. Results

2.1. Dataset

The UrbanSound8k dataset is used, which contains 10 classes of environmental sounds in urban areas, such as drilling, car horn, and children playing [13].

  • The dataset consists of 8732 audio clips of 4 seconds or less, totalling 9.7 hours.
  • The official fold 10 is used as the test set, and the rest for training and validation.
  • For computational speed, the audio waveforms are down-sampled to 8 kHz and standardized to zero mean and unit variance. The training data is shuffled, and no data augmentation is used.
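
The standardization step can be sketched as below. Whether the statistics are computed per clip or over the whole training set is not stated in this post, so the per-clip version here is an assumption, and `standardize` is a hypothetical helper name. Down-sampling is not shown.

```python
import math


def standardize(waveform):
    """Scale a waveform to zero mean and unit variance (per clip, by assumption)."""
    n = len(waveform)
    mean = sum(waveform) / n
    var = sum((s - mean) ** 2 for s in waveform) / n
    std = math.sqrt(var)
    if std == 0.0:
        # A constant (e.g., silent) clip has no variance to normalize.
        return [0.0] * n
    return [(s - mean) / std for s in waveform]


# Made-up samples standing in for a short waveform.
clip = [1.0, 2.0, 3.0, 4.0]
z = standardize(clip)

mean_z = sum(z) / len(z)
var_z = sum(v * v for v in z) / len(z)
print(mean_z, var_z)
```

After standardization, the mean is approximately 0 and the variance approximately 1, up to floating-point error.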

2.2. Performance

Test Accuracy & Training Time
  • M3 performs very poorly compared with the other models, indicating that 2-layered CNNs are insufficient to extract discriminative features from raw waveforms for sound recognition.
  • Deeper networks (M5, M11, M18, M34-res) substantially improve the performance. The test accuracy improves with increasing network depth for M5, M11, and M18.

The best model, M18, reaches 71.68% accuracy, which is competitive with the reported test accuracy of CNNs on spectrogram input using the same dataset [11].

2.3. Ablation Studies

  • Table 3: The performance of the “srf” variants, which use a small first-layer receptive field (RF) of size 8, degrades significantly, by up to 6.6%, compared with M11 and M18 with RF 80.
  • Table 4: M5-big has 2.2M parameters but only achieves 63.30% accuracy, compared with 69.07% for M11 (1.8M parameters). By using a very deep architecture, M18 outperforms M3 by as much as 15.56% in absolute accuracy, which shows that deeper architectures substantially improve acoustic modeling using waveforms.
  • Table 5: FC layers significantly increase the number of parameters and increase training time by 295%, yet they do not improve test accuracy.
  • Table 6: M18-no-bn (M18 without BN) results in lower test accuracy, indicating that BN has a regularization effect.


