Brief Review — Residual Networks Behave Like Ensembles of Relatively Shallow Networks
Removing Residual Modules Leads to Only an Insignificant Loss in Performance
Residual Networks Behave Like Ensembles of Relatively Shallow Networks,
Veit NIPS’16, by Cornell University,
2016 NIPS, Over 1000 Citations (Sik-Ho Tsang @ Medium)
- Residual networks can be rewritten as an explicit collection of paths.
- These paths show ensemble-like behavior in the sense that they do not strongly depend on each other.
Outline
- Residual Network Has Ensemble-Like Behavior
- Analytic Results
1. Residual Network Has Ensemble-Like Behavior
1.1. Preliminaries
- A residual module in ResNet is popularly formulated as:
y_i = f_i(y_(i-1)) + y_(i-1)
- where f_i() is the sequential composition of convolution (W), batch normalization (B) and ReLU (σ).
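As a rough illustration, one such residual module can be sketched in PyTorch as follows. The channel count C and the exact ordering of convolution, batch norm and ReLU inside f_i() are assumptions made for this sketch, not the exact architecture from the paper:

```python
import torch.nn as nn

# Minimal sketch of one residual module: y_i = f_i(y_(i-1)) + y_(i-1).
# The layout of the residual branch f (conv -> BN -> ReLU -> conv -> BN)
# and the channel count C are illustrative assumptions.
class ResidualModule(nn.Module):
    def __init__(self, C=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(C),
            nn.ReLU(inplace=True),
            nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(C),
        )

    def forward(self, y_prev):
        # Skip connection: the output is the input plus the residual branch.
        return y_prev + self.f(y_prev)
```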
1.2. Ensemble-Like Behavior
- The output of each residual module is thus the sum of two sub-terms: the skip term y_(i-1) and the residual term f_i(y_(i-1)).
The shared structure of the residual network then becomes apparent by unrolling the recursion into an exponential number of nested terms, expanding one layer at each substitution step:
y_3 = y_2 + f_3(y_2)
= [y_1 + f_2(y_1)] + f_3(y_1 + f_2(y_1))
= [y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0))] + f_3(y_0 + f_1(y_0) + f_2(y_0 + f_1(y_0)))
This unraveled view makes it clear that data flows along many paths from input to output. Each path is a unique configuration of which residual modules to enter and which to skip.
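As a tiny numeric sketch of this unrolling, the value of y_3 computed by the recursion equals the value computed from the fully expanded expression; f1, f2 and f3 below are arbitrary toy functions chosen only for illustration:

```python
# Toy residual branches (arbitrary functions, chosen only for illustration).
f1 = lambda x: 0.5 * x ** 2
f2 = lambda x: x + 1.0
f3 = lambda x: -0.3 * x

y0 = 2.0

# Recursive form: y_i = y_(i-1) + f_i(y_(i-1))
y1 = y0 + f1(y0)
y2 = y1 + f2(y1)
y3_recursive = y2 + f3(y2)

# Unrolled form with every substitution made explicit
inner = y0 + f1(y0) + f2(y0 + f1(y0))
y3_unrolled = inner + f3(inner)

assert abs(y3_recursive - y3_unrolled) < 1e-9
```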
1.3. Comparison to Plain Network
- In the classical plain network, each layer of processing depends only on the output of the previous layer.
However, in residual networks, each module fi() is fed data from a mixture of 2^(i-1) different distributions, generated from every possible configuration of the previous i-1 residual modules.
2. Analytic Results
2.1. Deleting individual layers from neural networks at test time
When a layer is removed, the number of paths is reduced from 2^n to 2^(n-1), so half of the paths remain valid.
Deleting any layer in VGGNet reduces performance to chance levels. Surprisingly, this is NOT the case for ResNet.
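Such a lesion study could be sketched as below. It assumes the residual modules are stored in a container called model.blocks (a hypothetical attribute name) and that each block computes x + f(x); replacing a block with nn.Identity() then gives y_i = y_(i-1), i.e. only the skip connection remains:

```python
import copy
import torch.nn as nn

def delete_module(model, index):
    """Return a copy of the model with the residual module at `index` removed.

    Assumes `model.blocks` (hypothetical name) holds the residual modules in
    an nn.Sequential / nn.ModuleList, each computing x + f(x).
    """
    lesioned = copy.deepcopy(model)
    lesioned.blocks[index] = nn.Identity()  # keep only the skip connection
    return lesioned
```

The lesioned copy is then evaluated on the test set without any retraining; for a residual network the error should increase only slightly, while the same deletion in a plain VGG-style network drops accuracy to chance.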
2.2. Deleting/Reordering many modules from residual networks at test-time
- One key characteristic of ensembles is that their performance depends smoothly on the number of members.
(a): Deleting increasing numbers of residual modules increases the error smoothly. This implies that residual networks behave like ensembles.
(b): k randomly sampled pairs of building blocks with compatible dimensionality are swapped. As corruption increases, the error smoothly increases as well.
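The re-ordering experiment can be sketched in the same way; compatible_indices below is a hypothetical list of indices of residual modules with matching input/output dimensionality (e.g. modules within the same stage):

```python
import copy
import random

def swap_k_pairs(model, k, compatible_indices):
    """Return a copy of the model with k randomly chosen pairs of
    dimension-compatible residual modules swapped (no retraining)."""
    shuffled = copy.deepcopy(model)
    for _ in range(k):
        i, j = random.sample(compatible_indices, 2)
        shuffled.blocks[i], shuffled.blocks[j] = shuffled.blocks[j], shuffled.blocks[i]
    return shuffled
```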
2.3. The importance of short paths in residual networks
Surprisingly, almost all of the gradient updates during training come from paths between 5 and 17 modules long.
These are the effective paths, even though they constitute only 0.45% of all paths through this network. Moreover, in comparison to the total length of the network, the effective paths are relatively shallow.
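The 0.45% figure follows from simple path counting: a network with n residual modules has C(n, k) paths of length k, so path lengths follow a binomial distribution. A quick check for a residual network with n = 54 modules (as in the CIFAR-10 ResNet studied in the paper):

```python
from math import comb

n = 54                      # number of residual modules
total = 2 ** n              # total number of paths
# Paths of length 5 to 17 (the "effective" paths)
effective = sum(comb(n, k) for k in range(5, 18))
print(f"{effective / total:.4%}")  # roughly 0.45% of all paths
```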
2.4. Stochastic Depth
- Removing residual modules mostly removes long paths.
Left: Even after deleting 10 residual modules, many of the effective paths between 5 and 17 modules long are still valid.
Right: During training with Stochastic Depth, a random subset of the residual modules is kept for each mini-batch and the rest are skipped. Training with Stochastic Depth improves resilience to module deletion only slightly.
- (Please feel free to read Stochastic Depth if interested.)
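A minimal sketch of the Stochastic Depth idea in PyTorch is given below; the constant survival probability p_survive and the per-module wrapper are simplifications made for this sketch (the original method decays the survival probability with depth):

```python
import torch
import torch.nn as nn

class StochasticDepthModule(nn.Module):
    """Residual module that randomly drops its residual branch during training."""

    def __init__(self, f, p_survive=0.8):
        super().__init__()
        self.f = f                  # residual branch f_i (e.g. conv -> BN -> ReLU -> conv -> BN)
        self.p_survive = p_survive  # probability that the branch is kept for a mini-batch

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return x + self.f(x)
            return x                # module dropped: only the skip connection remains
        # At test time, scale the residual branch by its survival probability.
        return x + self.p_survive * self.f(x)
```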
- Nowadays, many models have been proposed with far fewer weight layers, i.e. they are shallower.