Brief Review — Residual Networks Behave Like Ensembles of Relatively Shallow Networks

Removing Individual Residual Modules Causes Only a Small Loss in Accuracy

Sik-Ho Tsang
4 min read · Apr 23, 2023

Residual Networks Behave Like Ensembles of Relatively Shallow Networks,
Veit NIPS’16, by Cornell University,
2016 NIPS, Over 1000 Citations (Sik-Ho Tsang @ Medium)


  • Residual networks can be rewritten as an explicit collection of paths.
  • These paths show ensemble-like behavior in the sense that they do not strongly depend on each other.


  1. Residual Network Has Ensemble-Like Behavior
  2. Analytic Results

1. Residual Network Has Ensemble-Like Behavior

1.1. Preliminaries

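  • A residual module in ResNet is commonly formulated as y_i = f_i(y_{i-1}) + y_{i-1}, where y_{i-1} is the input to the i-th module, f_i is its residual function, and y_i is its output.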
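As a minimal sketch of this formulation, assuming the usual form y_i = f_i(y_{i-1}) + y_{i-1}, the forward pass through a stack of residual modules can be written as a simple loop. The function name `residual_forward` and the toy scalar modules are hypothetical and chosen only for illustration; real modules are stacks of convolutions acting on tensors.

```python
import math

def residual_forward(y0, modules):
    """Apply y_i = f_i(y_{i-1}) + y_{i-1} for each module and return all outputs."""
    outputs = [y0]
    y = y0
    for f in modules:
        y = f(y) + y          # residual branch plus skip connection
        outputs.append(y)
    return outputs

# Toy scalar modules (hypothetical stand-ins for convolutional blocks).
toy_modules = [math.tanh, lambda y: 0.5 * y, lambda y: -0.1 * y]
print(residual_forward(1.0, toy_modules))
```

Each output keeps the previous output intact and only adds a correction on top of it, which is what later makes it possible to skip a module without breaking the rest of the network.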

1.2. Ensemble-Like Behavior

An unraveled view of a 3-block residual network
  • The output of each stage is based on the combination of two sub-terms.

Thus, the shared structure of the residual network becomes apparent when the recursion is unrolled into an exponential number of nested terms, expanding one layer at each substitution step.
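For the 3-block network, the unrolling can be written out explicitly. The following is a reconstruction that assumes the notation y_i = y_{i-1} + f_i(y_{i-1}), with y_0 as the network input and y_3 as the output of the last block:

```latex
\begin{aligned}
y_3 &= y_2 + f_3(y_2) \\
    &= \bigl[y_1 + f_2(y_1)\bigr] + f_3\bigl(y_1 + f_2(y_1)\bigr) \\
    &= \bigl[y_0 + f_1(y_0) + f_2\bigl(y_0 + f_1(y_0)\bigr)\bigr]
       + f_3\bigl(y_0 + f_1(y_0) + f_2\bigl(y_0 + f_1(y_0)\bigr)\bigr)
\end{aligned}
```

Every substitution step doubles the number of terms, so n blocks yield 2^n ways of reaching the output.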

The graph makes clear that data flows along many paths from input to output. Each path is a unique configuration of which residual module to enter and which to skip.
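To make the counting concrete, the paths can be enumerated as binary choices, one per module. This is only an illustrative sketch; the helper name `enumerate_paths` is hypothetical.

```python
from itertools import product

def enumerate_paths(n_modules):
    """List every path as the tuple of module indices (1-based) that it enters."""
    paths = []
    for mask in product((False, True), repeat=n_modules):
        paths.append(tuple(i + 1 for i, entered in enumerate(mask) if entered))
    return paths

paths = enumerate_paths(3)
for p in paths:
    print(p if p else "skip every module")
print(len(paths), "paths in total")
```

For three modules this lists 2^3 = 8 paths, ranging from the path that skips every module to the path that enters all three.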

1.3. Comparison to Plain Network

Plain Network such as AlexNet and VGGNet
  • In the classical plain network, each layer of processing depends only on the output of the previous layer.

However, in residual networks, each module f_i(·) is fed data from a mixture of 2^(i-1) different distributions, generated from every possible configuration of the previous i-1 residual modules.

2. Analytic Results

2.1. Deleting individual layers from neural networks at test time

Deleting a layer in residual networks at test time (a) is equivalent to zeroing half of the paths.

When a layer is removed, the number of paths is reduced from 2^n to 2^(n-1), leaving half the number of paths valid.
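The halving can be checked numerically on a toy network. The sketch below assumes hypothetical scalar, linear modules f_i(y) = w_i * y, for which the output is exactly the sum over all paths; real residual modules are nonlinear, so this only illustrates the counting argument, not the paper's experiments.

```python
import math
from itertools import combinations

def forward(x, weights, deleted=None):
    """Run y_i = y_{i-1} + w_i * y_{i-1}; a deleted module keeps only its skip connection."""
    y = x
    for i, w in enumerate(weights):
        if i != deleted:
            y = y + w * y
    return y

def sum_over_paths(x, weights, deleted=None):
    """Sum the contributions of all paths that do not enter the deleted module."""
    total, count = 0.0, 0
    for k in range(len(weights) + 1):
        for path in combinations(range(len(weights)), k):
            if deleted in path:
                continue  # this path passes through the deleted module
            total += x * math.prod(weights[i] for i in path)
            count += 1
    return total, count

weights = [0.3, -0.2, 0.5]  # hypothetical module weights
total, count = sum_over_paths(2.0, weights, deleted=1)
print(count, math.isclose(total, forward(2.0, weights, deleted=1)))
```

With three modules, deleting one leaves 4 of the 8 paths, and their sum matches the forward pass in which that module is bypassed.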

Figure 3: Deleting individual layers from VGGNet/ResNet on Left: CIFAR-10, Right: ImageNet.

Deleting any layer in VGGNet reduces performance to chance level. Surprisingly, this is NOT the case for ResNet, where deleting a single residual module increases the error only slightly.

2.2. Deleting/Reordering many modules from residual networks at test-time

(a) Error increases smoothly when randomly deleting several modules from a residual network. (b) Error also increases smoothly when re-ordering a residual network by shuffling building blocks.
  • One key characteristic of ensembles is that their performance depends smoothly on the number of members.

(a): Deleting an increasing number of residual modules increases the error smoothly. This implies that residual networks behave like ensembles.

(b): k randomly sampled pairs of building blocks with compatible dimensionality are swapped. As corruption increases, the error smoothly increases as well.
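The two corruption procedures can be sketched as operations on a list of modules. This is an assumption-laden illustration rather than the authors' code: the function names are hypothetical, modules are represented as `(stage, index)` tuples, and "compatible dimensionality" is approximated by requiring both modules of a pair to belong to the same stage.

```python
import random

def delete_random_modules(modules, k, rng):
    """Return a copy of `modules` with k randomly chosen entries removed."""
    drop = set(rng.sample(range(len(modules)), k))
    return [m for i, m in enumerate(modules) if i not in drop]

def swap_random_pairs(modules, k, rng, stage_of):
    """Return a copy in which k randomly sampled same-stage pairs are swapped.

    Assumes at least one stage contains two or more modules; otherwise no
    compatible pair exists and the loop would not terminate.
    """
    out = list(modules)
    swaps = 0
    while swaps < k:
        i, j = rng.sample(range(len(out)), 2)
        if stage_of(out[i]) == stage_of(out[j]):
            out[i], out[j] = out[j], out[i]
            swaps += 1
    return out

rng = random.Random(0)
modules = [(stage, idx) for stage in range(3) for idx in range(4)]
pruned = delete_random_modules(modules, 5, rng)
shuffled = swap_random_pairs(modules, 3, rng, stage_of=lambda m: m[0])
print(len(pruned), shuffled)
```

Because a removed module is simply bypassed through its skip connection, dropping an entry from the list corresponds to deleting that module at test time.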

2.3. The importance of short paths in residual networks

(a) The distribution of all possible path lengths, which follows a Binomial distribution. (b) How much gradient is induced on the first layer of the network through paths of varying length; it appears to decay roughly exponentially with the number of modules the gradient passes through. (c) The product of (a) and (b), which shows how much gradient comes from all paths of a given length.
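The interplay of the three panels can be imitated with a few lines of arithmetic. The sketch assumes a network of n = 54 residual modules and a made-up per-module gradient attenuation of 0.3; both values are assumptions for illustration, and the attenuation in particular is not a measured quantity.

```python
from math import comb

n = 54        # number of residual modules (assumed)
decay = 0.3   # hypothetical gradient attenuation per module traversed

path_count = [comb(n, k) for k in range(n + 1)]                   # panel (a)
grad_per_path = [decay ** k for k in range(n + 1)]                # panel (b)
total_grad = [c * g for c, g in zip(path_count, grad_per_path)]   # panel (c)

most_common_length = max(range(n + 1), key=path_count.__getitem__)
peak_gradient_length = max(range(n + 1), key=total_grad.__getitem__)
print(most_common_length, peak_gradient_length)
```

Although most paths are about n/2 modules long, the exponential decay shifts the peak of the total gradient toward much shorter paths.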

Surprisingly, almost all of the gradient updates during training come from paths between 5 and 17 modules long.

These are the effective paths, even though they constitute only 0.45% of all paths through this network. Moreover, in comparison to the total length of the network, the effective paths are relatively shallow.
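The share of paths in a given length range follows directly from the binomial coefficients. The sketch below assumes a network with 54 residual modules; that number is an assumption added here, and the helper `fraction_of_paths` is a hypothetical name.

```python
from math import comb

def fraction_of_paths(n, shortest, longest):
    """Fraction of the 2**n paths whose length lies in [shortest, longest]."""
    in_range = sum(comb(n, k) for k in range(shortest, longest + 1))
    return in_range / 2 ** n

frac = fraction_of_paths(54, 5, 17)
print(f"{100 * frac:.2f}% of all paths are between 5 and 17 modules long")
```

Under this assumption the result comes out at roughly half a percent, in line with the 0.45% quoted above.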

2.4. Stochastic Depth

Left: Fraction of paths remaining after deleting individual layers. Right: Impact of Stochastic Depth on resilience to layer deletion.
  • Removing residual modules mostly removes long paths.

Left: Even after deleting 10 residual modules, many of the effective paths between 5 and 17 modules long are still valid.
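Why short paths survive deletion better than long ones can be seen from a simple count: a path of length k remains valid only if all k of its modules are among those still present. The sketch assumes n = 54 modules in total (an assumption), and `surviving_fraction` is a hypothetical helper name.

```python
from math import comb

def surviving_fraction(n, deleted, length):
    """Fraction of the paths of a given length that avoid all deleted modules."""
    return comb(n - deleted, length) / comb(n, length)

for length in (5, 17, 27, 40):
    print(length, round(surviving_fraction(54, 10, length), 4))
```

The surviving fraction shrinks as the path length grows and reaches zero once a path would need more modules than remain.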

Right: A random subset of the residual modules is selected for each mini-batch during training. Training with Stochastic Depth improves resilience slightly.
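The per-mini-batch sampling can be sketched as follows. This is a simplified illustration, assuming the same survival probability for every module; the function names and the toy scalar modules are hypothetical.

```python
import random

def sample_active_modules(n_modules, survival_prob, rng):
    """Pick the subset of module indices that stays active for one mini-batch."""
    return {i for i in range(n_modules) if rng.random() < survival_prob}

def forward_with_subset(y0, modules, active):
    """Run the residual stack, bypassing every module that is not active."""
    y = y0
    for i, f in enumerate(modules):
        if i in active:
            y = f(y) + y      # active module: residual branch plus skip
        # inactive module: only the skip connection, so y is unchanged
    return y

rng = random.Random(0)
toy_modules = [lambda y: 0.1 * y for _ in range(10)]  # hypothetical modules
for step in range(3):  # three mini-batches, each with its own random subset
    active = sample_active_modules(len(toy_modules), 0.5, rng)
    print(sorted(active), forward_with_subset(1.0, toy_modules, active))
```

Each mini-batch therefore trains a different, shorter sub-network, which is consistent with the trained network being somewhat more tolerant of deleted modules.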

  • Today, many proposed models use far fewer weight layers, i.e. they are shallower.


