# Brief Review — Residual Networks Behave Like Ensembles of Relatively Shallow Networks

## Removing Residual Modules Leads to Insignificant Loss Only

Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Veit NIPS’16, by Cornell University, 2016 NIPS, Over 1000 Citations (Sik-Ho Tsang @ Medium)

Image Classification: 1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]


- **Residual networks** can be rewritten as **an explicit collection of paths**.
- These paths show **ensemble-like behavior** in the sense that they do not strongly depend on each other.

# Outline

1. **Residual Network Has Ensemble-Like Behavior**
2. **Analytic Results**

**1. Residual Network Has Ensemble-Like Behavior**

## 1.1. Preliminaries

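**Residual module in ResNet**:

$$y_i = f_i(y_{i-1}) + y_{i-1}$$

- where *y(i-1)* is the input to the module and *fi*() is the sequential operation of convolution (*W*), batch norm (*B*) and ReLU (*σ*):

$$f_i(x) = W_i \cdot \sigma\big(B(W'_i \cdot \sigma(B(x)))\big)$$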

## 1.2. Ensemble-Like Behavior

- The output of each stage is based on the combination of two sub-terms.

Thus, the shared structure of the residual network becomes apparent by unrolling the recursion into an exponential number of nested terms, expanding one layer at each substitution step.
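For a network with three residual modules, the unrolling can be written out step by step. This is a sketch that assumes each module has the form y_i = y_{i-1} + f_i(y_{i-1}), with y_0 denoting the network input:

$$
\begin{aligned}
y_3 &= y_2 + f_3(y_2) \\
    &= \big[y_1 + f_2(y_1)\big] + f_3\big(y_1 + f_2(y_1)\big) \\
    &= \big[y_0 + f_1(y_0) + f_2\big(y_0 + f_1(y_0)\big)\big] + f_3\big(y_0 + f_1(y_0) + f_2\big(y_0 + f_1(y_0)\big)\big)
\end{aligned}
$$

Each substitution replaces one y_i by its two sub-terms, so the number of nested terms doubles with every module.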

The graph makes clear that data flows along many paths from input to output. Each path is a unique configuration of which residual module to enter and which to skip.
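To make the "enter or skip" view concrete, the paths can be enumerated as binary configurations. The following is only an illustrative sketch; the function name `enumerate_paths` and the labels `f1`, `f2`, ... are hypothetical and not from the paper:

```python
from itertools import product


def enumerate_paths(n_modules):
    """Return every path as a tuple of 0/1 flags, one flag per residual module.

    A flag of 1 means the path enters that module's residual branch,
    a flag of 0 means it takes the skip connection instead.
    """
    return list(product((0, 1), repeat=n_modules))


paths = enumerate_paths(3)
for p in paths:
    # List the residual branches this particular path passes through.
    entered = [f"f{i + 1}" for i, flag in enumerate(p) if flag]
    print(p, "->", " -> ".join(["input"] + entered + ["output"]))

print(len(paths))  # 2**3 = 8 paths for 3 modules
```

With n modules there are 2^n such configurations, which is why the number of paths grows exponentially with depth.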

## 1.3. Comparison to Plain Network

- In the classical plain network,
**each layer of processing depends only on the output of the previous layer**.

However, in residual networks, each module *fi*() in the residual network is fed data from a mixture of 2^(*i*-1) different distributions generated from every possible configuration of the previous *i*-1 residual modules.

**2. Analytic Results**

## 2.1. Deleting individual layers from neural networks at test time

When a layer is removed, the number of paths is reduced from 2^*n* to 2^(*n*-1), leaving half the number of paths valid.

Deleting any layer in VGGNet reduces performance to chance levels. Surprisingly, this is NOT the case for ResNet.
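Deleting a residual module at test time can be sketched as keeping only its skip connection. The toy below uses scalar "modules" (a hypothetical weight followed by ReLU) purely for illustration; it is not the paper's experimental setup, and all names are assumptions:

```python
import random


def residual_forward(x, modules, deleted=()):
    """Run a stack of residual modules, y_i = y_{i-1} + f_i(y_{i-1}).

    Modules whose index is in `deleted` are replaced by the identity,
    i.e. only their skip connection is kept.
    """
    y = x
    for i, f in enumerate(modules):
        if i in deleted:
            continue  # skip connection only: y_i = y_{i-1}
        y = y + f(y)
    return y


def make_module(weight):
    # Hypothetical toy module: scalar weight followed by ReLU.
    return lambda y: max(weight * y, 0.0)


def count_valid_paths(n_modules, n_deleted):
    # Every remaining module can still be either entered or skipped.
    return 2 ** (n_modules - n_deleted)


rng = random.Random(0)
modules = [make_module(rng.uniform(-0.2, 0.2)) for _ in range(6)]

full = residual_forward(1.0, modules)
pruned = residual_forward(1.0, modules, deleted={2})
print(full, pruned)
print(count_valid_paths(6, 0), count_valid_paths(6, 1))  # 64 32
```

In a plain network there is no skip connection to fall back on, so removing a layer leaves no valid path from input to output, whereas here the input still reaches the output through the remaining paths.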

## 2.2. Deleting/Reordering many modules from residual networks at test-time

- One key characteristic of **ensembles** is that their **performance depends smoothly on the number of members**.

(a): Deleting increasing numbers of residual modules increases error smoothly. This implies residual networks behave like ensembles.

(b): k randomly sampled pairs of building blocks with compatible dimensionality are swapped. As corruption increases, the error smoothly increases as well.
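The reordering in (b) can be sketched as follows. This is an assumption-laden illustration: blocks are represented as hypothetical `(name, width)` tuples, and two blocks are treated as compatible when their widths match, which may be simpler than the criterion used in the paper:

```python
import random


def swap_random_pairs(blocks, k, seed=0):
    """Return a copy of `blocks` with k randomly sampled compatible pairs swapped.

    `blocks` is a list of (name, width) tuples. Swapping two blocks of equal
    width keeps the width at every position unchanged, so the candidate
    pairs stay valid after each swap.
    """
    rng = random.Random(seed)
    shuffled = list(blocks)
    candidates = [
        (i, j)
        for i in range(len(shuffled))
        for j in range(i + 1, len(shuffled))
        if shuffled[i][1] == shuffled[j][1]
    ]
    for _ in range(k):
        # Raises IndexError if k > 0 and no compatible pair exists.
        i, j = rng.choice(candidates)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


blocks = [("b1", 16), ("b2", 16), ("b3", 32), ("b4", 32), ("b5", 32)]
print(swap_random_pairs(blocks, k=2))
```

Increasing `k` corresponds to increasing the amount of corruption applied to the module order.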

## 2.3. The importance of short paths in residual networks

Surprisingly, almost all of the gradient updates during training come from paths between 5 and 17 modules long. These are the effective paths, even though they constitute only 0.45% of all paths through this network. Moreover, in comparison to the total length of the network, the effective paths are relatively shallow.
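The share of paths in a given length range follows from counting: a path of length k enters exactly k of the n modules, so there are C(n, k) such paths out of 2^n. The sketch below assumes a network with 54 residual modules; that number is not stated in this review and is taken here as an assumption about the network analysed in the paper:

```python
from math import comb


def fraction_of_paths(n_modules, min_len, max_len):
    """Fraction of the 2**n paths whose length lies in [min_len, max_len].

    A path of length k enters exactly k of the n residual modules,
    so there are comb(n, k) paths of that length.
    """
    in_range = sum(comb(n_modules, k) for k in range(min_len, max_len + 1))
    return in_range / 2 ** n_modules


# Assumption: 54 residual modules (not given in the text above).
frac = fraction_of_paths(54, 5, 17)
print(f"{100 * frac:.2f}% of all paths have length 5 to 17")
```

Under that assumption the result comes out close to the 0.45% quoted above, since the path lengths are binomially distributed around n/2 and lengths of 5 to 17 sit far out in the lower tail.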

## 2.4. **Stochastic Depth**

- Removing residual modules mostly removes long paths.
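Why deletion hits long paths hardest can be checked with the same counting argument: a path survives only if none of the modules it enters has been deleted. The sketch below again assumes 54 residual modules (an assumption, not stated in this review) and 10 deleted modules; the function name is hypothetical:

```python
from math import comb


def surviving_fraction(n_modules, n_deleted, length):
    """Fraction of the paths of a given length that avoid all deleted modules.

    A surviving path of length k must pick its k modules from the
    n - d modules that remain.
    """
    return comb(n_modules - n_deleted, length) / comb(n_modules, length)


n, d = 54, 10  # n = 54 is an assumption; d = 10 deleted modules
for length in (5, 17, 27, 40):
    print(length, surviving_fraction(n, d, length))
```

The surviving fraction shrinks as the path length grows, so short paths are far more likely to remain valid than long ones.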

Left: Even after deleting 10 residual modules, many of the effective paths between 5 and 17 modules long are still valid.

Right: A random subset of the residual modules is selected for each mini-batch during training. Training with Stochastic Depth improves resilience slightly.
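The per-mini-batch selection can be sketched as below. This is a simplified illustration only: it uses a single constant `survival_prob` for every module and scalar toy modules, both of which are assumptions of this sketch, and it omits other details of the actual Stochastic Depth method:

```python
import random


def sample_active_modules(n_modules, survival_prob, rng):
    """Pick the subset of residual modules used for one mini-batch."""
    return [i for i in range(n_modules) if rng.random() < survival_prob]


def stochastic_depth_forward(x, modules, active):
    """Forward pass in which inactive modules reduce to their skip connection."""
    y = x
    for i, f in enumerate(modules):
        if i in active:
            y = y + f(y)
        # Inactive modules pass y through unchanged.
    return y


rng = random.Random(0)
modules = [lambda y: 0.1 * y for _ in range(8)]  # hypothetical toy modules
for step in range(3):
    # A fresh random subset is drawn for every mini-batch.
    active = set(sample_active_modules(len(modules), 0.5, rng))
    print(step, sorted(active), stochastic_depth_forward(1.0, modules, active))
```

Because the network is trained on randomly shortened versions of itself, it sees module deletions during training, which is consistent with the slightly improved resilience noted above.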

- (Please feel free to read Stochastic Depth if interested.)

- Right now, there are **many models** proposed with far **fewer weight layers**, i.e. **shallower**.