Brief Review — VanillaNet: the Power of Minimalism in Deep Learning

VanillaNet, 6-Layer Model, More Efficient Than ConvNeXt V2

Sik-Ho Tsang
6 min read · Dec 10, 2024

VanillaNet: the Power of Minimalism in Deep Learning
VanillaNet, by Huawei Noah’s Ark Lab, and University of Sydney
2023 NeurIPS, Over 100 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2024 [FasterViT] [CAS-ViT] [TinySaver] [Fast Vision Transformer (FViT)] [MogaNet] [RDNet] [Logarithmic Lenses]
==== My Other Paper Readings Are Also Over Here ====

  • VanillaNet is proposed, which avoids high depth, shortcuts, and intricate operations such as self-attention. Each layer is carefully crafted to be compact and straightforward, and the extra nonlinear activation functions used during training are pruned afterwards to restore the original plain architecture.

Outline

  1. VanillaNet
  2. Results

1. VanillaNet

1.1. Model Architecture

VanillaNet-6
VanillaNet Variants
  • For the stem, VanillaNet utilizes a 4 × 4 × 3 × C convolutional layer with stride 4, following the popular settings in ResNet and ConvNeXt, to map the images with 3 channels to features with C channels.
  • At stages 1, 2 and 3, a max-pooling layer with stride 2 is used to reduce the size of the feature map, while the number of channels is doubled.
  • At stage 4, VanillaNet does not increase the number of channels, as it is followed by an average pooling layer.
  • The last layer is a fully connected layer to output the classification results.
  • The kernel size of each convolutional layer is 1 × 1. An activation function (ReLU is used) is applied after each 1 × 1 convolutional layer. To ease the training procedure of the network, batch normalization is also added after each layer.
  • For VanillaNet variants with different numbers of layers, extra blocks are added in each stage (a code sketch of the layout is given after this list).
  • No shortcut is used.
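
To make the layout above concrete, below is a minimal PyTorch sketch of a VanillaNet-6-style network following this description: a 4 × 4 stride-4 stem, one 1 × 1 conv + BN + activation block per stage, max pooling with doubled channels for stages 1–3, average pooling with unchanged channels for stage 4, a fully connected head, and no shortcuts. The width C, the plain ReLU, and the class/function names are assumptions for illustration; this is not the official implementation, which also uses the deep training strategy and series-informed activation described below.

import torch
import torch.nn as nn

def conv_block(cin, cout):
    # Basic VanillaNet block: 1x1 convolution -> batch norm -> activation
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class VanillaNet6Sketch(nn.Module):
    def __init__(self, num_classes=1000, C=128):  # C is a hypothetical width
        super().__init__()
        # Stem: 4x4 convolution with stride 4 maps 3 channels to C channels
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, kernel_size=4, stride=4),
            nn.BatchNorm2d(C),
            nn.ReLU(inplace=True),
        )
        # Stages 1-3: halve the feature map with max pooling, double the channels
        self.stage1 = nn.Sequential(conv_block(C, 2 * C), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(conv_block(2 * C, 4 * C), nn.MaxPool2d(2))
        self.stage3 = nn.Sequential(conv_block(4 * C, 8 * C), nn.MaxPool2d(2))
        # Stage 4: channels unchanged, followed by global average pooling
        self.stage4 = nn.Sequential(conv_block(8 * C, 8 * C), nn.AdaptiveAvgPool2d(1))
        # Final fully connected layer outputs the classification results
        self.head = nn.Linear(8 * C, num_classes)

    def forward(self, x):
        x = self.stem(x)      # no shortcuts anywhere in the network
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        return self.head(x.flatten(1))

# Example: a 224x224 input becomes 56x56 after the stem, then 28/14/7 per stage.
logits = VanillaNet6Sketch()(torch.randn(1, 3, 224, 224))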

1.2. Deep Training Strategy

The main idea of the deep training strategy is to train two convolutional layers with an activation function in between, instead of a single convolutional layer, at the beginning of the training procedure. The activation function is gradually reduced to an identity mapping as the number of training epochs increases.

At the end of training, the two convolutions can be easily merged into one convolution to reduce the inference time.

  • For an activation function A(x) (which can be a usual function such as ReLU or Tanh), it is combined with an identity mapping, which can be formulated as: A′(x) = (1 − λ)·A(x) + λ·x,
  • where λ = e/E, and the current epoch and the total number of deep training epochs are denoted as e and E, respectively.
  • When training converges (λ = 1), A′(x) = x, which means the two convolutional layers no longer have an activation function in the middle. (A sketch of this schedule is given after this list.)
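
A small sketch of this deep-training activation, assuming the linear schedule λ = e/E described above, is given below in PyTorch. The class name DeepTrainAct and the set_epoch helper are illustrative, not taken from the paper's code.

import torch
import torch.nn as nn

class DeepTrainAct(nn.Module):
    # Activation placed between the two convolutions during deep training:
    # A'(x) = (1 - lambda) * A(x) + lambda * x, with lambda = e / E.
    def __init__(self, act=None):
        super().__init__()
        self.act = act if act is not None else nn.ReLU()
        self.lam = 0.0  # lambda, updated once per epoch

    def set_epoch(self, e, E):
        # Linear schedule: lambda goes from 0 (full non-linearity) at epoch 0
        # to 1 (identity mapping) after E deep-training epochs.
        self.lam = min(e / E, 1.0)

    def forward(self, x):
        return (1.0 - self.lam) * self.act(x) + self.lam * x

# During training the block is conv1 -> DeepTrainAct -> conv2 (plus BN);
# once lambda reaches 1 the activation is the identity, so the two
# convolutions can be merged into one for inference.
act = DeepTrainAct()
act.set_epoch(e=150, E=300)          # halfway: lambda = 0.5
y = act(torch.randn(2, 64, 8, 8))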

First, every batch normalization layer and its preceding convolution are converted into a single convolution.

  • The scale, shift, mean and variance in batch normalization are represented as γ, β, μ, σ, respectively. The merged weight and bias (per output channel) are: W′ = (γ/σ)·W and B′ = (B − μ)·γ/σ + β.
  • With the convolution written as a matrix multiplication via im2col: y = W∗x = W·im2col(x).
  • With the weight matrices of the two convolution layers denoted as W1 and W2, the two convolutions without an activation function in between become: y = W1∗(W2∗x) = (W1·W2)∗x, i.e. a single convolution whose weight is the matrix product W1·W2.

Therefore, the two 1 × 1 convolutions can be merged into a single convolution without slowing down inference.
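
These two merging steps can be checked numerically. The PyTorch sketch below folds a batch normalization layer into its preceding convolution and merges two 1 × 1 convolutions into one; the helper names fold_bn and merge_1x1 are made up here, and plain 1 × 1 convolutions with bias are assumed.

import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold BN (gamma, beta, mu, sigma) into the preceding convolution:
    # W' = (gamma / sigma) * W,  B' = (B - mu) * gamma / sigma + beta
    sigma = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / sigma
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

@torch.no_grad()
def merge_1x1(conv1: nn.Conv2d, conv2: nn.Conv2d) -> nn.Conv2d:
    # Merge conv2(conv1(x)) for two 1x1 convolutions: the merged weight is the
    # matrix product of the two weight matrices, the merged bias is W2 @ B1 + B2.
    w1 = conv1.weight.flatten(1)   # shape (mid, in)
    w2 = conv2.weight.flatten(1)   # shape (out, mid)
    merged = nn.Conv2d(conv1.in_channels, conv2.out_channels, kernel_size=1, bias=True)
    merged.weight.copy_((w2 @ w1).view(conv2.out_channels, conv1.in_channels, 1, 1))
    merged.bias.copy_(w2 @ conv1.bias + conv2.bias)
    return merged

# Quick check that the merged convolution matches the two-layer stack.
c1, c2 = nn.Conv2d(16, 32, 1), nn.Conv2d(32, 8, 1)
x = torch.randn(1, 16, 7, 7)
print(torch.allclose(c2(c1(x)), merge_1x1(c1, c2)(x), atol=1e-5))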

1.3. Series Informed Activation Function

  • In fact, there are two ways to improve the non-linearity of a neural network: stacking non-linear activation layers or increasing the non-linearity of each activation layer.
  • Serially stacking activation layers is the key idea behind deep networks; this paper instead stacks the activation functions concurrently, increasing the non-linearity of each activation layer.
  • With A(x) being a usual function such as ReLU or Tanh, the concurrent stacking of A(x) can be formulated as: As(x) = Σ_{i=1..n} ai·A(x + bi),
  • where n denotes the number of stacked activation functions, and ai, bi are the scale and bias of each activation, which avoid simple accumulation. The non-linearity of the activation function can be largely enhanced by concurrent stacking.
  • To further enrich the approximation ability of the series, the series-based function is enabled to learn global information by taking the inputs’ neighbors into account. Thus, the activation function is formulated as: As(x_{h,w,c}) = Σ_{i,j ∈ {−n,…,n}} a_{i,j,c}·A(x_{i+h,j+w,c} + b_c), where (h, w, c) indexes the spatial position and channel of the input feature.
  • It is easy to see that when n = 0, the series based activation function As(x) degenerates to the plain activation function A(x).
  • The computation complexity of a convolution is:
  • The computation cost of its series activation layer is:
  • Therefore:
  • Taking the 4th stage in VanillaNet-B as an example, where Cout = 2048, k = 1 and n = 7, the ratio is about 84. In conclusion, the computation cost of the proposed activation function is still much lower than that of the convolutional layers. (A sketch of this activation is given after this list.)
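
One way to read the series-informed activation is as the base activation A(x + b_c), followed by a learned per-channel (depthwise) aggregation over a (2n + 1) × (2n + 1) neighborhood with weights a_{i,j,c}. The sketch below follows that reading; the class name SeriesAct, the default n = 3, and the weight initialization are assumptions for illustration, not the official code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SeriesAct(nn.Module):
    # Series-informed activation sketch: apply A(x + b_c) per channel, then
    # aggregate each activated pixel with its spatial neighbours using learned
    # per-channel weights a_{i,j,c}, implemented as a depthwise convolution.
    def __init__(self, channels, n=3):
        super().__init__()
        self.n = n
        self.channels = channels
        # per-channel bias b_c applied before the base activation
        self.bias = nn.Parameter(torch.zeros(channels))
        # a_{i,j,c}: one (2n+1) x (2n+1) kernel per channel (depthwise)
        self.weight = nn.Parameter(
            torch.randn(channels, 1, 2 * n + 1, 2 * n + 1) * 0.01)

    def forward(self, x):
        x = F.relu(x + self.bias.view(1, -1, 1, 1))
        return F.conv2d(x, self.weight, padding=self.n, groups=self.channels)

# With n = 0 the kernel is 1x1, so only a per-channel scale and bias remain
# around the base activation (the degenerate case noted in the text).
y = SeriesAct(channels=64, n=3)(torch.randn(2, 64, 14, 14))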

2. Results

2.1. Ablation Studies

Number of series
  • When n = 0, the activation function degenerates into the plain ReLU activation function. The network can only achieve a 60.53% top-1 accuracy on the ImageNet dataset.

When n = 1, the network achieves a 74.53% accuracy, which is a huge improvement compared with 60.53%. n = 3 gives a good balance between top-1 accuracy and latency.

Different Networks

The original VanillaNet achieves a 75.23% top-1 accuracy, which is the baseline. By using the deep training technique, the proposed VanillaNet can achieve a 76.36% accuracy.

  • By applying the proposed deep training and series activation function, the performance of AlexNet can be improved by about 6%.
  • For ResNet-50, whose architecture is relatively complex, the performance gain is small.
Adding Shortcuts
  • Using shortcuts, regardless of the type of shortcut, brings little improvement to the performance of the proposed VanillaNet. This suggests that the bottleneck of vanilla networks is not the identity mapping, but the weak non-linearity.

2.2. Visualization

Visualization

For VanillaNet with a depth of only 9, the active region is much larger than that of deep networks.

2.3. SOTA Comparisons

SOTA Comparisons on ImageNet

As shown in Table 4, VanillaNet-9 achieves a 79.87% accuracy with only 2.91 ms inference latency on GPU, which is over 50% faster than ResNet-50 and ConvNeXt V2-P with similar performance.

  • The proposed VanillaNet-13–1.5׆ achieves an 83.11% Top-1 accuracy on ImageNet.

It is suggested that we may not need deep and complex networks for image classification, since scaling up VanillaNets can achieve performance similar to that of deep networks.

Accuracy vs Inference Speed

The proposed VanillaNet achieves the best speed-accuracy trade-off among all these architectures with low GPU latency.

MS COCO

The proposed VanillaNet successfully achieves performance similar to that of the ConvNeXt and Swin backbones.
