Review — RepVGG: Making VGG-style ConvNets Great Again

RepVGG, A Plain Network, Outperforms RegNetX, EfficientNet, ResNeXt, ResNet

7 min read · Jul 21, 2022

**Top-1 accuracy on ImageNet vs. actual speed**

RepVGG: Making VGG-style ConvNets Great Again
RepVGG, by Tsinghua University, MEGVII Technology, Hong Kong University of Science and Technology, and Aberystwyth University
2021 CVPR, Over 200 Citations (Sik-Ho Tsang @ Medium)
Image Classification, VGGNet, ResNet

By using a structural re-parameterization technique, the RepVGG architecture resembles a ResNet during training and a VGGNet during inference, as shown above.
Thus, during inference, RepVGG becomes a plain stack of 3×3 convolutions and ReLUs, which gives it a fast inference time.

Outline

Problems of Multi-Branch Models
RepVGG
Experimental Results

1. Problems of Multi-Branch Models

1.1. Speed

**Speed Test with Varying Kernel Size on GTX 1080 Ti**

The theoretical computational density of 3×3 conv is around 4× that of the other kernel sizes, suggesting that total theoretical FLOPs are not a reliable proxy for actual speed across different architectures.
For example, VGG-16 has 8.4× the FLOPs of EfficientNet-B3 but runs 1.8× faster on a 1080 Ti.
However, multi-branch topology is widely adopted in Inception and auto-generated architectures, where multiple small operators are used instead of a few large ones.
The number of fragmented operators in one NASNet-A building block is 13, which is unfriendly to devices with strong parallel computing power such as GPUs.

1.2. Memory

**Peak memory occupation in residual and plain model**

The multi-branch topology is memory-inefficient because the result of every branch needs to be kept until the addition or concatenation, significantly raising the peak memory occupation.
The above figure shows that the input to a residual block needs to be kept until the addition. Assuming the block maintains the feature map size, the peak value of extra memory occupation is 2× that of the input.

2. RepVGG

2.1. Overall

**Different Architectures are Used During Training and Inference for RepVGG**

(a) ResNet: It has a multi-path topology during both training and inference, which makes it slow and memory-inefficient.
(b) RepVGG Training: It has a multi-path topology only during training.
(c) RepVGG Inference: It has a single-path topology during inference, which gives it a fast inference time.

2.2. Training-time Multi-branch Architecture

One explanation for the success of ResNets is that the multi-branch architecture makes the model an implicit ensemble of numerous shallower models.
Specifically, with n blocks, the model can be interpreted as an ensemble of 2^n models, since every block branches the flow into two paths.
Since the multi-branch topology has drawbacks for inference but the branches seem beneficial to training, multiple branches are used only at training time to form an ensemble of numerous models.

RepVGG uses a ResNet-like identity branch (only if the dimensions match) and a 1×1 branch, so that the training-time information flow of a building block is y=x+g(x)+f(x), as in (b), where f is the 3×3 conv branch and g is the 1×1 conv branch.

The model becomes an ensemble of 3^n members with n such blocks.
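To make the 2^n versus 3^n count concrete, here is a small sketch that enumerates every way of picking one branch per block. The function name `count_paths` is only illustrative, and the sketch assumes every block really has the stated number of branches (in RepVGG the identity branch exists only when the dimensions match).

```python
from itertools import product

def count_paths(n_blocks, branches_per_block):
    """Enumerate every way of picking one branch in each of n_blocks blocks."""
    return sum(1 for _ in product(range(branches_per_block), repeat=n_blocks))

for n in (1, 2, 5, 10):
    two_branch = count_paths(n, 2)    # ResNet-like: identity or residual branch
    three_branch = count_paths(n, 3)  # RepVGG-like: identity, 1x1 or 3x3 branch
    # The brute-force enumeration agrees with the closed forms 2^n and 3^n.
    assert two_branch == 2 ** n and three_branch == 3 ** n
    print(f"n={n:2d}  2-branch: {two_branch:5d}  3-branch: {three_branch:6d}")
```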

2.3. Re-param for Plain Inference-time Model

**Structural re-parameterization of a RepVGG block**

Note that BN is used in each branch before the addition.
Let W(3), of size C2×C1×3×3, denote the kernel of a 3×3 conv layer with C1 input channels and C2 output channels, and let W(1), of size C2×C1, denote the kernel of the 1×1 branch.
μ(3), σ(3), γ(3), β(3) are the accumulated mean, standard deviation, learned scaling factor and bias of the BN layer following the 3×3 conv.
μ(1), σ(1), γ(1), β(1) are the corresponding parameters of the BN following the 1×1 conv, and μ(0), σ(0), γ(0), β(0) are those of the identity branch.
Let M(1), of size N×C1×H1×W1, and M(2), of size N×C2×H2×W2, be the input and output, respectively, and let * be the convolution operator.
If C1=C2, H1=H2, W1=W2, we have:

$$M^{(2)} = \mathrm{bn}(M^{(1)} * W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}) + \mathrm{bn}(M^{(1)} * W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}) + \mathrm{bn}(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)})$$

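where bn is the inference-time BN function, applied to each output channel i (1 ≤ i ≤ C2):

$$\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M_{:,i,:,:} - \mu_i)\frac{\gamma_i}{\sigma_i} + \beta_i$$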

2.3.1. BN Merging With Conv

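Every BN and its preceding conv layer are first converted into a conv with a bias vector. Let {W′, b′} be the kernel and bias after conversion:

$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \qquad b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i$$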

Then the inference-time bn becomes:

$$\mathrm{bn}(M * W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M * W')_{:,i,:,:} + b'_i$$
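The conv-BN conversion can be checked numerically. The following is a minimal pure-Python sketch, not the authors' code: `fuse_conv_bn` and `conv_at` are hypothetical helper names, tensors are nested lists, and σ is used directly as the standard deviation as in the text (practical BN layers typically use the square root of the variance plus a small epsilon).

```python
import random

def fuse_conv_bn(W, mu, sigma, gamma, beta):
    """Fold per-channel BN statistics into a conv kernel W of shape [C2][C1][k][k]."""
    W_fused, b_fused = [], []
    for i, kernel in enumerate(W):
        scale = gamma[i] / sigma[i]
        W_fused.append([[[w * scale for w in row] for row in plane] for plane in kernel])
        b_fused.append(beta[i] - mu[i] * scale)
    return W_fused, b_fused

def conv_at(patch, kernel):
    """Response of one output channel at one position: sum over C1 x k x k."""
    return sum(w * v
               for plane_w, plane_x in zip(kernel, patch)
               for row_w, row_x in zip(plane_w, plane_x)
               for w, v in zip(row_w, row_x))

random.seed(0)
C1, C2, K = 3, 2, 3

def rand():
    return random.uniform(-1.0, 1.0)

W = [[[[rand() for _ in range(K)] for _ in range(K)] for _ in range(C1)] for _ in range(C2)]
patch = [[[rand() for _ in range(K)] for _ in range(K)] for _ in range(C1)]
mu = [rand() for _ in range(C2)]
sigma = [random.uniform(0.5, 2.0) for _ in range(C2)]  # kept positive
gamma = [rand() for _ in range(C2)]
beta = [rand() for _ in range(C2)]

W_f, b_f = fuse_conv_bn(W, mu, sigma, gamma, beta)
for i in range(C2):
    conv_then_bn = (conv_at(patch, W[i]) - mu[i]) * gamma[i] / sigma[i] + beta[i]
    fused = conv_at(patch, W_f[i]) + b_f[i]
    assert abs(conv_then_bn - fused) < 1e-9  # both routes give the same response
```

Because convolution is linear, checking a single receptive-field patch per output channel is enough to illustrate the identity.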

2.3.2. Merging All Branches

This transformation also applies to the identity branch because an identity can be viewed as a 1×1 conv with an identity matrix as the kernel.
After such transformations, we will have one 3×3 kernel, two 1×1 kernels, and three bias vectors.
The final bias is obtained by adding up the three bias vectors, and the final 3×3 kernel by adding the 1×1 kernels onto the central point of the 3×3 kernel. This can be easily implemented by first zero-padding the two 1×1 kernels to 3×3 and then adding the three kernels up, as shown in the figure above.
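The branch merge can be sketched in the same spirit. This example assumes the three branches have already been converted to conv-plus-bias form, so the identity branch appears as a 1×1 kernel `W0` that is a per-channel scaled identity matrix. The helpers `conv2d`, `as_1x1` and `merge_branches` are hypothetical names written for this illustration (stride 1, single image, nested lists), not part of any library.

```python
import random

def conv2d(x, W, b, pad):
    """Stride-1 conv. x: [C1][H][Wd], W: [C2][C1][k][k], b: [C2], zero padding."""
    C1, H, Wd = len(x), len(x[0]), len(x[0][0])
    k = len(W[0][0])
    out = []
    for o, kernel in enumerate(W):
        plane = []
        for r in range(H + 2 * pad - k + 1):
            row = []
            for c in range(Wd + 2 * pad - k + 1):
                acc = b[o]
                for ci in range(C1):
                    for dr in range(k):
                        for dc in range(k):
                            rr, cc = r + dr - pad, c + dc - pad
                            if 0 <= rr < H and 0 <= cc < Wd:  # outside = zero padding
                                acc += kernel[ci][dr][dc] * x[ci][rr][cc]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

def as_1x1(M):
    """View a [C2][C1] matrix as a [C2][C1][1][1] conv kernel."""
    return [[[[w]] for w in row] for row in M]

def merge_branches(W3, b3, W1, b1, W0, b0):
    """Add both 1x1 kernels onto the centre of the 3x3 kernel and sum the biases."""
    C2, C1 = len(W3), len(W3[0])
    W = [[[row[:] for row in W3[o][i]] for i in range(C1)] for o in range(C2)]
    for o in range(C2):
        for i in range(C1):
            W[o][i][1][1] += W1[o][i] + W0[o][i]
    b = [b3[o] + b1[o] + b0[o] for o in range(C2)]
    return W, b

random.seed(0)
C, H, Wd = 2, 4, 4

def rand():
    return random.uniform(-1.0, 1.0)

x = [[[rand() for _ in range(Wd)] for _ in range(H)] for _ in range(C)]
W3 = [[[[rand() for _ in range(3)] for _ in range(3)] for _ in range(C)] for _ in range(C)]
W1 = [[rand() for _ in range(C)] for _ in range(C)]
scale0 = [rand() for _ in range(C)]  # gamma/sigma of the identity-branch BN
W0 = [[scale0[o] if o == i else 0.0 for i in range(C)] for o in range(C)]
b3 = [rand() for _ in range(C)]
b1 = [rand() for _ in range(C)]
b0 = [rand() for _ in range(C)]

# Training-time view: three parallel branches summed.
y3 = conv2d(x, W3, b3, pad=1)
y1 = conv2d(x, as_1x1(W1), b1, pad=0)
y0 = conv2d(x, as_1x1(W0), b0, pad=0)

# Inference-time view: one 3x3 conv with the merged kernel and bias.
W, b = merge_branches(W3, b3, W1, b1, W0, b0)
y = conv2d(x, W, b, pad=1)

max_err = max(abs(y[o][r][c] - (y3[o][r][c] + y1[o][r][c] + y0[o][r][c]))
              for o in range(C) for r in range(H) for c in range(Wd))
assert max_err < 1e-9
```

The 3×3 branch uses padding 1 and the 1×1 branches use padding 0, so all three outputs share the same spatial size, which is what allows the kernels to be summed.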

2.4. Architectural Specification

**Architectural specification of RepVGG. Here 2×64a means stage2 has 2 layers each with 64a channels**

The 3×3 layers are arranged into 5 stages, and the first layer of each stage downsamples with stride 2. For image classification, global average pooling followed by a fully-connected layer is used as the head. For other tasks, task-specific heads can be used on the features produced by any layer.
The five stages have 1, 2, 4, 14 and 1 layers, respectively, to construct an instance named RepVGG-A.
A deeper RepVGG-B is also built, which has 2 more layers in each of stages 2, 3 and 4.

**RepVGG models defined by multipliers a and b**

Different variants are produced by using different a and b.
Multiplier a is used to scale the first four stages and b is used for the last stage, with b>a.
To further reduce parameters and computation, groupwise 3×3 conv layers may be interleaved with dense ones to trade accuracy for efficiency. Specifically, the number of groups g is set for the 3rd, 5th, 7th, …, 21st layers of RepVGG-A and the additional 23rd, 25th and 27th layers of RepVGG-B. For simplicity, g is set globally to 1, 2 or 4 for such layers without layer-wise tuning.
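The specification can be written down as a few lines of arithmetic. In this sketch the stage-2 width 64a comes from the caption above, while the remaining base widths (min(64, 64a), 128a, 256a, 512b) and the example multipliers a=0.75, b=2.5 are recalled from the paper's tables and should be treated as assumptions here. The names `stage_widths` and `groupwise_layers` are made up for the illustration.

```python
DEPTHS_A = [1, 2, 4, 14, 1]
# RepVGG-B: two more layers in each of stages 2, 3 and 4.
DEPTHS_B = [d + 2 if i in (1, 2, 3) else d for i, d in enumerate(DEPTHS_A)]

def stage_widths(a, b):
    """Channels per stage; a scales the first four stages, b the last (assumed base widths)."""
    return [min(64, int(64 * a)), int(64 * a), int(128 * a), int(256 * a), int(512 * b)]

def groupwise_layers(depths):
    """1-based indices of the layers that may use groups: 3rd, 5th, 7th, ..."""
    return list(range(3, sum(depths), 2))

print("RepVGG-A depth:", sum(DEPTHS_A), "layers; RepVGG-B depth:", sum(DEPTHS_B), "layers")
print("widths for a=0.75, b=2.5:", stage_widths(0.75, 2.5))
print("groupwise layers (A):", groupwise_layers(DEPTHS_A))
print("extra groupwise layers (B):", groupwise_layers(DEPTHS_B)[len(groupwise_layers(DEPTHS_A)):])
```

Under these assumptions RepVGG-A has 22 conv layers and RepVGG-B has 28, which is consistent with the 21st and 27th layers being the last groupwise candidates.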

3. Experimental Results

3.1. RepVGG for ImageNet Classification

**Results trained on ImageNet with simple data augmentation in 120 epochs**

RepVGG-A0 is 1.25% and 33% better than ResNet-18 in terms of accuracy and speed, respectively; RepVGG-A1 is 0.29%/64% better than ResNet-34; and RepVGG-A2 is 0.17%/83% better than ResNet-50.
With interleaved groupwise layers (g2/g4), the RepVGG models are further accelerated with a reasonable decrease in accuracy: RepVGG-B1g4 is 0.37%/101% better than ResNet-101, and RepVGG-B1g2 is impressively 2.66× as fast as ResNet-152 with the same accuracy.
Though the number of parameters is not the primary concern, all the above RepVGG models are more parameter-efficient than ResNets.
Compared to the classic VGG-16, RepVGG-B2 has only 58% of the parameters, runs 10% faster and shows 6.57% higher accuracy.

**Results on ImageNet trained in 200 epochs with Autoaugment, label smoothing and mixup**

RepVGG-A2 is 1.37%/59% better than EfficientNet-B0 in accuracy/speed, and RepVGG-B1 performs 0.39% better than RegNetX-3.2GF while running slightly faster.
Notably, RepVGG models reach above 80% accuracy with 200 epochs.

3.2. Ablation Study

**Ablation studies with 120 epochs on RepVGG-B0**

With both branches removed, the training-time model degrades into an ordinary plain model and only achieves 72.39% accuracy.
The accuracy is lifted to 73.15% with only the 1×1 branch or 74.79% with only the identity branch.
The accuracy of the full-featured RepVGG-B0 is 75.14%, which is 2.75% higher than the ordinary plain model.
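A quick piece of arithmetic on the ablation figures quoted above shows how much each branch contributes over the plain model. The dictionary keys are just labels chosen for this sketch.

```python
# Top-1 accuracies (%) quoted in the ablation study above.
acc = {"plain": 72.39, "1x1 only": 73.15, "identity only": 74.79, "full": 75.14}

# Gain of each variant over the plain model, rounded to two decimals.
gains = {name: round(value - acc["plain"], 2)
         for name, value in acc.items() if name != "plain"}

for name, gain in gains.items():
    print(f"{name:14s} +{gain:.2f}")

# The individual gains do not simply add up to the gain of the full block.
sum_of_parts = round(gains["1x1 only"] + gains["identity only"], 2)
print("sum of single-branch gains:", sum_of_parts, "vs full:", gains["full"])
```

The two single-branch gains sum to more than the gain of the full block, so the branches appear partly redundant rather than strictly additive.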

3.3. Semantic Segmentation

**Semantic segmentation on Cityscapes tested on the validation subset**

The PSPNet framework is used with modifications.
The modified PSPNets run slightly faster than the ResNet-50/101-backbone counterparts.
RepVGG backbones outperform ResNet-50 and ResNet-101 by 1.71% and 1.01% respectively in mean IoU with higher speed, and RepVGG-B1g2-fast outperforms the ResNet-101 backbone by 0.37 in mIoU and runs 62% faster.

By re-parameterization, the 3-branch training-time module becomes a plain single-branch module, which gives a faster inference time.

Reference

[2021 CVPR] [RepVGG]
RepVGG: Making VGG-style ConvNets Great Again

Image Classification

1989 … 2021 [Learned Resizer] [Vision Transformer, ViT] [ResNet Strikes Back] [DeiT] [EfficientNetV2] [MLP-Mixer] [T2T-ViT] [Swin Transformer] [CaiT] [ResMLP] [ResNet-RS] [NFNet] [PVT, PVTv1] [CvT] [HaloNet] [TNT] [CoAtNet] [Focal Transformer] [TResNet] [CPVT] [Twins] [Exemplar-v1, Exemplar-v2] [RepVGG] 2022 [ConvNeXt] [PVTv2]

Written by Sik-Ho Tsang
