# Review — RepVGG: Making VGG-style ConvNets Great Again

## RepVGG, A Plain Network, Outperforms RegNetX, EfficientNet, ResNeXt, ResNet

RepVGG: Making VGG-style ConvNets Great Again, by Tsinghua University, MEGVII Technology, Hong Kong University of Science and Technology, and Aberystwyth University

RepVGG, 2021 CVPR, Over 200 Citations (Sik-Ho Tsang @ Medium)

Image Classification, VGGNet, ResNet

# Outline

1. **Problems of Multi-Branch Models**
2. **RepVGG**
3. **Experimental Results**

# 1. Problems of Multi-Branch Models

## 1.1. Speed

- The theoretical computational density of a 3×3 conv is around 4× that of the other operators, suggesting that **the total theoretical FLOPs are not a comparable proxy for the actual speed** among different architectures.
- For example, VGG-16 has 8.4× the FLOPs of EfficientNet-B3 but runs 1.8× faster on a 1080Ti.
- However, **multi-branch topology is widely adopted** in Inception and auto-generated architectures, where multiple small operators are used instead of a few large ones.
- The number of fragmented operators in **NASNet-A** is 13, which is **unfriendly to** devices with strong parallel computing power like **GPUs**.

## 1.2. Memory

- The **multi-branch topology** is **memory-inefficient** because **the results of every branch need to be kept until the addition or concatenation**, significantly raising the peak memory occupation.
- The corresponding figure in the paper shows that **the input to a residual block needs to be kept until the addition.** Assuming the block maintains the feature map size, **the peak value of extra memory occupation is 2× the input**.
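The 2× figure can be reproduced with a small accounting sketch. This is only an illustration under simplifying assumptions that are not spelled out in the text above: parameters are ignored, every layer keeps the feature-map size, and only the current activation plus any tensor still needed by a later operation is counted. The tensor shape and the helper names `feature_map_bytes` and `peak_feature_bytes` are hypothetical.

```python
def feature_map_bytes(n, c, h, w, bytes_per_value=4):
    """Size of one N x C x H x W feature map (float32 by default)."""
    return n * c * h * w * bytes_per_value


def peak_feature_bytes(input_bytes, num_layers, keeps_block_input):
    """Peak feature-map memory while running `num_layers` size-preserving layers.

    Assumption: at each layer one activation lives on the main path; a residual
    block additionally holds its input until the final addition.
    """
    peak = 0
    for _ in range(num_layers):
        live = input_bytes              # activation on the main path
        if keeps_block_input:
            live += input_bytes         # block input waiting for the addition
        peak = max(peak, live)
    return peak


# Hypothetical feature map: batch 1, 64 channels, 56 x 56 spatial size.
x_bytes = feature_map_bytes(1, 64, 56, 56)
plain_peak = peak_feature_bytes(x_bytes, num_layers=2, keeps_block_input=False)
residual_peak = peak_feature_bytes(x_bytes, num_layers=2, keeps_block_input=True)
ratio = residual_peak / plain_peak
print(plain_peak, residual_peak, ratio)
```

Under these assumptions the plain path never holds more than one input-sized tensor, while the residual block holds two until the addition.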

# 2. RepVGG

## 2.1. Overall

- **(a) ResNet:** It has a multi-path topology during both training and inference, which makes it **slow** and **memory-inefficient**.
- **(b) RepVGG Training:** It has a multi-path topology only during training.
- **(c) RepVGG Inference:** It has a single-path topology during inference, which gives a **fast inference time**.

## 2.2. Training-time Multi-branch Architecture

- One explanation for the success of **ResNets** is that such a **multi-branch architecture makes the model an implicit ensemble of numerous shallower models.**
- Specifically, **with *n* blocks**, the model can be **interpreted as an ensemble of 2^*n* models**, since every block branches the flow into two paths.
- Since the multi-branch topology has **drawbacks for inference** but the branches seem **beneficial to training**, multiple branches are used to make a training-time-only ensemble of numerous models.

RepVGG uses a ResNet-like identity branch (only if the dimensions match) and a 1×1 branch so that the training-time information flow of a building block is *y* = *x* + *g*(*x*) + *f*(*x*), as in (b), where *g* is the 1×1 conv branch and *f* is the 3×3 conv branch.

- The model becomes an ensemble of 3^*n* members with *n* such blocks.
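The ensemble counts follow from choosing one branch per block. As a minimal sketch (the helper name `count_paths` is hypothetical), the paths can be enumerated explicitly and compared with 2^*n* and 3^*n*:

```python
from itertools import product


def count_paths(n_blocks, branches_per_block):
    """Count distinct input-to-output paths by picking one branch in every block."""
    return sum(1 for _ in product(range(branches_per_block), repeat=n_blocks))


n = 4
resnet_members = count_paths(n, 2)   # identity or residual branch in each block
repvgg_members = count_paths(n, 3)   # identity, 1x1 or 3x3 branch in each block
print(resnet_members, repvgg_members)
```

Explicit enumeration is only practical for small *n*; it is used here to make the counting argument concrete, not as something the paper does.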

## 2.3. Re-param for Plain Inference-time Model

- Note that BN is used in each branch before the addition.
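- Let *W*(3) of size *C*2×*C*1×3×3 denote the **kernel of a 3×3 conv layer** with *C*1 input channels and *C*2 output channels, and *W*(1) of size *C*2×*C*1 the **kernel of the 1×1 branch**.
- *μ*(3), *σ*(3), *γ*(3), *β*(3) are the accumulated mean, standard deviation, and learned scaling factor and bias of the **BN layer following the 3×3 conv**. *μ*(1), *σ*(1), *γ*(1), *β*(1) are the corresponding parameters of the **BN following the 1×1 conv**, and *μ*(0), *σ*(0), *γ*(0), *β*(0) those of the identity branch.
- Let *M*(1) of size *N*×*C*1×*H*1×*W*1 and *M*(2) of size *N*×*C*2×*H*2×*W*2 be the **input** and **output**, respectively, and let ∗ be the **convolution operator**.
- If *C*1=*C*2, *H*1=*H*2, *W*1=*W*2, we have:

$$
M^{(2)} = \mathrm{bn}\big(M^{(1)} \ast W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}\big) + \mathrm{bn}\big(M^{(1)} \ast W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}\big) + \mathrm{bn}\big(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}\big)
$$

- where *bn* is the **inference-time BN** function. For every channel *i*:

$$
\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = \big(M_{:,i,:,:} - \mu_i\big)\frac{\gamma_i}{\sigma_i} + \beta_i
$$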

## 2.3.1. BN Merging With Conv

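**Every BN and its preceding conv layer are first converted into a conv with a bias vector.** Let {*W*′, *b*′} be the kernel and bias after conversion:

$$
W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \qquad b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i
$$

- Then the inference-time *bn* becomes:

$$
\mathrm{bn}(M \ast W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M \ast W')_{:,i,:,:} + b'_i
$$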
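The conversion can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: `conv2d`, `bn_inference` and `fuse_conv_bn` are hypothetical helper names, the convolution is a naive stride-1 loop, and `sigma` is assumed to be the standard deviation with any numerical-stability epsilon already folded in.

```python
import numpy as np


def conv2d(x, w, b=None, pad=0):
    """Naive stride-1 convolution. x: (N, C1, H, W), w: (C2, C1, k, k), b: (C2,)."""
    n, _, h, wd = x.shape
    c2, _, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    ho, wo = h + 2 * pad - k + 1, wd + 2 * pad - k + 1
    out = np.zeros((n, c2, ho, wo))
    for i in range(ho):
        for j in range(wo):
            patch = xp[:, :, i:i + k, j:j + k]                     # (N, C1, k, k)
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    if b is not None:
        out += b.reshape(1, -1, 1, 1)
    return out


def bn_inference(m, mu, sigma, gamma, beta):
    """Per-channel inference-time BN: (m - mu) * gamma / sigma + beta."""
    s = (1, -1, 1, 1)
    return (m - mu.reshape(s)) * (gamma / sigma).reshape(s) + beta.reshape(s)


def fuse_conv_bn(w, mu, sigma, gamma, beta):
    """Fold BN into the preceding conv: W' = (gamma/sigma) W, b' = beta - mu*gamma/sigma."""
    scale = gamma / sigma
    return w * scale.reshape(-1, 1, 1, 1), beta - mu * scale


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 5, 5))           # N=2, C1=4
w = rng.normal(size=(3, 4, 3, 3))           # C2=3, 3x3 kernel
mu, beta, gamma = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
sigma = rng.uniform(0.5, 2.0, size=3)

reference = bn_inference(conv2d(x, w, pad=1), mu, sigma, gamma, beta)
w_fused, b_fused = fuse_conv_bn(w, mu, sigma, gamma, beta)
fused = conv2d(x, w_fused, b_fused, pad=1)
max_abs_diff = float(np.abs(reference - fused).max())
print(max_abs_diff)
```

The two outputs should agree up to floating-point rounding, since BN at inference time is a per-channel affine map and convolution is linear.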

## 2.3.2. Merging All Branches

- This transformation also applies to the identity branch because an identity can be viewed as a 1×1 conv with an identity matrix as the kernel.
- After such transformations, we will have one 3×3 kernel, two 1×1 kernels, and three bias vectors.
- Then we obtain the **final bias** by **adding up the three bias vectors**.
- The **final 3×3 kernel is obtained by adding the 1×1 kernels onto the central point of the 3×3 kernel**, which can be easily implemented by first zero-padding the two 1×1 kernels to 3×3 and then adding the three kernels up, as illustrated in the paper's figure.
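The merging step can be sketched in NumPy as follows. This is an illustrative check rather than the authors' code: it assumes the 3×3 and 1×1 branches have already had their BN folded into a kernel and a bias, stores the 1×1 kernel as a 4-D array of shape *C*×*C*×1×1, and uses hypothetical helper names (`conv2d`, `pad_1x1_to_3x3`).

```python
import numpy as np


def conv2d(x, w, b, pad=0):
    """Naive stride-1 convolution. x: (N, C1, H, W), w: (C2, C1, k, k), b: (C2,)."""
    n, _, h, wd = x.shape
    c2, _, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    ho, wo = h + 2 * pad - k + 1, wd + 2 * pad - k + 1
    out = np.zeros((n, c2, ho, wo))
    for i in range(ho):
        for j in range(wo):
            patch = xp[:, :, i:i + k, j:j + k]
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3]))
    return out + b.reshape(1, -1, 1, 1)


def pad_1x1_to_3x3(w1):
    """Zero-pad a (C2, C1, 1, 1) kernel so its value sits at the 3x3 centre."""
    return np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1)))


rng = np.random.default_rng(0)
c = 4
x = rng.normal(size=(2, c, 6, 6))
# 3x3 and 1x1 branches, assumed to be already fused with their BN layers.
w3, b3 = rng.normal(size=(c, c, 3, 3)), rng.normal(size=c)
w1, b1 = rng.normal(size=(c, c, 1, 1)), rng.normal(size=c)
# Identity branch: BN only, expressed as a 1x1 conv with a scaled identity kernel.
mu0, beta0, gamma0 = rng.normal(size=c), rng.normal(size=c), rng.normal(size=c)
sigma0 = rng.uniform(0.5, 2.0, size=c)
scale0 = gamma0 / sigma0
w0 = np.eye(c).reshape(c, c, 1, 1) * scale0.reshape(-1, 1, 1, 1)
b0 = beta0 - mu0 * scale0

# Training-time output: sum of the three branches.
s = (1, -1, 1, 1)
identity_out = (x - mu0.reshape(s)) * scale0.reshape(s) + beta0.reshape(s)
three_branch = conv2d(x, w3, b3, pad=1) + conv2d(x, w1, b1) + identity_out

# Inference-time output: one 3x3 conv with the merged kernel and bias.
w_merged = w3 + pad_1x1_to_3x3(w1) + pad_1x1_to_3x3(w0)
b_merged = b3 + b1 + b0
single_branch = conv2d(x, w_merged, b_merged, pad=1)
print(float(np.abs(three_branch - single_branch).max()))
```

The equivalence relies on the 3×3 conv using padding 1 and the 1×1 conv using padding 0 with the same stride, so that both see the same centre pixel at every output position.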

## 2.4. Architectural Specification

- The 3×3 layers are arranged into **5 stages**, and the first layer of each stage downsamples with stride 2. For image classification, global average pooling followed by a fully-connected layer is used as the head. For other tasks, task-specific heads can be used on the features produced by any layer.
- The five stages have 1, 2, 4, 14, 1 layers respectively to construct an instance named **RepVGG-A**.
- A deeper **RepVGG-B** is built, which has 2 more layers in each of stages 2, 3 and 4.

- **Different variants** are produced by using different *a* and *b*. **Multiplier *a*** is used to **scale the first four stages** and ***b*** is used for the **last stage**, with *b* > *a*.
- To further reduce the parameters and computations, groupwise 3×3 conv layers are interleaved with dense ones to trade accuracy for efficiency. Specifically, the **number of groups *g*** is set for the **3rd, 5th, 7th, …, 21st layers of RepVGG-A** and the additional **23rd, 25th and 27th layers of RepVGG-B**. For simplicity, *g* is set to **1, 2, or 4** globally for such layers without layer-wise tuning.
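The layer layout described above can be written down as a small table generator. This is a sketch built only from the numbers in the text (stage depths, stride-2 first layers, odd-indexed groupwise layers from the 3rd onward); the function names are hypothetical and channel widths are deliberately left out.

```python
def stage_depths(variant):
    """Layers per stage: RepVGG-A as listed above, RepVGG-B with 2 more in stages 2-4."""
    depths = [1, 2, 4, 14, 1]
    if variant == "B":
        depths = [d + 2 if i in (1, 2, 3) else d for i, d in enumerate(depths)]
    elif variant != "A":
        raise ValueError("variant must be 'A' or 'B'")
    return depths


def groupwise_layers(variant):
    """1-based indices of layers that use g groups: 3rd, 5th, 7th, ... (odd layers)."""
    total = sum(stage_depths(variant))
    return list(range(3, total, 2))


def layer_plan(variant, g=1):
    """One dict per 3x3 layer with its stage, stride and number of groups."""
    grouped = set(groupwise_layers(variant))
    plan, index = [], 0
    for stage, depth in enumerate(stage_depths(variant), start=1):
        for k in range(depth):
            index += 1
            plan.append({
                "layer": index,
                "stage": stage,
                "stride": 2 if k == 0 else 1,      # first layer of a stage downsamples
                "groups": g if index in grouped else 1,
            })
    return plan


plan_a = layer_plan("A", g=2)
plan_b = layer_plan("B", g=4)
print(len(plan_a), len(plan_b), groupwise_layers("B")[-3:])
```

With these depths RepVGG-A has 22 conv layers and RepVGG-B has 28, which matches the 21st and 27th layers being the last groupwise ones in each variant.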

# 3. Experimental Results

## 3.1. RepVGG for ImageNet Classification

- **RepVGG-A0 is 1.25% and 33% better than ResNet-18 in terms of accuracy and speed**, RepVGG-A1 is 0.29%/64% better than ResNet-34, and RepVGG-A2 is 0.17%/83% better than ResNet-50.
- **With interleaved groupwise layers (g2/g4), the RepVGG models are further accelerated with a reasonable accuracy decrease**: RepVGG-B1g4 is 0.37%/101% better than ResNet-101, and RepVGG-B1g2 is impressively 2.66× as fast as ResNet-152 with the same accuracy.
- Though the number of parameters is not the primary concern, all the above **RepVGG models are more parameter-efficient than ResNets.**
- **Compared to the classic VGG-16, RepVGG-B2 has only 58% of the parameters, runs 10% faster and shows 6.57% higher accuracy**.

- **RepVGG-A2 is 1.37%/59% better than EfficientNet-B0**, and **RepVGG-B1 performs 0.39% better than RegNetX-3.2GF** and runs slightly faster.
- Notably, RepVGG models reach above 80% accuracy with 200 epochs.

## 3.2. Ablation Study

- With both branches removed, the training-time model degrades into an ordinary plain model and only achieves 72.39% accuracy.
- The accuracy is lifted to 73.15% with the 1×1 branch or 74.79% with the identity branch.
- The accuracy of the **full-featured RepVGG-B0 is 75.14%**, which is **2.75% higher than the ordinary plain model.**

## 3.3. Semantic Segmentation

- PSPNet framework is used with modifications.
- The modified PSPNets run slightly faster than the ResNet-50/101-backbone counterparts.
- RepVGG backbones outperform ResNet-50 and ResNet-101 by 1.71% and 1.01% respectively in mean IoU with higher speed, and RepVGG-B1g2-fast outperforms the ResNet-101 backbone by 0.37 in mIoU and runs 62% faster.

Through re-parameterization, the 3-branch training-time module becomes a plain single-path module at inference, which gives a faster inference time.

## Reference

[2021 CVPR] [RepVGG]

RepVGG: Making VGG-style ConvNets Great Again
