Review — RepVGG: Making VGG-style ConvNets Great Again

RepVGG, A Plain Network, Outperforms RegNetX, EfficientNet, ResNeXt, ResNet

Top-1 accuracy on ImageNet vs. actual speed

RepVGG: Making VGG-style ConvNets Great Again
RepVGG, by Tsinghua University, MEGVII Technology, Hong Kong University of Science and Technology, and Aberystwyth University
2021 CVPR, Over 200 Citations (Sik-Ho Tsang @ Medium)
Image Classification, VGGNet, ResNet

  • By using a structural re-parameterization technique, the RepVGG architecture resembles a ResNet during training and a VGGNet during inference, as shown above.
  • Thus, during inference, RepVGG becomes a plain stack of 3×3 convolutions and ReLU, which gives a fast inference time.

Outline

  1. Problems of Multi-Branch Models
  2. RepVGG
  3. Experimental Results

1. Problems of Multi-Branch Models

1.1. Speed

Speed Test with Varying Kernel Size on GTX 1080 Ti
  • The theoretical computational density of 3×3 conv is around 4× that of the other kernel sizes, suggesting that total theoretical FLOPs is not a reliable proxy for actual speed across different architectures.
  • For example, VGG-16 has 8.4× the FLOPs of EfficientNet-B3 but runs 1.8× faster on a 1080 Ti.
  • However, multi-branch topology is widely adopted in Inception and auto-generated architectures, where multiple small operators are used instead of a few large ones.
  • The number of fragmented operators in NASNet-A is 13, which is unfriendly to devices with strong parallel computing power such as GPUs.

1.2. Memory

Peak memory occupation in residual and plain model
  • The multi-branch topology is memory-inefficient because the results of every branch need to be kept until the addition or concatenation, significantly raising the peak value of memory occupation.
  • The above figure shows that the input to a residual block needs to be kept until the addition. Assuming the block maintains the feature map size, the peak memory occupation is 2× the size of the input.
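To make the 2× figure concrete, here is a minimal sketch of the arithmetic. The feature map shape and the float32 element size are hypothetical values chosen for illustration, and the helper names are my own, not from the paper.

```python
def feature_map_bytes(n, c, h, w, bytes_per_element=4):
    """Size of one N x C x H x W feature map (float32 assumed by default)."""
    return n * c * h * w * bytes_per_element


def peak_block_memory(input_bytes, residual):
    """Peak activation memory of a block that keeps the feature map size.

    Residual block: the input stays alive next to the branch output until
    the addition, so two feature maps are resident at once.
    Plain block: the input can be released once the output is computed.
    """
    return 2 * input_bytes if residual else input_bytes


# Hypothetical example: batch 1, 64 channels, 56 x 56 resolution.
size = feature_map_bytes(1, 64, 56, 56)
print(size, peak_block_memory(size, residual=True), peak_block_memory(size, residual=False))
```

The ratio is what matters here, not the absolute numbers: whatever the shape, the residual block holds twice the input at its peak.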

2. RepVGG

2.1. Overall

Different Architectures are Used During Training and Inference for RepVGG
  • (a) ResNet: It has a multi-path topology during both training and inference, which makes it slow and memory-inefficient.
  • (b) RepVGG Training: It has a multi-path topology only during training.
  • (c) RepVGG Inference: It has a single-path topology during inference, which gives a fast inference time.

2.2. Training-time Multi-branch Architecture

  • One explanation for the success of ResNets is that the multi-branch architecture makes the model an implicit ensemble of numerous shallower models.
  • Specifically, with n blocks, the model can be interpreted as an ensemble of 2^n models, since every block branches the flow into two paths.
  • Since the multi-branch topology has drawbacks for inference but the branches seem beneficial to training, multiple branches are used to make a training-time-only ensemble of numerous models.

RepVGG uses a ResNet-like identity branch (only if the dimensions match) and a 1×1 branch, so that the training-time information flow of a building block is y = x + g(x) + f(x), as in (b).

  • The model becomes an ensemble of 3^n members with n such blocks.
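As a quick illustration of how fast the ensemble grows, the sketch below counts the paths through a stack of blocks. The function name and the value of n are hypothetical, and it assumes every block has all of its branches (the identity branch only exists when the dimensions match).

```python
def ensemble_size(num_blocks, paths_per_block):
    """Number of distinct paths through a stack of multi-branch blocks."""
    return paths_per_block ** num_blocks


n = 4  # hypothetical number of blocks
resnet_like = ensemble_size(n, 2)   # identity + residual function per block
repvgg_like = ensemble_size(n, 3)   # identity + 1x1 + 3x3 per block
print(resnet_like, repvgg_like)
```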

2.3. Re-param for Plain Inference-time Model

Structural re-parameterization of a RepVGG block
  • Note that BN is used in each branch before the addition.
  • Let W(3) of size C2×C1×3×3 denote the kernel of a 3×3 conv layer with C1 input channels and C2 output channels, and W(1) of size C2×C1 the kernel of the 1×1 branch.
  • μ(3), σ(3), γ(3), β(3) are the accumulated mean, standard deviation, learned scaling factor and bias of the BN layer following the 3×3 conv.
  • μ(1), σ(1), γ(1), β(1) are the corresponding parameters of the BN following the 1×1 conv, and μ(0), σ(0), γ(0), β(0) those of the identity branch.
  • Let M(1) of size N×C1×H1×W1 and M(2) of size N×C2×H2×W2 be the input and output, respectively, and let * be the convolution operator.
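  • If C1=C2, H1=H2, W1=W2, we have:

$$M^{(2)} = \mathrm{bn}(M^{(1)} * W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}) + \mathrm{bn}(M^{(1)} * W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}) + \mathrm{bn}(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)})$$

  • where bn is the inference-time BN function, applied per channel i:

$$\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M_{:,i,:,:} - \mu_i)\frac{\gamma_i}{\sigma_i} + \beta_i$$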

2.3.1. BN Merging With Conv

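  • Every BN and its preceding conv layer are first converted into a conv with a bias vector. Let {W′, b′} be the kernel and bias after conversion:

$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \qquad b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i$$

  • Then the inference-time bn becomes:

$$\mathrm{bn}(M * W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M * W')_{:,i,:,:} + b'_i$$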
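The conversion can be checked numerically with a small sketch. It is not the authors' code: it treats the conv output at a single spatial position as a dot product per output channel, takes the standard deviation σ directly as in the notation above (frameworks typically store a running variance plus a small epsilon instead), and all helper names and toy values are hypothetical.

```python
def fuse_conv_bn(kernel, mean, std, gamma, beta):
    """Fold an inference-time BN into the conv that precedes it.

    kernel: one flat list of weights per output channel i.
    mean, std, gamma, beta: per-channel BN statistics and parameters.
    """
    fused_kernel, fused_bias = [], []
    for w_i, mu, sigma, g, b in zip(kernel, mean, std, gamma, beta):
        scale = g / sigma                              # gamma_i / sigma_i
        fused_kernel.append([scale * w for w in w_i])  # W'_i
        fused_bias.append(b - mu * scale)              # b'_i
    return fused_kernel, fused_bias


def conv_at_one_position(kernel, patch, bias=None):
    """Conv output at one spatial position: a dot product per output channel."""
    out = [sum(w * x for w, x in zip(w_i, patch)) for w_i in kernel]
    return out if bias is None else [z + b for z, b in zip(out, bias)]


def bn(z, mean, std, gamma, beta):
    """Inference-time BN applied to per-channel values z."""
    return [(z_i - mu) * g / sigma + b
            for z_i, mu, sigma, g, b in zip(z, mean, std, gamma, beta)]


# Hypothetical toy values: 2 output channels, a flattened 3-element input patch.
kernel = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
patch = [1.0, 2.0, 3.0]
mean, std, gamma, beta = [0.2, -0.1], [1.5, 0.8], [1.1, 0.9], [0.05, -0.3]

reference = bn(conv_at_one_position(kernel, patch), mean, std, gamma, beta)
fused_kernel, fused_bias = fuse_conv_bn(kernel, mean, std, gamma, beta)
fused = conv_at_one_position(fused_kernel, patch, fused_bias)
print(reference, fused)  # the two lists should agree up to rounding
```

Because convolution is linear, the same per-channel scaling holds at every spatial position, so checking one position is enough for the sketch.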

2.3.2. Merging All Branches

  • This transformation also applies to the identity branch because an identity can be viewed as a 1×1 conv with an identity matrix as the kernel.
  • After such transformations, we will have one 3×3 kernel, two 1×1 kernels, and three bias vectors.
  • Then we obtain the final bias by adding up the three bias vectors.
  • The final 3×3 kernel is obtained by adding the 1×1 kernels onto the central point of the 3×3 kernel, which can be easily implemented by first zero-padding the two 1×1 kernels to 3×3 and then adding the three kernels up, as shown in the figure above.
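The merging step can be sketched in plain Python with nested lists. This is an illustration under assumptions, not the official implementation: the three branches are taken as already converted to kernel-plus-bias form, stride is 1 with zero padding 1, and all function names and toy values are hypothetical.

```python
def pad_1x1_to_3x3(k1):
    """Place each 1x1 weight k1[o][i] at the centre of a zero 3x3 kernel."""
    return [[[[0.0, 0.0, 0.0], [0.0, w, 0.0], [0.0, 0.0, 0.0]] for w in row]
            for row in k1]


def identity_as_1x1(channels):
    """Identity mapping written as a 1x1 kernel (an identity matrix)."""
    return [[1.0 if o == i else 0.0 for i in range(channels)]
            for o in range(channels)]


def merge_branches(k3, b3, k1, b1, k0, b0):
    """Sum the 3x3 kernel with the two zero-padded 1x1 kernels, and the biases."""
    p1, p0 = pad_1x1_to_3x3(k1), pad_1x1_to_3x3(k0)
    kernel = [[[[k3[o][i][r][c] + p1[o][i][r][c] + p0[o][i][r][c]
                 for c in range(3)] for r in range(3)]
               for i in range(len(k3[o]))] for o in range(len(k3))]
    bias = [x + y + z for x, y, z in zip(b3, b1, b0)]
    return kernel, bias


def conv3x3(x, kernel, bias):
    """Stride-1 3x3 convolution with zero padding 1; x is indexed [C][H][W]."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for o, k_o in enumerate(kernel):
        plane = [[bias[o] for _ in range(w)] for _ in range(h)]
        for i, k_oi in enumerate(k_o):
            for y in range(h):
                for col in range(w):
                    for r in range(3):
                        for c in range(3):
                            yy, cc = y + r - 1, col + c - 1
                            if 0 <= yy < h and 0 <= cc < w:
                                plane[y][col] += k_oi[r][c] * x[i][yy][cc]
        out.append(plane)
    return out


def conv1x1(x, kernel, bias):
    """1x1 convolution: a per-pixel linear map across channels."""
    h, w = len(x[0]), len(x[0][0])
    return [[[bias[o] + sum(kernel[o][i] * x[i][y][col] for i in range(len(x)))
              for col in range(w)] for y in range(h)]
            for o in range(len(kernel))]


# Hypothetical toy data: 2 channels, 3 x 3 feature map.
x = [[[1.0, 2.0, 0.0], [0.5, -1.0, 3.0], [2.0, 1.0, -2.0]],
     [[0.0, 1.0, 1.0], [-1.0, 0.5, 2.0], [1.5, 0.0, 1.0]]]
k3 = [[[[0.1 * (o + i + r - c) for c in range(3)] for r in range(3)]
       for i in range(2)] for o in range(2)]
b3 = [0.1, -0.2]
k1, b1 = [[0.5, -0.3], [0.2, 0.4]], [0.05, 0.0]
# Identity branch after BN fusion: identity matrix scaled by a made-up gamma/sigma.
k0, b0 = [[0.9 * w for w in row] for row in identity_as_1x1(2)], [0.01, -0.01]

kernel, bias = merge_branches(k3, b3, k1, b1, k0, b0)
single = conv3x3(x, kernel, bias)
y3, y1, y0 = conv3x3(x, k3, b3), conv1x1(x, k1, b1), conv1x1(x, k0, b0)
max_diff = max(abs(single[o][r][c] - (y3[o][r][c] + y1[o][r][c] + y0[o][r][c]))
               for o in range(2) for r in range(3) for c in range(3))
print(max_diff)  # should be zero up to floating-point rounding
```

The single merged 3×3 conv reproduces the sum of the three branches, which is why the inference-time model can drop the branches entirely.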

2.4. Architectural Specification

Architectural specification of RepVGG. Here 2×64a means stage2 has 2 layers each with 64a channels
  • The 3×3 layers are arranged into 5 stages, and the first layer of each stage downsamples with stride 2. For image classification, global average pooling followed by a fully-connected layer is used as the head. For other tasks, task-specific heads can be used on the features produced by any layer.
  • The five stages have 1, 2, 4, 14, 1 layers respectively to construct an instance named RepVGG-A.
  • A deeper RepVGG-B is built, which has 2 more layers in each of stages 2, 3 and 4.
RepVGG models defined by multipliers a and b
  • Different variants are produced by using different a and b.
  • Multiplier a is used to scale the first four stages and b is used for the last stage, with b>a.
  • To further reduce parameters and computation, groupwise 3×3 conv layers are interleaved with dense ones to trade accuracy for efficiency. Specifically, the number of groups g is set for the 3rd, 5th, 7th, …, 21st layers of RepVGG-A and the additional 23rd, 25th and 27th layers of RepVGG-B. For simplicity, g is set to 1, 2, or 4 globally for such layers without layer-wise tuning.
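The depth and groupwise-layer rules above can be written down in a few lines. This is a sketch with hypothetical helper names. The "every odd layer from the 3rd on" rule is my reading of the layer indices listed above, not something the text states as a general rule.

```python
REPVGG_A_DEPTHS = [1, 2, 4, 14, 1]  # layers per stage, stages 1..5


def repvgg_b_depths(a_depths):
    """RepVGG-B: 2 more layers in each of stages 2, 3 and 4."""
    return [d + 2 if stage in (2, 3, 4) else d
            for stage, d in enumerate(a_depths, start=1)]


def groupwise_layers(depths):
    """1-based indices of layers that may use groupwise conv.

    Assumed rule: every odd-numbered layer starting from the 3rd.
    """
    return list(range(3, sum(depths) + 1, 2))


b_depths = repvgg_b_depths(REPVGG_A_DEPTHS)
print(sum(REPVGG_A_DEPTHS), groupwise_layers(REPVGG_A_DEPTHS))
print(sum(b_depths), groupwise_layers(b_depths))
```

With these depths, RepVGG-A has 22 conv layers and RepVGG-B has 28, so the lists end at the 21st and 27th layers respectively, matching the indices given above.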

3. Experimental Results

3.1. RepVGG for ImageNet Classification

Results trained on ImageNet with simple data augmentation in 120 epochs
  • RepVGG-A0 is 1.25% and 33% better than ResNet-18 in terms of accuracy and speed, RepVGG-A1 is 0.29%/64% better than ResNet-34, and RepVGG-A2 is 0.17%/83% better than ResNet-50.
  • With interleaved groupwise layers (g2/g4), the RepVGG models are further accelerated with a reasonable accuracy decrease: RepVGG-B1g4 is 0.37%/101% better than ResNet-101, and RepVGG-B1g2 is impressively 2.66× as fast as ResNet-152 with the same accuracy.
  • Though the number of parameters is not the primary concern, all the above RepVGG models are more parameter-efficient than ResNets.
  • Compared to the classic VGG-16, RepVGG-B2 has only 58% of the parameters, runs 10% faster and shows 6.57% higher accuracy.
Results on ImageNet trained in 200 epochs with Autoaugment, label smoothing and mixup
  • RepVGG-A2 is 1.37%/59% better than EfficientNet-B0 in accuracy/speed, and RepVGG-B1 performs 0.39% better than RegNetX-3.2GF while running slightly faster.
  • Notably, RepVGG models reach above 80% accuracy with 200 epochs.

3.2. Ablation Study

Ablation studies with 120 epochs on RepVGG-B0
  • With both branches removed, the training-time model degrades into an ordinary plain model and only achieves 72.39% accuracy.
  • The accuracy is lifted to 73.15% with only the 1×1 branch or 74.79% with only the identity branch.
  • The accuracy of the full-featured RepVGG-B0 is 75.14%, which is 2.75% higher than the ordinary plain model.

3.3. Semantic Segmentation

Semantic segmentation on Cityscapes tested on the validation subset
  • PSPNet framework is used with modifications.
  • The modified PSPNets run slightly faster than the ResNet-50/101-backbone counterparts.
  • RepVGG backbones outperform ResNet-50 and ResNet-101 by 1.71% and 1.01% respectively in mean IoU with higher speed, and RepVGG-B1g2-fast outperforms the ResNet-101 backbone by 0.37 in mIoU and runs 62% faster.

Through re-parameterization, the 3-branch network module becomes a plain network module, which gives a faster inference time.
