Review — RepVGG: Making VGG-style ConvNets Great Again
RepVGG, A Plain Network, Outperforms RegNetX, EfficientNet, ResNeXt, ResNet
RepVGG: Making VGG-style ConvNets Great Again
RepVGG, by Tsinghua University, MEGVII Technology, Hong Kong University of Science and Technology, and Aberystwyth University
2021 CVPR, Over 200 Citations (Sik-Ho Tsang @ Medium)
Image Classification, VGGNet, ResNet
Outline
- Problems of Multi-Branch Models
- RepVGG
- Experimental Results
1. Problems of Multi-Branch Models
1.1. Speed
- The theoretical computational density of 3×3 conv is around 4× that of the other kernel sizes, which suggests that total theoretical FLOPs are not a reliable proxy for actual speed across different architectures.
- For example, VGG-16 has 8.4× the FLOPs of EfficientNet-B3 but runs 1.8× faster on a 1080Ti.
- However, multi-branch topology is widely adopted in Inception and auto-generated architectures, where multiple small operators are used instead of a few large ones.
- The number of fragmented operators in NASNet-A is 13, which is unfriendly to devices with strong parallel computing power such as GPUs.
1.2. Memory
- The multi-branch topology is memory-inefficient because the results of every branch need to be kept until the addition or concatenation, significantly raising the peak value of memory occupation.
- The figure above shows that the input to a residual block needs to be kept until the addition. Assuming the block maintains the feature map size, the peak memory occupation is 2× the size of the input.
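As a rough illustration of the 2× figure, the sketch below counts how many feature maps have to be alive at the peak of a size-preserving block. It follows the simplified accounting of the figure (a plain layer is charged one feature map, a residual block two). The tensor shape and the 4-byte float size are assumptions chosen for the example, and real frameworks may reuse or fuse buffers differently.

```python
def feature_map_bytes(n, c, h, w, bytes_per_elem=4):
    """Size in bytes of one N x C x H x W activation tensor (float32 assumed)."""
    return n * c * h * w * bytes_per_elem


def peak_activation_bytes(n, c, h, w, residual):
    """Peak activation memory for one block that keeps the feature map size.

    Plain layer: the input can be released as soon as the layer finishes,
    so it is charged one feature map.
    Residual block: the input must be kept next to the branch result until
    the addition, so two feature maps are held at the peak.
    """
    one_map = feature_map_bytes(n, c, h, w)
    return 2 * one_map if residual else one_map


# Hypothetical shape: batch 1, 64 channels, 56 x 56 resolution.
plain = peak_activation_bytes(1, 64, 56, 56, residual=False)
res = peak_activation_bytes(1, 64, 56, 56, residual=True)
print(plain, res, res / plain)
```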
2. RepVGG
2.1. Overall
- (a) ResNet: It has a multi-path topology during both training and inference, which makes it slow and memory-inefficient.
- (b) RepVGG Training: It has a multi-path topology only during training.
- (c) RepVGG Inference: It has a single-path topology at inference, which allows fast inference.
2.2. Training-time Multi-branch Architecture
- One explanation for the success of ResNets is that the multi-branch architecture makes the model an implicit ensemble of numerous shallower models.
- Specifically, with n blocks, the model can be interpreted as an ensemble of 2^n models, since every block branches the flow into two paths.
- Since the multi-branch topology has drawbacks for inference but the branches seem beneficial to training, multiple branches are used only at training time to form an ensemble of numerous models.
RepVGG uses ResNet-like identity (only if the dimensions match) and 1×1 branches so that the training-time information flow of a building block is y = x + g(x) + f(x), as in (b), where g is the 1×1 branch and f is the 3×3 branch.
- The model becomes an ensemble of 3^n members with n such blocks.
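The 2^n and 3^n counts can be checked by enumerating one branch choice per block. This is only a counting sketch: `count_paths` is a hypothetical helper, and a block without an identity branch (dimensions do not match) contributes a factor of 2 instead of 3.

```python
from itertools import product


def count_paths(branches_per_block):
    """Count distinct routes through a stack of blocks, one branch per block."""
    choices = [range(b) for b in branches_per_block]
    return sum(1 for _ in product(*choices))


n = 4
resnet_like = count_paths([2] * n)      # identity or residual branch
repvgg_like = count_paths([3] * n)      # identity, 1x1 or 3x3 branch
mixed = count_paths([2] + [3] * (n - 1))  # first block has no identity branch
print(resnet_like, repvgg_like, mixed)
```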
2.3. Re-param for Plain Inference-time Model
- Note that BN is used in each branch before the addition.
- Let W(3) of size C2×C1×3×3 denote the kernel of a 3×3 conv layer with C1 input channels and C2 output channels, and W(1) of size C2×C1 denote the kernel of the 1×1 branch.
- μ(3), σ(3), γ(3), β(3) are the accumulated mean, standard deviation, learned scaling factor and bias of the BN layer following the 3×3 conv.
- μ(1), σ(1), γ(1), β(1) are the corresponding parameters of the BN following the 1×1 conv, and μ(0), σ(0), γ(0), β(0) those of the identity branch.
- Let M(1) of size N×C1×H1×W1 and M(2) of size N×C2×H2×W2 be the input and output, respectively, and let * be the convolution operator.
- If C1=C2, H1=H2, W1=W2, we have:
- where bn is the inference-time BN function:
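Written out with the symbols defined in this section, the output of the three-branch block and the inference-time BN function take the following form. This is a reconstruction from the definitions above and the standard inference-time BN formula, so treat the exact index notation as a sketch.

```latex
\begin{aligned}
M^{(2)} &= \mathrm{bn}\bigl(M^{(1)} * W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}\bigr) \\
        &\quad + \mathrm{bn}\bigl(M^{(1)} * W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}\bigr) \\
        &\quad + \mathrm{bn}\bigl(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}\bigr), \\[4pt]
\mathrm{bn}(M, \mu, \sigma, \gamma, \beta)_{:,i,:,:} &= \bigl(M_{:,i,:,:} - \mu_i\bigr)\,\frac{\gamma_i}{\sigma_i} + \beta_i,
\qquad 1 \le i \le C_2 .
\end{aligned}
```

If the dimensions do not match, the identity term is simply absent and only the first two terms remain.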
2.3.1. BN Merging With Conv
- Every BN and its preceding conv layer are first converted into a conv with a bias vector. Let {W′, b′} be the kernel and bias after conversion:
- Then the inference-time bn becomes:
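The conversion follows from distributing the per-channel BN scale over the convolution, since (M * W − μ)·γ/σ + β = M * (γ/σ · W) + (β − μγ/σ). Reconstructed from the definitions in this section, with i indexing the output channel:

```latex
\begin{aligned}
W'_{i,:,:,:} &= \frac{\gamma_i}{\sigma_i}\, W_{i,:,:,:},
\qquad
b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i, \\[4pt]
\mathrm{bn}(M * W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} &= (M * W')_{:,i,:,:} + b'_i .
\end{aligned}
```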
2.3.2. Merging All Branches
- This transformation also applies to the identity branch because an identity can be viewed as a 1×1 conv with an identity matrix as the kernel.
- After such transformations, we will have one 3×3 kernel, two 1×1 kernels, and three bias vectors.
- The final bias is obtained by adding up the three bias vectors.
- The final 3×3 kernel is obtained by adding the 1×1 kernels onto the central point of the 3×3 kernel. This can be easily implemented by first zero-padding the two 1×1 kernels to 3×3 and then adding the three kernels up, as shown in the figure above.
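The whole procedure can be sketched in plain Python and checked numerically on a tiny example. This is not the authors' implementation: `conv2d`, `fuse`, `pad_1x1_to_3x3` and the other helpers are hypothetical names written for this illustration, the tensors are nested lists for a single image, stride is 1, and σ is passed in directly (a real BN layer would use sqrt(running variance + eps)).

```python
import random


def conv2d(x, w, b=None):
    """Stride-1, zero-padded 'same' conv (cross-correlation).
    x: [C1][H][W], w: [C2][C1][k][k], b: [C2] or None."""
    h, wd = len(x[0]), len(x[0][0])
    k = len(w[0][0])
    p = k // 2
    out = []
    for o, w_o in enumerate(w):
        start = b[o] if b is not None else 0.0
        plane = [[start for _ in range(wd)] for _ in range(h)]
        for c, w_oc in enumerate(w_o):
            for di in range(k):
                for dj in range(k):
                    for i in range(h):
                        for j in range(wd):
                            ii, jj = i + di - p, j + dj - p
                            if 0 <= ii < h and 0 <= jj < wd:
                                plane[i][j] += w_oc[di][dj] * x[c][ii][jj]
        out.append(plane)
    return out


def bn(x, mu, sigma, gamma, beta):
    """Inference-time BN, applied per channel."""
    return [[[(v - mu[c]) * gamma[c] / sigma[c] + beta[c] for v in row]
             for row in x[c]] for c in range(len(x))]


def fuse(w, mu, sigma, gamma, beta):
    """Fold BN into the preceding conv: W' = (gamma/sigma) W, b' = beta - mu*gamma/sigma."""
    w_f = [[[[v * gamma[o] / sigma[o] for v in row] for row in w_oc]
            for w_oc in w[o]] for o in range(len(w))]
    b_f = [beta[o] - mu[o] * gamma[o] / sigma[o] for o in range(len(w))]
    return w_f, b_f


def pad_1x1_to_3x3(w):
    """Put each 1x1 weight at the centre of an otherwise zero 3x3 kernel."""
    return [[[[0.0, 0.0, 0.0], [0.0, w_oc[0][0], 0.0], [0.0, 0.0, 0.0]]
             for w_oc in w_o] for w_o in w]


def add(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    return [add(u, v) for u, v in zip(a, b)] if isinstance(a, list) else a + b


def rand(*shape):
    """Nested list of the given shape with values in [-1, 1]."""
    if len(shape) == 1:
        return [random.uniform(-1, 1) for _ in range(shape[0])]
    return [rand(*shape[1:]) for _ in range(shape[0])]


def rand_bn(c):
    """Random (mu, sigma, gamma, beta); sigma is kept positive."""
    return rand(c), [random.uniform(0.5, 1.5) for _ in range(c)], rand(c), rand(c)


def flat(t):
    """Flatten a nested list into a flat list of numbers."""
    return [v for s in t for v in flat(s)] if isinstance(t, list) else [t]


random.seed(0)
C, H, W = 2, 4, 4                      # C1 = C2, so the identity branch exists
x = rand(C, H, W)
w3, w1 = rand(C, C, 3, 3), rand(C, C, 1, 1)
# Identity viewed as a 1x1 conv whose kernel is the identity matrix.
w0 = [[[[1.0 if o == c else 0.0]] for c in range(C)] for o in range(C)]
bn3, bn1, bn0 = rand_bn(C), rand_bn(C), rand_bn(C)

# Training-time block: three branches, each followed by its own BN.
y_train = add(add(bn(conv2d(x, w3), *bn3), bn(conv2d(x, w1), *bn1)), bn(x, *bn0))

# Re-parameterised block: fold each BN, pad the 1x1 kernels, sum kernels and biases.
(k3, b3), (k1, b1), (k0, b0) = fuse(w3, *bn3), fuse(w1, *bn1), fuse(w0, *bn0)
kernel = add(add(k3, pad_1x1_to_3x3(k1)), pad_1x1_to_3x3(k0))
bias = add(add(b3, b1), b0)
y_deploy = conv2d(x, kernel, bias)

max_err = max(abs(u - v) for u, v in zip(flat(y_train), flat(y_deploy)))
print("max abs difference:", max_err)
```

The two outputs should agree up to floating-point rounding, which is what allows the three training-time branches to be replaced by a single 3×3 conv with bias at inference.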
2.4. Architectural Specification
- The 3×3 layers are arranged into 5 stages, and the first layer of each stage downsamples with stride 2. For image classification, global average pooling followed by a fully-connected layer is used as the head. For other tasks, task-specific heads can be used on the features produced by any layer.
- The five stages have 1, 2, 4, 14 and 1 layers respectively, which gives an instance named RepVGG-A.
- A deeper RepVGG-B has 2 more layers in each of stages 2, 3 and 4.
- Multiplier a scales the width of the first four stages and multiplier b scales the last stage, with b>a.
- Different variants are produced by using different values of a and b.
- To further reduce the parameters and computations, groupwise 3×3 conv layers are interleaved with dense ones to trade accuracy for efficiency. Specifically, the number of groups g is set for the 3rd, 5th, 7th, …, 21st layers of RepVGG-A and additionally for the 23rd, 25th and 27th layers of RepVGG-B. For simplicity, g is set to 1, 2, or 4 globally for such layers, without layer-wise tuning.
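The specification above can be tabulated with a small helper. This is a sketch under stated assumptions: `repvgg_spec` is a hypothetical function, the base widths 64, 128, 256, 512 for stages 2 to 5 and min(64, 64a) for stage 1 come from the paper rather than from the text above, and the (a, b) pairs are example inputs that I believe correspond to RepVGG-A0 and RepVGG-B0.

```python
def repvgg_spec(depths, a, b, last_groupwise_layer):
    """Return [(layers, width) per stage] and the 1-based groupwise layer indices.

    Widths assume base channels [64, 128, 256, 512] for stages 2-5 and
    min(64, 64a) for stage 1 (taken from the paper, not from this review).
    """
    widths = [min(64, int(64 * a)), int(64 * a), int(128 * a),
              int(256 * a), int(512 * b)]
    # Every other layer starting from the 3rd may use groupwise conv.
    groupwise = list(range(3, last_groupwise_layer + 1, 2))
    return list(zip(depths, widths)), groupwise


A_DEPTHS = [1, 2, 4, 14, 1]
B_DEPTHS = [1, 4, 6, 16, 1]  # two more layers in stages 2, 3 and 4

stages_a, groupwise_a = repvgg_spec(A_DEPTHS, a=0.75, b=2.5, last_groupwise_layer=21)
stages_b, groupwise_b = repvgg_spec(B_DEPTHS, a=1.0, b=2.5, last_groupwise_layer=27)

print(stages_a, groupwise_a)
print(stages_b, groupwise_b)
```

With these depths RepVGG-A has 22 conv layers and RepVGG-B has 28, which is why the groupwise indices stop at 21 and 27 respectively.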
3. Experimental Results
3.1. RepVGG for ImageNet Classification
- In terms of accuracy/speed, RepVGG-A0 is 1.25%/33% better than ResNet-18, RepVGG-A1 is 0.29%/64% better than ResNet-34, and RepVGG-A2 is 0.17%/83% better than ResNet-50.
- With interleaved groupwise layers (g2/g4), the RepVGG models are further accelerated with a reasonable accuracy decrease: RepVGG-B1g4 is 0.37%/101% better than ResNet-101, and RepVGG-B1g2 is 2.66× as fast as ResNet-152 with the same accuracy.
- Though the number of parameters is not the primary concern, all the above RepVGG models are more parameter-efficient than ResNets.
- Compared to the classic VGG-16, RepVGG-B2 has only 58% of the parameters, runs 10% faster and shows 6.57% higher accuracy.
- RepVGG-A2 is 1.37%/59% better than EfficientNet-B0 in accuracy/speed, and RepVGG-B1 performs 0.39% better than RegNetX-3.2GF while running slightly faster.
- Notably, RepVGG models reach above 80% accuracy with 200 epochs.
3.2. Ablation Study
- With both branches removed, the training-time model degrades into an ordinary plain model and only achieves 72.39% accuracy.
- The accuracy is lifted to 73.15% with only the 1×1 branch, or 74.79% with only the identity branch.
- The accuracy of the full-featured RepVGG-B0 is 75.14%, which is 2.75% higher than the ordinary plain model.
3.3. Semantic Segmentation
- The PSPNet framework is used with modifications.
- The modified PSPNets run slightly faster than the ResNet-50/101-backbone counterparts.
- RepVGG backbones outperform ResNet-50 and ResNet-101 by 1.71% and 1.01% respectively in mean IoU with higher speed, and RepVGG-B1g2-fast outperforms the ResNet-101 backbone by 0.37 in mIoU and runs 62% faster.
By re-parameterization, the 3-branch training-time module becomes a plain single-branch module, which gives faster inference.
Reference
[2021 CVPR] [RepVGG]
RepVGG: Making VGG-style ConvNets Great Again
Image Classification
1989 … 2021 [Learned Resizer] [Vision Transformer, ViT] [ResNet Strikes Back] [DeiT] [EfficientNetV2] [MLP-Mixer] [T2T-ViT] [Swin Transformer] [CaiT] [ResMLP] [ResNet-RS] [NFNet] [PVT, PVTv1] [CvT] [HaloNet] [TNT] [CoAtNet] [Focal Transformer] [TResNet] [CPVT] [Twins] [Exemplar-v1, Exemplar-v2] [RepVGG] 2022 [ConvNeXt] [PVTv2]