Review — Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs
RepLKNet: Using Up To 31×31 Large Kernel Convolutions
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs,
RepLKNet, by Tsinghua University, MEGVII Technology, and Aberystwyth University,
2022 CVPR, Over 130 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT, Swin Transformer

1.1. Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====
- Five model design guidelines are suggested, e.g., applying re-parameterized large depth-wise convolutions, to design efficient high-performance large-kernel CNNs.
- Following the guidelines, RepLKNet, a pure CNN architecture, is proposed whose kernel size is as large as 31×31, in contrast to commonly used 3×3.
- RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving results comparable or superior to Swin Transformer.
Outline
- RepLKNet: Five Guidelines
- RepLKNet Model Architecture
- Results
1. RepLKNet: Five Guidelines
1.1. Guideline 1: Large Depth-Wise (DW) Convolutions can be Efficient in Practice
- Large-kernel convolutions are computationally expensive. The drawback can be greatly overcome by applying depth-wise (DW) convolutions (MobileNetV1).
- For example, in the proposed RepLKNet, increasing the kernel sizes in different stages from [3, 3, 3, 3] to [31, 29, 27, 13] only increases the FLOPs and number of parameters by 18.6% and 10.4% respectively, which is acceptable. The remaining 1×1 convolutions actually account for most of the complexity.
Off-the-shelf deep learning tools (such as PyTorch) support large DW convolutions poorly.
- Block-wise (inverse) implicit GEMM algorithm is a better choice.
- The implementation has been integrated into the open-sourced framework MegEngine. The authors also release a PyTorch implementation, which is far more efficient than PyTorch's default DW convolution.
With the optimization, the latency contribution of DW convolutions in RepLKNet reduces from 49.5% to 12.3%, which is roughly in proportion to the FLOPs occupation.
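As a quick sanity check of this guideline, here is a minimal PyTorch sketch (the channel count of 256 is an arbitrary choice) comparing the parameter counts of a dense and a depth-wise 31×31 convolution:

```python
import torch.nn as nn

channels = 256

# Dense 31x31 convolution: every output channel mixes every input channel.
dense = nn.Conv2d(channels, channels, kernel_size=31, padding=15)

# Depth-wise 31x31 convolution (groups == channels): each channel is
# filtered independently, cutting weights and FLOPs by a factor of `channels`.
dw = nn.Conv2d(channels, channels, kernel_size=31, padding=15, groups=channels)

print(sum(p.numel() for p in dense.parameters()))  # ~63.0M parameters
print(sum(p.numel() for p in dw.parameters()))     # ~0.25M parameters
```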
1.2. Guideline 2: Identity Shortcut is Vital Especially for Networks With Very Large Kernels
- All the DW 3×3 layers are simply replaced with 13×13. Large kernels improve the accuracy of MobileNetV2 with shortcuts by 0.77%. However, without shortcuts, large kernels reduce the accuracy to only 53.98%.
Shortcuts make the model an implicit ensemble composed of numerous models with different receptive fields (RFs), so it can benefit from a much larger maximum RF while not losing the ability to capture small-scale patterns.
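A minimal sketch of the pattern, assuming a BN layer after the DW conv as in the MobileNetV2 experiments above:

```python
import torch.nn as nn

class LargeKernelDW(nn.Module):
    """A large DW conv wrapped by an identity shortcut (Guideline 2)."""
    def __init__(self, channels, kernel_size=13):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Dropping "+ x" is what collapses the accuracy to 53.98% in the
        # experiment described above.
        return x + self.bn(self.dw(x))
```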
1.3. Guideline 3: Re-Parameterizing (RepVGG) with Small Kernels Helps to Make Up for the Optimization Issue
- The 3×3 layers of MobileNetV2 are replaced by 9×9 and 13×13 respectively, and the Structural Reparameterization (RepVGG) methodology is optionally adopted.
- Specifically, a 3×3 layer is constructed parallel to the large one, then their outputs are added up after Batch normalization (BN) layers, as shown in the above figure.
After training, the small kernel as well as BN parameters are merged into the large kernel, so the resultant model is equivalent to the model for training but no longer has small kernels.
The above table shows directly increasing the kernel size from 9 to 13 reduces the accuracy, while re-parameterization addresses the issue.
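A minimal sketch of this RepVGG-style merging, assuming both branches are depth-wise with the same stride and compatible padding (the authors' released code may differ in details):

```python
import torch.nn.functional as F

def fuse_conv_bn(weight, bn):
    # Fold BatchNorm statistics into the preceding conv's weight and bias.
    std = (bn.running_var + bn.eps).sqrt()
    t = (bn.weight / std).reshape(-1, 1, 1, 1)
    return weight * t, bn.bias - bn.running_mean * bn.weight / std

def merge_small_into_large(w_large, bn_large, w_small, bn_small):
    # Zero-padding a small kernel to the large kernel's size is equivalent
    # to running the two branches in parallel and summing their outputs.
    wl, bl = fuse_conv_bn(w_large, bn_large)
    ws, bs = fuse_conv_bn(w_small, bn_small)
    pad = (wl.shape[-1] - ws.shape[-1]) // 2
    ws = F.pad(ws, [pad] * 4)   # e.g. 3x3 -> 13x13, zeros around the center
    return wl + ws, bl + bs     # a single equivalent large kernel
```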
1.4. Guideline 4: Large Convolutions Boost Downstream Tasks Much More Than ImageNet Classification
- Increasing the kernel size of MobileNetV2 from 3×3 to 9×9 improves the ImageNet accuracy by 1.33% but the Cityscapes mIoU by 3.99%, suggesting that a large effective receptive field (ERF) is more important for semantic segmentation tasks.
- Therefore, large kernel design significantly increases the ERFs.
- Also, large kernel design contributes more shape bias to the network. Humans recognize objects mainly based on shape cues rather than texture, so a model with a stronger shape bias may transfer better to downstream tasks.
1.5. Guideline 5: Large Kernel (e.g., 13×13) is Useful Even on Small Feature Maps (e.g., 7×7)
- DW convolutions are enlarged in the last stage of MobileNetV2 to 7×7 or 13×13.
- Although convolutions in the last stage already involve very large receptive field, further increasing the kernel sizes still leads to performance improvements.
As illustrated in the above figure, two outputs at adjacent spatial locations share only a fraction of the kernel weights. Large kernels not only help to learn the relative positions between concepts, but also encode absolute position information due to the padding effect.
2. RepLKNet Model Architecture
2.1. Stem
- After the first 3×3 convolution with 2× downsampling, a DW 3×3 layer is arranged to capture low-level patterns, followed by a 1×1 conv and another DW 3×3 layer that performs a further 2× downsampling, as sketched below.
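A minimal sketch of the stem under this description, with BN and activations omitted and the channel width chosen arbitrarily:

```python
import torch.nn as nn

def stem(in_ch=3, out_ch=128):
    # 3x3 conv (2x down) -> DW 3x3 -> 1x1 conv -> DW 3x3 (2x down)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),
        nn.Conv2d(out_ch, out_ch, 1),
        nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1, groups=out_ch),
    )
```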
2.2. Stages 1–4
- Each contains several RepLK Blocks, which use shortcuts (Guideline 2) and DW large kernels (Guideline 1).
- As a common practice, a 1×1 conv is used before and after each DW conv.
- Each DW large conv uses a 5×5 kernel for re-parameterization (Guideline 3).
- 1×1 layers are also used to increase the depth. Inspired by the feed-forward network (FFN) layer in Transformers, a similar CNN-style block is used, composed of a shortcut, BN, two 1×1 layers, and GELU, so it is referred to as the ConvFFN Block.
- As a common practice, the number of internal channels of the ConvFFN Block is 4× that of the input. Simply following ViT and Swin, which interleave attention and FFN blocks, a ConvFFN is placed after each RepLK Block, as sketched below.
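A minimal sketch of the two block types under the description above; the parallel 5×5 re-parameterization branch and the exact BN/activation placement are simplified:

```python
import torch.nn as nn

class RepLKBlock(nn.Module):
    """1x1 conv -> large DW conv -> 1x1 conv, plus an identity shortcut."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pw1 = nn.Conv2d(channels, channels, 1)
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.pw2 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x + self.pw2(self.dw(self.pw1(x)))

class ConvFFN(nn.Module):
    """Shortcut + BN + two 1x1 convs with GELU, 4x channel expansion."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.fc1 = nn.Conv2d(channels, channels * ratio, 1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(channels * ratio, channels, 1)

    def forward(self, x):
        return x + self.fc2(self.act(self.fc1(self.bn(x))))
```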
2.3. Transition Blocks
- Transition blocks are placed between stages, which first increase the channel dimension via 1×1 conv, and then conduct 2× downsampling with DW 3×3 conv.
Each stage has three architectural hyper-parameters: the number of RepLK Blocks B, the channel dimension C, and the kernel size K, so that a RepLKNet architecture is defined by [B1, B2, B3, B4], [C1, C2, C3, C4], [K1, K2, K3, K4].
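A minimal sketch of a transition block as described:

```python
import torch.nn as nn

def transition(in_ch, out_ch):
    # 1x1 conv raises the channel dimension, then a DW 3x3 conv
    # performs the 2x downsampling between stages.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1),
        nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1, groups=out_ch),
    )
```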
2.4. Model Variants
By fixing B=[2, 2, 18, 2] and C=[128, 256, 512, 1024] and varying K, the kernel sizes are casually set as [13, 13, 13, 13], [25, 25, 25, 13], and [31, 29, 27, 13] respectively, and the models are referred to as RepLKNet-13/25/31.
Two small-kernel baselines are also constructed where the kernel sizes are all 3 or 7 (RepLKNet-3/7).
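Expressed as plain configs (the dict layout here is illustrative, not the authors' code):

```python
# [B1..B4] blocks, [C1..C4] channels, [K1..K4] kernel sizes per stage.
replknet_variants = {
    "RepLKNet-13": dict(B=[2, 2, 18, 2], C=[128, 256, 512, 1024],
                        K=[13, 13, 13, 13]),
    "RepLKNet-25": dict(B=[2, 2, 18, 2], C=[128, 256, 512, 1024],
                        K=[25, 25, 25, 13]),
    "RepLKNet-31": dict(B=[2, 2, 18, 2], C=[128, 256, 512, 1024],
                        K=[31, 29, 27, 13]),
}
```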
3. Results
3.1. Image Classification on ImageNet
With only ImageNet-1K training, RepLKNet-31B reaches 84.8% accuracy, which is 0.3% higher than Swin-B, and runs 43% faster.
Even though RepLKNet-XL has higher FLOPs than Swin-L, it runs faster.
3.2. Semantic Segmentation on Cityscapes & ADE20K
- The pretrained models are used as the backbones of UPerNet.
On Cityscapes, ImageNet-1K-pretrained RepLKNet-31B outperforms Swin-B by a significant margin.
On ADE20K, RepLKNet-31B outperforms Swin-B with both 1K and 22K pretraining, and the margins of single-scale mIoU are particularly significant. Pretrained with the proposed semi-supervised dataset MegData73M, RepLKNet-XL achieves an mIoU of 56.0, demonstrating its scalability toward large-scale vision applications.
3.3. Object Detection on COCO
- RepLKNets are used as the backbone of FCOS and Cascade R-CNN.
RepLKNets outperform ResNeXt-101-64×4d by up to 4.4 mAP while having fewer parameters and lower FLOPs.
- (The results may be further improved with the advanced techniques like HTC [12], HTC++ [61], Soft-NMS [7] or a 6× (72-epoch) schedule.)
Compared to Swin, RepLKNets achieve higher or comparable mAP with fewer parameters and lower FLOPs. Notably, RepLKNet-XL achieves an mAP of 55.5, which demonstrates the scalability again.
3.4. Further Study
- The contribution of each pixel of the input image to the central point of the feature map produced by the last layer is shown above.
- The high-contribution pixels of ResNet-101 gather around the central point, but the outer points have very low contributions, indicating a limited ERF. ResNet-152 shows a similar pattern, suggesting that more 3×3 layers do not significantly increase the ERF.
The high-contribution pixels of RepLKNet-13 are more evenly distributed, suggesting RepLKNet-13 attends to more outer pixels.
- The high-contribution area ratio r is measured, i.e., the area ratio of the minimum rectangle that covers the contribution scores over a given threshold t.
The area ratio of RepLKNet-31 is 98.6%, which means most of the pixels contribute considerably to the final predictions.
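The paper's exact measurement procedure is not reproduced here; the following NumPy sketch assumes one plausible reading, where t is the fraction of the total contribution that a centered square must cover:

```python
import numpy as np

def high_contribution_area_ratio(scores, t):
    """Grow a centered square until it covers a fraction t of the total
    contribution, then report its area relative to the whole map."""
    h, w = scores.shape
    total = scores.sum()
    cy, cx = h // 2, w // 2
    for half in range(max(h, w)):
        patch = scores[max(cy - half, 0):cy + half + 1,
                       max(cx - half, 0):cx + half + 1]
        if patch.sum() >= t * total:
            return patch.size / scores.size
    return 1.0
```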
RepLKNet has a higher shape bias than Swin, and a higher shape bias is closely related to a larger ERF.
Replacing the 7×7 convolutions in ConvNeXt with kernels as large as 31×31 brings significant improvements, e.g., ConvNeXt-Tiny + large kernel > ConvNeXt-Small, and ConvNeXt-Small + large kernel > ConvNeXt-Base.