Review — Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs

RepLKNet: Using Up To 31×31 Large Kernel Convolutions

Sik-Ho Tsang
7 min read · Mar 10, 2023

Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs,
RepLKNet, by Tsinghua University, MEGVII Technology, and Aberystwyth University,
2022 CVPR, Over 130 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT, Swin Transformer


  • Five model design guidelines are suggested, e.g., applying re-parameterized large depth-wise convolutions, to design efficient high-performance large-kernel CNNs.
  • Following the guidelines, RepLKNet, a pure CNN architecture, is proposed whose kernel size is as large as 31×31, in contrast to commonly used 3×3.
  • RepLKNet greatly closes the performance gap between CNNs and ViTs, e.g., achieving results comparable or superior to Swin Transformer.

Outline

  1. RepLKNet: Five Guidelines
  2. RepLKNet Model Architecture
  3. Results

1. RepLKNet: Five Guidelines

1.1. Guideline 1: Large Depth-Wise (DW) Convolutions can be Efficient in Practice

  • Large-kernel convolutions are computationally expensive. The drawback can be greatly overcome by applying depth-wise (DW) convolutions (MobileNetV1).
  • For example, in the proposed RepLKNet, increasing the kernel sizes in different stages from [3, 3, 3, 3] to [31, 29, 27, 13] only increases the FLOPs and number of parameters by 18.6% and 10.4% respectively, which is acceptable. The remaining 1×1 convolutions actually dominate most of the complexity.
Inference speed of a stack of 24-layer depth-wise convolutions with various kernel sizes and resolutions on a single GTX 2080Ti GPU.

Off-the-shelf deep learning tools (such as PyTorch) support large DW convolutions poorly.

  • A block-wise (inverse) implicit GEMM algorithm is a better choice.
  • The implementation has been integrated into the open-sourced framework MegEngine. The authors also release a PyTorch implementation that is far more efficient than PyTorch's native DW convolution.

With the optimization, the latency contribution of DW convolutions in RepLKNet reduces from 49.5% to 12.3%, which is roughly in proportion to the FLOPs occupation.
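
To make Guideline 1 concrete, here is a minimal PyTorch sketch (the channel count is illustrative, not a RepLKNet setting) contrasting the parameter cost of a dense 31×31 convolution with its depth-wise counterpart:

```python
import torch.nn as nn

channels = 256

# Dense 31x31 convolution: every output channel mixes all input channels.
dense = nn.Conv2d(channels, channels, kernel_size=31, padding=15)

# Depth-wise 31x31 convolution: groups == channels, one 31x31 filter per channel.
dw = nn.Conv2d(channels, channels, kernel_size=31, padding=15, groups=channels)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"dense: {count(dense):,} params")  # ~63.0M (256*256*31*31 + 256)
print(f"dw:    {count(dw):,} params")     # ~0.25M (256*31*31 + 256)
```

In a full block the depth-wise layer is followed by 1×1 convolutions for channel mixing, which is why the 1×1 convolutions end up dominating the complexity.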

1.2. Guideline 2: Identity Shortcut is Vital, Especially for Networks With Very Large Kernels

Results of different kernel sizes in normal/shortcut-free MobileNetV2.
  • All the DW 3×3 layers are simply replaced with 13×13. Large kernels improve the accuracy of MobileNetV2 with shortcuts by 0.77%. However, without shortcuts, large kernels reduce the accuracy to only 53.98%.

Shortcuts make the model an implicit ensemble composed of numerous models with different receptive fields (RFs), so it can benefit from a much larger maximum RF while not losing the ability to capture small-scale patterns.
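
A minimal sketch of the comparison, assuming a simple DW conv + BN + ReLU block (an illustration, not the exact MobileNetV2 block):

```python
import torch
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):
    """Depth-wise large-kernel block with an optional identity shortcut."""
    def __init__(self, channels, kernel_size=13, use_shortcut=True):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.bn(self.dw(x)))
        # With the shortcut, the network behaves like an implicit ensemble of
        # paths with different receptive fields; without it, very large kernels
        # become hard to optimize (53.98% accuracy in the table above).
        return x + out if self.use_shortcut else out
```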

1.3. Guideline 3: Re-Parameterizing (RepVGG) With Small Kernels Helps to Make Up for the Optimization Issue

An example of re-parameterizing a small kernel (e.g., 3×3) into a large one (e.g., 7×7).
Results of 3×3 re-parameterization on MobileNetV2 with various kernel sizes.
  • The 3×3 layers of MobileNetV2 are replaced by 9×9 and 13×13 kernels respectively, optionally with the Structural Re-parameterization (RepVGG) methodology.
  • Specifically, a 3×3 layer is constructed parallel to the large one, then their outputs are added up after Batch normalization (BN) layers, as shown in the above figure.

After training, the small kernel as well as BN parameters are merged into the large kernel, so the resultant model is equivalent to the model for training but no longer has small kernels.

The above table shows directly increasing the kernel size from 9 to 13 reduces the accuracy, while re-parameterization addresses the issue.
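
A minimal sketch of the merge step after training, assuming each of the two parallel DW branches (a large kernel and a 3×3) is created with bias=False and followed by its own BN; the helper names here are mine, not from the official code:

```python
import torch.nn.functional as F

def fuse_conv_bn(conv, bn):
    """Fold a BN layer into the preceding (bias-free) conv, returning (weight, bias)."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = (bn.weight / std).reshape(-1, 1, 1, 1)
    return conv.weight * scale, bn.bias - bn.running_mean * bn.weight / std

def merge_small_into_large(large_conv, large_bn, small_conv, small_bn):
    """Re-parameterize: large-kernel branch + parallel small-kernel branch -> one large kernel."""
    w_l, b_l = fuse_conv_bn(large_conv, large_bn)
    w_s, b_s = fuse_conv_bn(small_conv, small_bn)
    # Zero-pad the small kernel to the large kernel size, then add the two branches.
    pad = (large_conv.kernel_size[0] - small_conv.kernel_size[0]) // 2
    return w_l + F.pad(w_s, [pad, pad, pad, pad]), b_l + b_s
```

The returned weight and bias can then be loaded into a single large-kernel DW conv (with bias enabled) for inference, whose output equals the sum of the two trained branches.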

1.4. Guideline 4: Large Convolutions Boost Downstream Tasks Much More Than ImageNet Classification

Results of various kernel sizes in the last stage of MobileNetV2.
  • Increasing the kernel size of MobileNetV2 from 3×3 to 9×9 improves the ImageNet accuracy by 1.33% but the Cityscapes mIoU by 3.99%, indicating that a large effective receptive field (ERF) matters more for semantic segmentation tasks.
  • One reason is that large kernel design significantly increases the ERF.
  • Another is that large kernel design contributes more shape bias to the network. Humans recognize objects mainly based on shape cues rather than texture, so a model with a stronger shape bias may transfer better to downstream tasks.

1.5. Guideline 5: Large Kernel (e.g., 13×13) is Useful Even on Small Feature Maps (e.g., 7×7)

Illustration of convolution with a small feature map and a large kernel.
  • DW convolutions in the last stage of MobileNetV2 are enlarged to 7×7 or 13×13.
  • Although convolutions in the last stage already involve a very large receptive field, further increasing the kernel sizes still leads to performance improvements.

As illustrated in the above figure, two outputs at adjacent spatial locations share only a fraction of the kernel weights. Large kernels not only help to learn the relative positions between concepts, but also encode the absolute position information due to padding effect.
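
A quick sanity check of Guideline 5 in PyTorch (the channel count is illustrative): a 13×13 depth-wise kernel applied to a 7×7 feature map still yields a 7×7 output thanks to padding, and adjacent output positions use different slices of the kernel weights:

```python
import torch
import torch.nn as nn

channels = 512
x = torch.randn(1, channels, 7, 7)   # small last-stage feature map
dw = nn.Conv2d(channels, channels, kernel_size=13, padding=6, groups=channels)
print(dw(x).shape)                   # torch.Size([1, 512, 7, 7])
```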

2. RepLKNet Model Architecture

RepLKNet comprises Stem, Stages and Transitions.

2.1. Stem

  • After the first 3×3 conv layer with 2× downsampling, a DW 3×3 layer is arranged to capture low-level patterns, followed by a 1×1 conv and another DW 3×3 layer for downsampling, as sketched below.
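
A sketch of the stem as described; the channel width, the conv-BN-activation ordering, and the choice of ReLU are assumptions for illustration:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel_size, stride, groups=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

def stem(in_ch=3, width=128):
    return nn.Sequential(
        conv_bn_relu(in_ch, width, 3, 2),                 # 3x3 conv, 2x downsampling
        conv_bn_relu(width, width, 3, 1, groups=width),   # DW 3x3, low-level patterns
        conv_bn_relu(width, width, 1, 1),                 # 1x1 conv
        conv_bn_relu(width, width, 3, 2, groups=width),   # DW 3x3, 2x downsampling
    )
```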

2.2. Stages 1–4

  • Each contains several RepLK Blocks, which use shortcuts (Guideline 2) and DW large kernels (Guideline 1).
  • 1×1 conv is used before and after DW conv as a common practice.
  • Each DW large conv uses a 5×5 kernel for re-parameterization (Guideline 3).
  • 1×1 layers are used to increase the depth. Inspired by the feed-forward network (FFN) in Transformer layers, a similar CNN-style block is used, composed of a shortcut, BN, two 1×1 layers, and GELU, so it is referred to as the ConvFFN Block.
  • As a common practice, the number of internal channels of the ConvFFN Block is 4× the input. Simply following ViT and Swin Transformer, which interleave attention and FFN blocks, a ConvFFN Block is placed after each RepLK Block, as sketched below.
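
A sketch of the RepLK Block and ConvFFN Block described above; the exact placement of BN and activations follows my reading of the paper and may differ from the official code, and the parallel 5×5 re-parameterization branch (Guideline 3) is omitted for brevity:

```python
import torch
import torch.nn as nn

class RepLKBlock(nn.Module):
    """Sketch: 1x1 conv -> DW large-kernel conv -> 1x1 conv, with a shortcut (Guidelines 1 & 2)."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pre_bn = nn.BatchNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels, bias=False)
        self.pw2 = nn.Conv2d(channels, channels, 1, bias=False)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.pw2(self.act(self.dw(self.act(self.pw1(self.pre_bn(x))))))
        return x + out

class ConvFFN(nn.Module):
    """Sketch: shortcut + BN + 1x1 -> GELU -> 1x1, with internal width 4x the input."""
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.bn = nn.BatchNorm2d(channels)
        self.pw1 = nn.Conv2d(channels, hidden, 1)
        self.pw2 = nn.Conv2d(hidden, channels, 1)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw2(self.act(self.pw1(self.bn(x))))
```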

2.3. Transition Blocks

  • Transition blocks are placed between stages, which first increase the channel dimension via 1×1 conv, and then conduct 2× downsampling with DW 3×3 conv.
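
A sketch of a transition block under the same assumptions (BN + ReLU after each conv is my choice for illustration):

```python
import torch.nn as nn

def transition(in_ch, out_ch):
    """1x1 conv to widen channels, then DW 3x3 conv with stride 2 for downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1, groups=out_ch, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )
```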

Each stage has three architectural hyper-parameters: the number of RepLK Blocks B, the channel dimension C, and the kernel size K, so that a RepLKNet architecture is defined by [B1, B2, B3, B4], [C1, C2, C3, C4], [K1, K2, K3, K4].

2.4. Model Variants

RepLKNet with different kernel sizes.

By fixing B=[2, 2, 18, 2] and C=[128, 256, 512, 1024] and varying K, the kernel sizes are casually set as [13, 13, 13, 13], [25, 25, 25, 13], and [31, 29, 27, 13], and the models are referred to as RepLKNet-13/25/31 respectively.

Two small-kernel baselines are also constructed where the kernel sizes are all 3 or 7 (RepLKNet-3/7).

3. Results

3.1. Image Classification on ImageNet

ImageNet results.

With only ImageNet-1K training, RepLKNet-31B reaches 84.8% accuracy, which is 0.3% higher than Swin-B, and runs 43% faster.

Even though RepLKNet-XL has higher FLOPs than Swin-L, it runs faster.

3.2. Semantic Segmentation on Cityscapes & ADE20K

Cityscapes results.
ADE20K results.
  • The pretrained models are used as the backbones of UPerNet.

On Cityscapes, ImageNet-1K-pretrained RepLKNet-31B outperforms Swin-B by a significant margin.

On ADE20K, RepLKNet-31B outperforms Swin-B with both 1K and 22K pretraining, and the margins of single-scale mIoU are particularly significant. Pretrained with the proposed semi-supervised dataset MegData73M, RepLKNet-XL achieves an mIoU of 56.0, which demonstrates the feasibility of scaling RepLKNet up towards large-scale vision applications.

3.3. Object Detection on COCO

Object detection on COCO. FLOPs are computed with 1280×800 inputs.

RepLKNets outperform ResNeXt-101-64×4d by up to 4.4 mAP while having fewer parameters and lower FLOPs.

  • (The results may be further improved with the advanced techniques like HTC [12], HTC++ [61], Soft-NMS [7] or a 6× (72-epoch) schedule.)

Compared to Swin, RepLKNets achieve higher or comparable mAP with fewer parameters and lower FLOPs. Notably, RepLKNet-XL achieves an mAP of 55.5, which demonstrates the scalability again.

3.4. Further Study

The Effective Receptive Field (ERF) of ResNet-101/152 and RepLKNet-13/31 respectively.
  • The figure above shows the contribution of each pixel of the input image to the central point of the feature map produced by the last layer.
  • The high-contribution pixels of ResNet-101 gather around the central point, while the outer points have very low contributions, indicating a limited ERF. ResNet-152 shows a similar pattern, suggesting that adding more 3×3 layers does not significantly increase the ERF.

The high-contribution pixels by RepLKNet-13 are more evenly distributed, suggesting RepLKNet-13 attends to more outer pixels.

Quantitative analysis on the ERF with the high-contribution area ratio r.
  • The high-contribution area ratio r is measured as the area ratio of a minimum rectangle that covers the contribution scores over a given threshold t.

The area ratio of RepLKNet-31 is 98.6%, which means most pixels contribute considerably to the final prediction.

Shape bias of RepLKNet, Swin, and ResNet-152 pretrained on ImageNet-1K or 22K.

RepLKNet has a higher shape bias than Swin Transformer, and a higher shape bias is closely related to a larger ERF.

ConvNeXt with different kernel sizes.

Replacing the 7×7 convolutions in ConvNeXt with kernels as large as 31×31 brings significant improvements, e.g., ConvNeXt-Tiny + large kernel > ConvNeXt-Small, and ConvNeXt-Small + large kernel > ConvNeXt-Base.
