Brief Review — More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity

SLaK, with up to 61x61 convolutions

Sik-Ho Tsang
Oct 1, 2024

More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity (SLaK), by University of Texas at Austin, Eindhoven University of Technology, University of Twente, University of Jyväskylä, and University of Luxembourg
2023 ICLR, Over 150 Citations (Sik-Ho Tsang @ Medium)


  • In this paper, the authors explore the possibility of training extremely large convolutions, beyond 31×31.
  • The study ends up with a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale kernels up to 61×61.
  • Finally, Sparse Large Kernel Network (SLaK) is proposed, which is a pure CNN architecture equipped with sparse factorized 51×51 kernels that can perform on par with or better than state-of-the-art hierarchical Transformers.

Outline

  1. Sparse Large Kernel Network (SLaK)
  2. Results

1. Sparse Large Kernel Network (SLaK)

1.1. Prior ConvNeXt & RepLKNet

(a) ConvNeXt, (b) RepLKNet, (c) SLaK
Enlarging kernel size

As expected, naively enlarging the kernel size from 7×7 to 31×31 decreases performance, whereas RepLKNet overcomes this problem and improves accuracy by 0.5%. Unfortunately, this positive trend does not continue when the kernel size is further increased to 51×51.

  • One plausible explanation lies in the stem cell: ConvNeXt performs a 4× downsampling of the input images, so a 51×51 kernel is already roughly a global convolution on the resulting 56×56 feature maps of a typical 224×224 ImageNet input. Yet, as prior work has shown, well-designed local attention usually outperforms global attention.

Introducing locality while preserving the ability to capture global relations is one solution.
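The downsampling arithmetic above can be checked in a few lines (a back-of-envelope sketch; the 224×224 input and the 4× stem stride come from the paper):

```python
# Spatial size of the feature map after ConvNeXt's 4x-downsampling stem.
input_size = 224            # typical ImageNet resolution
stem_stride = 4             # 4x4 conv with stride 4
feature_size = input_size // stem_stride
print(feature_size)         # 56

# A 51x51 kernel covers over 90% of each side of the 56x56 map,
# i.e. it is effectively a global convolution at this resolution.
coverage = 51 / feature_size
print(round(coverage, 2))   # 0.91
```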

1.2. Kernel Decomposition (Step 1)

Two-Step Recipe
  • Decomposing a large kernel into two rectangular, parallel kernels smoothly scales the kernel size up to 61×61.

As in (c) of the first figure, the idea is to approximate the large M×M kernel with a combination of two parallel rectangular convolutions whose kernel sizes are M×N and N×M. Also, following RepLKNet, a 5×5 layer is kept in parallel with the large kernels, and their outputs are summed after a batch norm layer.

  • This decomposition balances between capturing long-range dependencies and extracting local detail features (with its shorter edge).
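A minimal single-channel NumPy sketch of this parallel decomposition (M=13, N=5 are chosen here just to keep the example fast; SLaK uses up to M=51, and the batch norm after each branch is omitted):

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 2D cross-correlation with 'same' zero padding (odd kernel sides)."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
M, N = 13, 5
x = rng.standard_normal((32, 32))

# Two parallel rectangular kernels approximate one MxM kernel...
k_tall = rng.standard_normal((M, N))
k_wide = rng.standard_normal((N, M))
# ...plus the small 5x5 branch kept in parallel, as in RepLKNet.
k_small = rng.standard_normal((5, 5))

y = conv2d_same(x, k_tall) + conv2d_same(x, k_wide) + conv2d_same(x, k_small)
print(y.shape)                          # (32, 32)

# Parameter count of the decomposition vs. a dense MxM kernel:
print(2 * M * N + 5 * 5, "vs", M * M)   # 155 vs 169
```

The gap widens rapidly with M: at M=51 the decomposition needs 2·51·5 + 25 = 535 weights versus 2601 for the dense 51×51 kernel.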

1.3. Sparse Group (Step 2)

Dynamic Sparsity
  • The dense convolutions are first replaced with sparse convolutions, where the sparse kernels are randomly constructed based on the layer-wise sparsity ratio (40%) of SNIP.
  • Then, the model width is expanded by 1.3× while keeping the parameter count and FLOPs roughly the same as the dense model.
Sparse Groups

The performance consistently increases with kernel size, up to 51×51.
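The "expand while sparsifying" trade-off can be sanity-checked with rough parameter arithmetic (a simplification that treats each layer's parameter count as quadratic in width, which holds for conv and linear layers but ignores biases and norm layers):

```python
# Dense baseline parameter count (arbitrary unit).
dense_params = 1.0

sparsity = 0.40       # layer-wise sparsity ratio from SNIP
width_factor = 1.3    # model width expansion

# Widening by 1.3x scales each weight tensor by ~1.3^2; keeping only
# 60% of the weights brings the total back near the dense budget.
sparse_params = dense_params * width_factor ** 2 * (1 - sparsity)
print(round(sparse_params, 3))  # 1.014
```

So the widened sparse model carries roughly the same parameter budget (~1.4% more) as the dense one, consistent with the "roughly the same parameters and FLOPs" claim above.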

1.4. SLaK Model

  • SLaK is built based on the architecture of ConvNeXt. The design of the stage compute ratio and the stem cell are inherited from ConvNeXt.
  • The number of blocks in each stage is [3, 3, 9, 3] for SLaK-T and [3, 3, 27, 3] for SLaK-S/B. The stem cell is simply a convolutional layer with 4×4 kernels and stride 4.
  • The kernel size of ConvNeXt is first directly increased to [51, 49, 47, 13] for the four stages, and each M×M kernel is replaced with a combination of M×5 and 5×M kernels.
  • Adding a batch norm layer directly after each decomposed kernel, before summing the outputs up, is crucial.
  • Finally, the whole network is further sparsified and the width of stages is expanded by 1.3×, ending up with SLaK.
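Putting the recipe together, the per-stage configuration can be sketched as plain data (the depths and kernel sizes are from the paper; the helper name `decompose` is just illustrative):

```python
# SLaK-T stage configuration, following the recipe above.
depths = [3, 3, 9, 3]            # blocks per stage (SLaK-S/B use [3, 3, 27, 3])
kernel_sizes = [51, 49, 47, 13]  # the MxM kernel targeted at each stage
N = 5                            # shorter edge of each rectangular kernel

def decompose(m, n=N):
    """Replace an m x m kernel with parallel m x n and n x m kernels."""
    return [(m, n), (n, m)]

stages = [
    {"blocks": d, "kernels": decompose(m)}
    for d, m in zip(depths, kernel_sizes)
]
print(stages[0])  # {'blocks': 3, 'kernels': [(51, 5), (5, 51)]}
```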

2. Results

2.1. ImageNet-1K

ImageNet-1K

With similar model sizes and FLOPs, SLaK outperforms the existing convolutional models such as ResNet, ResNeXt, RepLKNet, and ConvNeXt.

Without using any complex attention modules and patch embedding, SLaK is able to achieve higher accuracy than the state-of-the-art Transformers, e.g., Swin Transformer and Pyramid Vision Transformer (PVT).

2.2. Downstream Tasks

ADE20K

A very clear trend is observed: the performance increases with the kernel size.

SLaK-T with larger kernels (51×51) further brings 1.2% mIoU improvement over ConvNeXt-T (RepLKNet), surpassing the performance of ConvNeXt-S.

PASCAL VOC 2007

ConvNeXt-T with 31×31 kernels achieves 0.7% higher mean Average Precision (mAP) than with 7×7 kernels, and SLaK-T with 51×51 kernels brings a further 1.4% mAP improvement, highlighting the crucial role of extremely large kernels in downstream vision tasks.

MS COCO

The performance consistently improves with the increase of kernel size and the 51×51 kernel SLaK outperforms smaller-kernel models.

2.3. Analysis

Effective receptive field (ERF)

In comparison with ConvNeXt and RepLKNet, high-contribution pixels of SLaK spread in a much larger ERF.

The proposed methods would require fewer FLOPs and parameters, compared to full-kernel scaling with 51×51 kernels.

The sparse decomposed kernels yield more than 4× real inference speedup over directly using vanilla large kernels.
