Review — ConvNeXt: A ConvNet for the 2020s

ConvNeXt, ConvNet Without Any Self-Attention, Outperforms Swin Transformer

ImageNet-1K classification results for ConvNets (purple) and Vision Transformers (orange)
  • Vision Transformers became dominant in 2021, and hierarchical Transformers (e.g., Swin Transformers) reintroduced several ConvNet priors, making Transformers practically viable as generic vision backbones.
  • In this paper, the authors gradually “modernize” a standard ResNet toward the design of a Vision Transformer and identify several key components along the way: Training Techniques, Macro Design, ResNeXt-ify, Inverted Bottleneck, Large Kernel, and Micro Design.
  • Finally, a network family, ConvNeXt, is formed.

Outline

  1. Training Techniques
  2. Macro Design
  3. ResNeXt-ify
  4. Inverted Bottleneck
  5. Large Kernel Size
  6. Micro Design
  7. ConvNeXt
  8. Experimental Results

1. Training Techniques

ResNet With Better Training Recipe
  • Starting Point: ResNet-50 is used as the starting point.
  • Baseline: A training recipe similar to the one used for Vision Transformers is applied to the original ResNet-50, which by itself lifts top-1 accuracy from 76.1% to 78.8% (see the sketch below).
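
As a rough sketch, the kind of recipe involved (technique names follow the paper; the dict format and the numeric values marked “illustrative” are assumptions of this example, not the paper’s exact hyper-parameters):

```python
# Vision-Transformer-style training recipe applied to ResNet-50 (sketch).
# Technique names follow the paper; numeric values are illustrative only.
modern_training_recipe = {
    "optimizer": "AdamW",          # instead of SGD with momentum
    "epochs": 300,                 # extended from the original ~90-epoch schedule
    "augmentation": ["Mixup", "CutMix", "RandAugment", "RandomErasing"],
    "regularization": ["StochasticDepth", "LabelSmoothing"],
    "base_lr": 4e-3,               # illustrative
    "weight_decay": 0.05,          # illustrative
}
```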

2. Macro Design

ResNet With Better Training Recipe, Macro Design
  • There are two interesting design considerations: the stage compute ratio, and the “stem cell” structure.

2.1. Changing Stage Compute Ratio

  • ResNet: The original design of the computation distribution across stages in ResNet was largely empirical. The heavy “res4” stage was meant to be compatible with downstream tasks like object detection, where a detector head operates on the 14×14 feature plane.
  • Transformer: Swin-T, on the other hand, followed the same principle but with a slightly different stage compute ratio of 1:1:3:1. For larger Swin Transformers, the ratio is 1:1:9:1.
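
Concretely, the number of blocks per stage in ResNet-50 is adjusted from (3, 4, 6, 3) to (3, 3, 9, 3), which also aligns the model's FLOPs with Swin-T and improves accuracy from 78.8% to 79.4%. A minimal sketch (the helper function is illustrative):

```python
# Blocks per stage: ResNet-50 vs. the adjusted ratio adopted for ConvNeXt-T.
resnet50_depths = (3, 4, 6, 3)    # original, largely empirical distribution
convnext_t_depths = (3, 3, 9, 3)  # follows Swin-T's ~1:1:3:1 stage compute ratio

def stage_ratio(depths):
    """Per-stage depth relative to the first stage."""
    return tuple(d / depths[0] for d in depths)

print(stage_ratio(convnext_t_depths))  # (1.0, 1.0, 3.0, 1.0)
```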

2.2. Changing Stem to “Patchify”

  • ResNet: The stem cell in a standard ResNet contains a 7×7 convolution layer with stride 2, followed by a max pool, which results in a 4× downsampling of the input images.
  • Transformer: In Vision Transformers, a more aggressive “patchify” strategy is used as the stem cell, which corresponds to a large kernel size (e.g. kernel size = 14 or 16) and non-overlapping convolution. Swin Transformer uses a similar “patchify” layer, but with a smaller patch size of 4 to accommodate the architecture’s multi-stage design.
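
A sketch of the two stems in PyTorch (the module definitions are this example’s assumption; the 96-channel width matches Swin-T/ConvNeXt-T):

```python
import torch
import torch.nn as nn

# ResNet stem: 7x7 stride-2 conv + 3x3 stride-2 max pool -> overall 4x downsampling.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: non-overlapping 4x4 conv with stride 4, as in Swin-T / ConvNeXt-T.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```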

3. ResNeXt-ify

ResNet With Better Training Recipe, Macro Design, ResNeXt-ify
  • ResNeXt: At a high level, ResNeXt’s guiding principle is to “use more groups, expand width”. ResNeXt employs grouped convolution for the 3×3 conv layer in a bottleneck block.
  • MobileNet or Xception: Here, depthwise convolution (as used in MobileNet and Xception), a special case of grouped convolution where the number of groups equals the number of channels, is adopted. Depthwise convolution is similar to the weighted sum operation in self-attention: it operates on a per-channel basis, mixing information only in the spatial dimension. The combination of depthwise convs and 1×1 convs leads to a separation of spatial and channel mixing, a property shared by Vision Transformers. The network width is also increased from 64 to 96 channels, matching Swin-T (see the sketch below).
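
A sketch of the layers involved (PyTorch modules are illustrative):

```python
import torch.nn as nn

dim = 96  # width increased from ResNet's 64 to Swin-T's 96

# ResNeXt-style grouped conv: channels split into a fixed number of groups (e.g. 32).
grouped_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=32)

# Depthwise conv: groups == channels, so each channel is mixed only spatially,
# analogous to the per-channel weighted-sum view of self-attention.
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

# 1x1 (pointwise) convs handle channel mixing, so spatial and channel mixing
# are separated, as in a Transformer block.
pointwise_conv = nn.Conv2d(dim, dim, kernel_size=1)
```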

4. Inverted Bottleneck

Block modifications and resulting specifications. (a) is a ResNeXt block; in (b) an inverted bottleneck block is created
ResNet With Better Training Recipe, Macro Design, ResNeXt-ify, Inverted Bottleneck
  • Transformer: One important design in every Transformer block is that it creates an inverted bottleneck, i.e., the hidden dimension of the MLP block is four times wider than the input dimension.
  • In the ResNet-200/Swin-B regime, this step brings an even larger gain (81.9% to 82.6%) while also reducing FLOPs (see the sketch below).
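
A sketch of the structural change, from a wide → narrow → wide bottleneck to a narrow → wide → narrow inverted bottleneck (layer widths follow the 4× expansion; normalization and activation layers are omitted for brevity):

```python
import torch.nn as nn

dim = 96

# (a) ResNeXt-style bottleneck: 1x1 reduce -> depthwise 3x3 -> 1x1 expand.
bottleneck = nn.Sequential(
    nn.Conv2d(4 * dim, dim, kernel_size=1),
    nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
    nn.Conv2d(dim, 4 * dim, kernel_size=1),
)

# (b) Inverted bottleneck: 1x1 expand -> depthwise 3x3 -> 1x1 reduce,
# so the hidden width is 4x the block's input/output width,
# matching the MLP expansion ratio in Transformer blocks.
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),
    nn.Conv2d(4 * dim, dim, kernel_size=1),
)
```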

5. Large Kernel Size

(c) The position of the spatial depthwise conv layer is moved up
ResNet With Better Training Recipe, Macro Design, ResNeXt-ify, Inverted Bottleneck, Large Kernel Size
  • ConvNet: ConvNets such as VGGNet stack small-kernel (3×3) conv layers, which have efficient hardware implementations on modern GPUs.
  • Transformer: Although Swin Transformers reintroduced the local window to the self-attention block, the window size is at least 7×7, significantly larger than the ResNe(X)t kernel size of 3×3.

5.1. Moving Up Depthwise Conv Layer
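
  • Before enlarging the kernel, the depthwise conv layer is moved to the top of the block (figure (c) above). This mirrors the Transformer block, where the heavy spatial-mixing module (MSA) is placed before the lighter MLP layers: the complex, less efficient depthwise conv now operates with fewer channels, while the efficient, dense 1×1 layers do the heavy lifting.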

5.2. Increasing the Kernel Size
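
  • With the depthwise conv layer moved up, larger kernel sizes can be adopted efficiently. The paper experiments with several kernel sizes and finds the benefit saturates at 7×7, which ConvNeXt adopts, bringing accuracy from 79.9% to 80.6% with FLOPs roughly unchanged.

A sketch of the resulting block structure at this stage (PyTorch, illustrative; normalization and activation placement is refined in the next section):

```python
import torch.nn as nn

dim = 96

# (c) Depthwise conv moved to the top of the block, now with a 7x7 kernel:
# spatial mixing first (like MSA in a Transformer), then the 1x1 "MLP" layers.
block_body = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # spatial mixing
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # channel mixing: expand
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # channel mixing: reduce
)
```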

6. Micro Design

Block Designs for a ResNet, a Swin Transformer, and a ConvNeXt
ResNet With Better Training Recipe, Macro Design, ResNeXt-ify, Inverted Bottleneck, Large Kernel Size, Micro Design

6.1. Replacing ReLU with GELU

  • GELU, which can be thought of as a smoother variant of ReLU, is used in advanced Transformers, including Google’s BERT, OpenAI’s GPT-2, and, most recently, ViTs. Replacing ReLU with GELU in ConvNeXt leaves accuracy unchanged.

6.2. Fewer Activation Functions

  • Transformer: There is only one activation function in the MLP block. Accordingly, all GELU layers are removed from the ConvNeXt residual block except for one between the two 1×1 layers, replicating the style of a Transformer block.

6.3. Fewer Normalization Layers

  • Transformer blocks usually have fewer normalization layers as well. Following this, two of the BatchNorm layers are removed, leaving only one norm layer before the 1×1 conv layers.

6.4. Substituting BN with LN

  • Directly substituting LN for BN in the original ResNet results in suboptimal performance. However, with all of the modernizations above in place, the ConvNeXt model has no difficulty training with LN in each residual block and obtains slightly better accuracy (see the sketch below).
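
One practical detail: PyTorch’s nn.LayerNorm normalizes over the trailing dimension(s), while conv feature maps are laid out as (N, C, H, W). A common way to apply LN over the channel dimension at every spatial location is sketched below (the LayerNorm2d helper is this example’s assumption, not code from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dim for (N, C, H, W) tensors."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1)  # (N, H, W, C)
        x = F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2)  # back to (N, C, H, W)

ln = LayerNorm2d(96)
y = ln(torch.randn(2, 96, 56, 56))
```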

6.5. Separate Downsampling Layers

  • ResNet: The spatial downsampling is achieved by the residual block at the start of each stage, using a 3×3 conv with stride 2 (and a 1×1 conv with stride 2 at the shortcut connection).
  • Swin Transformers: A separate downsampling layer is added between stages (see the sketch below).
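
In ConvNeXt, downsampling between stages is likewise done by a separate 2×2 conv with stride 2, with a normalization layer added beforehand to keep training stable. A sketch (assuming the LayerNorm2d helper from the previous sketch):

```python
import torch.nn as nn

def downsample_layer(in_dim: int, out_dim: int) -> nn.Sequential:
    """Separate downsampling between stages: LN followed by a 2x2 stride-2 conv."""
    return nn.Sequential(
        LayerNorm2d(in_dim),                                  # channels-first LN (see above)
        nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2),
    )
```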

7. ConvNeXt

7.1. All Improvements Together

A standard ConvNet (ResNet) is modernized towards the design of a hierarchical Vision Transformer (Swin), without introducing any attention-based modules.
  • The figure above summarizes all of the improvements; the final model surpasses the Swin Transformer (a block-level sketch follows below).
  • None of these designs is novel; they have all been researched separately, but not collectively, over the last decade.
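
Putting the block-level changes together, a ConvNeXt block is roughly: 7×7 depthwise conv → LayerNorm → 1×1 expand → GELU → 1×1 reduce, wrapped in a residual connection. A minimal sketch (details such as layer scale and stochastic depth in the official implementation are omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNeXtBlock(nn.Module):
    """Sketch of a ConvNeXt block (layer scale / drop path omitted)."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # single norm layer, applied channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv expressed as Linear on (..., C)
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)                         # spatial mixing (7x7 depthwise)
        x = x.permute(0, 2, 3, 1)                  # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv2(F.gelu(self.pwconv1(x)))  # single GELU between the 1x1 layers
        x = x.permute(0, 3, 1, 2)                  # back to (N, C, H, W)
        return shortcut + x

out = ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56))
```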

7.2. Detailed Architecture

Detailed Architecture Specifications for ResNet-50, ConvNeXt-T and Swin-T

7.3. ConvNeXt Variants

Different ConvNeXt Variants
  • Different ConvNeXt variants, ConvNeXt-T/S/B/L, are designed to have complexities similar to Swin-T/S/B/L. ConvNeXt-T/B are the end products of the “modernizing” procedure on the ResNet-50/200 regimes, respectively.
  • In addition, a larger ConvNeXt-XL is built to further test the scalability of ConvNeXt. The variants only differ in the number of channels C, and the number of blocks B in each stage. Following both ResNets and Swin Transformers, the number of channels doubles at each new stage.
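
The per-variant settings, as reported in the paper (channel widths C and block counts B for the four stages):

```python
# Channel widths C and blocks per stage B for each ConvNeXt variant.
convnext_variants = {
    "ConvNeXt-T":  {"C": (96, 192, 384, 768),    "B": (3, 3, 9, 3)},
    "ConvNeXt-S":  {"C": (96, 192, 384, 768),    "B": (3, 3, 27, 3)},
    "ConvNeXt-B":  {"C": (128, 256, 512, 1024),  "B": (3, 3, 27, 3)},
    "ConvNeXt-L":  {"C": (192, 384, 768, 1536),  "B": (3, 3, 27, 3)},
    "ConvNeXt-XL": {"C": (256, 512, 1024, 2048), "B": (3, 3, 27, 3)},
}

# Following ResNets and Swin Transformers, the channel count doubles at each new stage.
assert all(c2 == 2 * c1
           for cfg in convnext_variants.values()
           for c1, c2 in zip(cfg["C"], cfg["C"][1:]))
```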

8. Experimental Results

8.1. ImageNet

Classification Accuracy on ImageNet-1K

8.1.1. ImageNet-1K (Upper Part of Table)

  • ConvNeXt competes favorably with two strong ConvNet baselines (RegNet and EfficientNet) in terms of the accuracy-computation trade-off, as well as the inference throughputs.
  • ConvNeXt also outperforms Swin Transformer of similar complexities across the board, sometimes with a substantial margin (e.g. 0.8% for ConvNeXt-T). ConvNeXts also enjoy improved throughput compared to Swin Transformers.
  • A highlight from the results is ConvNeXt-B at 384²: it outperforms Swin-B by 0.6% (85.1% vs. 84.5%), while also delivering 12.5% higher inference throughput (95.7 vs. 85.1 images/s).
  • An improved result of 85.5% is obtained when further scaling to ConvNeXt-L.

8.1.2. Pretraining Using ImageNet-22K on ImageNet-1K (Lower Part of Table)

  • ConvNeXts still perform on par or better than similarly-sized Swin Transformers, with slightly higher throughput.
  • Additionally, the ConvNeXt-XL model achieves an accuracy of 87.8%, a decent improvement over ConvNeXt-L at 384², demonstrating that ConvNeXts are scalable architectures.
  • ConvNeXt is able to outperform EfficientNetV2, further demonstrating the importance of large-scale training.

8.2. Isotropic ConvNeXt vs. ViT

Comparing isotropic ConvNeXt and ViT
  • This part examines whether the ConvNeXt block design generalizes to ViT-style isotropic architectures, which have no downsampling layers and keep the same feature resolution (e.g. 14×14) at all depths. The isotropic ConvNeXt performs generally on par with ViT.

8.3. Object Detection and Segmentation on COCO

COCO object detection and segmentation results using Mask R-CNN and Cascade Mask R-CNN
  • Mask R-CNN and Cascade Mask R-CNN are fine-tuned on the COCO dataset with ConvNeXt backbones. Multi-scale training, the AdamW optimizer, and a 3× schedule are used.
  • Training Cascade Mask R-CNN with a ConvNeXt-B backbone consumes 17.4GB of peak memory with a per-GPU batch size of 2, while the reference number for Swin-B is 18.5GB.
  • When scaled up to bigger models (ConvNeXt-B/L/XL) pre-trained on ImageNet-22K, in many cases ConvNeXt is significantly better (e.g. +1.0 AP) than Swin Transformers in terms of box and mask AP.

8.4. Semantic Segmentation on ADE20K

ADE20K validation results using UperNet
  • ConvNeXt backbones are evaluated on the ADE20K semantic segmentation task with UperNet. ConvNeXt models achieve competitive performance across different model capacities.
