Review — ConvNeXt: A ConvNet for the 2020s

ConvNeXt, a ConvNet Without Any Self-Attention, Outperforms Swin Transformer

Sik-Ho Tsang
9 min read · Jun 7, 2022


ImageNet-1K classification results for ConvNets (Purple) and Vision Transformers (Orange)

A ConvNet for the 2020s
ConvNeXt
, by Facebook AI Research (FAIR) and UC Berkeley
2022 CVPR, Over 80 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Convolutional Neural Network, CNN

  • Vision Transformers (ViTs) have been dominant in 2021. Hierarchical Transformers (e.g., Swin Transformers) reintroduced several ConvNet priors.
  • In this paper, the authors gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components along the way: Training Techniques, Macro Design, ResNeXt-ify, Inverted Bottleneck, Large Kernel Size, and Micro Design.
  • Finally, a network family, ConvNeXt, is formed.

Outline

  1. Training Techniques
  2. Macro Design
  3. ResNeXt-ify
  4. Inverted Bottleneck
  5. Large Kernel Size
  6. Micro Design
  7. ConvNeXt
  8. Experimental Results

1. Training Techniques

ResNet With Better Training Recipe
  • Starting Point: ResNet-50 is the model used as a starting point.
  • Baseline: Training techniques similar to those used to train vision Transformers are used to train the original ResNet-50, obtaining much improved results.

The training recipe is close to DeiT’s and Swin Transformer’s.

This enhanced training recipe increases the performance of the ResNet-50 model from 76.1% to 78.8% (+2.7%), implying that a significant portion of the performance difference between traditional ConvNets and vision Transformers may be due to the training techniques.
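
As a rough illustration only (not the paper's exact recipe; the full recipe also involves augmentations such as MixUp, CutMix and RandAugment, plus regularization like stochastic depth, and the hyperparameter values below are placeholders), a DeiT/Swin-style training setup for a ResNet-50 in PyTorch might look like this:

```python
import torch
import torchvision

# Hypothetical sketch of a "modern" training setup: AdamW, cosine schedule,
# label smoothing. Values are illustrative, not the paper's exact recipe.
model = torchvision.models.resnet50()
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# One dummy training step to show the wiring.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```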

2. Macro Design

With Better Training Recipe, Macro Design
  • There are two interesting design considerations: the stage compute ratio, and the “stem cell” structure.

2.1. Changing Stage Compute Ratio

  • ResNet: The original design of the computation distribution across stages in ResNet was largely empirical. The heavy “res4” stage was meant to be compatible with downstream tasks like object detection, where a detector head operates on the 14×14 feature plane.
  • Swin Transformer: Swin-T, on the other hand, followed the same principle but with a slightly different stage compute ratio of 1:1:3:1. For larger Swin Transformers, the ratio is 1:1:9:1.

Following Swin’s design, the authors adjust the number of blocks in each stage from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3), which also aligns the FLOPs with Swin-T. This improves the model accuracy from 78.8% to 79.4%.
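
Concretely, this step is just a change in the per-stage depth configuration (an illustrative snippet, not library code):

```python
# Per-stage block counts (illustrative constants).
RESNET50_DEPTHS = (3, 4, 6, 3)    # original ResNet-50 distribution
MODERNIZED_DEPTHS = (3, 3, 9, 3)  # roughly 1:1:3:1, aligning FLOPs with Swin-T
```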

2.2. Changing Stem to “Patchify”

  • ResNet: The stem cell in a standard ResNet contains a 7×7 convolution layer with stride 2, followed by a max pool, which results in a 4× downsampling of the input images.
  • ViT: In vision Transformers, a more aggressive “patchify” strategy is used as the stem cell, which corresponds to a large kernel size (e.g. kernel size = 14 or 16) and non-overlapping convolution. Swin Transformer uses a similar “patchify” layer, but with a smaller patch size of 4 to accommodate the architecture’s multi-stage design.

The ResNet-style stem cell is replaced with a patchify layer implemented using a 4×4, stride 4 convolutional layer. The accuracy changes from 79.4% to 79.5%.

This suggests that the stem cell in a ResNet may be substituted with a simpler “patchify” layer as in ViT, resulting in similar performance. The “patchify stem” (4×4 non-overlapping convolution) is used in the network.
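
As a minimal PyTorch sketch (channel counts follow ResNet-50 and ConvNeXt-T; normalization layers are omitted), the two stems compare as follows:

```python
import torch.nn as nn

# ResNet-style stem: 7x7 stride-2 conv followed by a stride-2 max pool
# (4x downsampling overall).
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a single non-overlapping 4x4, stride-4 conv
# (also 4x downsampling).
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)
```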

3. ResNeXt-ify

With Better Training Recipe, Macro Design, ResNeXt-ify
  • ResNeXt: At a high level, ResNeXt’s guiding principle is to “use more groups, expand width”. ResNeXt employs grouped convolution for the 3×3 conv layer in a bottleneck block.
  • MobileNet or Xception: Here, depthwise convolution (as used in MobileNet or Xception), a special case of grouped convolution where the number of groups equals the number of channels, is used. Depthwise convolution is similar to the weighted sum operation in self-attention. The combination of depthwise convs and 1×1 convs leads to a separation of spatial and channel mixing, a property shared by vision Transformers.

Following the strategy proposed in ResNeXt, the network width is increased to the same number of channels as Swin-T’s (from 64 to 96). This brings the network performance to 80.5% with increased FLOPs (5.3G).
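
A minimal sketch of the distinction (the width matches the 96 channels mentioned above; the group count for the ResNeXt-style conv is illustrative):

```python
import torch.nn as nn

dim = 96  # network width, matched to Swin-T

# ResNeXt-style grouped conv: channels are split into a fixed number of groups.
grouped_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=32)

# Depthwise conv: groups == channels, so each channel is filtered independently
# (spatial mixing only), similar in spirit to the per-location weighted sum in
# self-attention.
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

# Channel mixing is then handled separately by 1x1 (pointwise) convs.
pointwise_conv = nn.Conv2d(dim, dim, kernel_size=1)
```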

4. Inverted Bottleneck

Block modifications and resulting specifications: (a) is a ResNeXt block; in (b) an inverted bottleneck block is created
With Better Training Recipe, Macro Design, ResNeXt-ify, Inverted Bottleneck
  • Transformer: One important design in every Transformer block is that it creates an inverted bottleneck, i.e., the hidden dimension of the MLP block is four times wider than the input dimension.

From (a) to (b): The inverted bottleneck design is explored. Despite the increased FLOPs for the depthwise convolution layer, this change reduces the whole network FLOPs to 4.6G. Interestingly, this results in slightly improved performance (80.5% to 80.6%).

  • In the ResNet-200 / Swin-B regime, this step brings an even larger gain (81.9% to 82.6%), also with reduced FLOPs.
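
A minimal sketch of the inverted bottleneck of (b), with normalization and activation layers omitted for brevity (this is a simplified illustration, not the authors' reference code):

```python
import torch.nn as nn

dim, hidden = 96, 384  # 4x expansion, mirroring the Transformer MLP block

# (b) Inverted bottleneck: narrow -> wide -> narrow, with the (wide) depthwise
# conv in the middle. The classic bottleneck of (a) is the opposite:
# wide -> narrow -> wide.
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, hidden, kernel_size=1),                               # expand 1x1
    nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),  # depthwise 3x3
    nn.Conv2d(hidden, dim, kernel_size=1),                               # project 1x1
)
```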

5. Large Kernel Size

(c) The position of the spatial depthwise conv layer is moved up
With Better Training Recipe, Macro Design, ResNeXt-ify, Inverted Bottleneck, Large Kernel Size
  • ConvNet: A ConvNet such as VGGNet stacks small kernel-sized (3×3) conv layers.
  • Swin Transformer: Although Swin Transformers reintroduced the local window to the self-attention block, the window size is at least 7×7, significantly larger than the ResNe(X)t kernel size of 3×3.

5.1. Moving Up the Depthwise Conv Layer

From (b) to (c): To explore large kernels, one prerequisite is to move up the position of the depthwise conv layer. This is a design decision also evident in Transformers: the MSA block is placed prior to the MLP layers. This intermediate step reduces the FLOPs to 4.1G, resulting in a temporary performance degradation to 79.9%.

5.2. Increasing the Kernel Size

Several kernel sizes are experimented with, including 3, 5, 7, 9, and 11. The network’s performance increases from 79.9% (3×3) to 80.6% (7×7). A 7×7 depthwise conv is used in each block.
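
A minimal sketch of the block after steps (c) and 5.2 combined: the depthwise conv sits at the top of the block (operating on the narrow dimension) and uses a 7×7 kernel; normalization and activation layers are again omitted:

```python
import torch.nn as nn

dim, hidden = 96, 384

block_with_large_kernel = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise, moved up
    nn.Conv2d(dim, hidden, kernel_size=1),                      # expand 1x1
    nn.Conv2d(hidden, dim, kernel_size=1),                      # project 1x1
)
```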

6. Micro Design

Block Designs for a ResNet, a Swin Transformer, and a ConvNeXt
With Better Training Recipe, Macro Design, ResNeXt-ify, Inverted Bottleneck, Large Kernel Size, Micro Design

6.1. Replacing ReLU with GELU

  • GELU, which can be thought of as a smoother variant of ReLU, is utilized in the most advanced Transformers, including Google’s BERT and OpenAI’s GPT-2, and, most recently, ViTs.

ReLU can be substituted with GELU, although the accuracy stays unchanged (80.6%).

6.2. Fewer Activation Functions

  • Transformer: There is only one activation function present in the MLP block.

All GELU layers are eliminated from the residual block except for one between the two 1×1 layers, replicating the style of a Transformer block. This improves the result by 0.7% to 81.3%, practically matching the performance of Swin-T. Thus only a single activation is used in each block.

6.3. Fewer Normalization Layers

  • Transformer blocks usually have fewer normalization layers as well.

Here, two BatchNorm (BN) layers are removed, leaving only one BN layer before the conv 1×1 layers. This further boosts the performance to 81.4%, already surpassing Swin-T’s result.

6.4. Substituting BN with LN

  • Directly substituting LN for BN in the original ResNet will result in suboptimal performance.

The proposed ConvNet model does not have any difficulty training with LN; in fact, the performance is slightly better, obtaining an accuracy of 81.5%. From now on, one LayerNorm is used as the choice of normalization in each residual block.
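
One practical detail worth noting (a sketch under the assumption of PyTorch's nn.LayerNorm, which normalizes over the trailing dimensions): to apply LN over the channel dimension of a channels-first feature map, one can permute to channels-last, normalize, and permute back. The helper class below is named for illustration only.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dim of an (N, C, H, W) tensor (illustrative helper)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(dim, eps=eps)

    def forward(self, x):
        # (N, C, H, W) -> (N, H, W, C), normalize over C, then permute back.
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

x = torch.randn(2, 96, 56, 56)
y = LayerNorm2d(96)(x)  # same shape as x
```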

6.5. Separate Downsampling Layers

  • ResNet: The spatial downsampling is achieved by the residual block at the start of each stage, using a 3×3 conv with stride 2 (and a 1×1 conv with stride 2 at the shortcut connection).
  • Swin Transformers: A separate downsampling layer is added between stages.

A similar strategy is explored, in which 2×2 conv layers with stride 2 are used for spatial downsampling. This modification surprisingly leads to diverged training. Adding normalization layers wherever the spatial resolution changes helps stabilize training: one LN layer before each downsampling layer, one after the stem, and one after the final global average pooling. The accuracy increases to 82.0%, significantly exceeding Swin-T’s 81.3%.
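
A minimal sketch of such a separate downsampling layer, reusing the LayerNorm2d helper sketched in section 6.4 (the function name is illustrative):

```python
import torch.nn as nn

def downsample_layer(in_dim, out_dim):
    """Normalization followed by a non-overlapping 2x2, stride-2 conv that halves
    the spatial resolution and (here) doubles the channel count."""
    return nn.Sequential(
        LayerNorm2d(in_dim),  # LayerNorm2d as sketched in section 6.4
        nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2),
    )

stage1_to_2 = downsample_layer(96, 192)  # e.g. between the first two stages of ConvNeXt-T
```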

7. ConvNeXt

7.1. All Improvements Together

A standard ConvNet (ResNet) is modernized towards the design of a hierarchical vision Transformer (Swin Transformer), without introducing any attention-based modules.
  • The above figure summarizes all of the improvements above, with the final model surpassing the Swin Transformer.
  • These designs are not novel; they have all been researched separately, but not collectively, over the last decade.
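
Putting the pieces together, a minimal sketch of the resulting ConvNeXt-style block (layer scale, stochastic depth and other training-time details are omitted; this is a simplified illustration, not the authors' reference implementation):

```python
import torch
import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    """7x7 depthwise conv -> LN -> 1x1 expand -> GELU -> 1x1 project, plus residual."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # spatial mixing
        self.norm = nn.LayerNorm(dim, eps=1e-6)   # single normalization layer
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # 1x1 conv written as a linear layer (channels-last)
        self.act = nn.GELU()                      # single activation
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                         # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # to (N, H, W, C) for LN and the linear layers
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                 # back to (N, C, H, W)
        return shortcut + x

out = ConvNeXtBlockSketch(96)(torch.randn(2, 96, 56, 56))  # shape preserved
```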

7.2. Detailed Architecture

Detailed Architecture Specifications for ResNet-50, ConvNeXt-T and Swin-T

7.3. ConvNeXt Variants

Different ConvNeXt Variants
  • Different ConvNeXt variants, ConvNeXt-T/S/B/L, are designed to be of similar complexities to Swin-T/S/B/L. ConvNeXt-T and ConvNeXt-B are the end products of the “modernizing” procedure on the ResNet-50 and ResNet-200 regimes, respectively.
  • In addition, a larger ConvNeXt-XL is built to further test the scalability of ConvNeXt. The variants only differ in the number of channels C and the number of blocks B in each stage. Following both ResNets and Swin Transformers, the number of channels doubles at each new stage.
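
Written out as a configuration table (the channel and block counts below are the ones reported in the paper; the dictionary itself is just an illustration):

```python
# Per-variant stage widths C and block counts B, as reported in the paper.
CONVNEXT_CONFIGS = {
    "T":  {"channels": (96, 192, 384, 768),    "blocks": (3, 3, 9, 3)},
    "S":  {"channels": (96, 192, 384, 768),    "blocks": (3, 3, 27, 3)},
    "B":  {"channels": (128, 256, 512, 1024),  "blocks": (3, 3, 27, 3)},
    "L":  {"channels": (192, 384, 768, 1536),  "blocks": (3, 3, 27, 3)},
    "XL": {"channels": (256, 512, 1024, 2048), "blocks": (3, 3, 27, 3)},
}
# Note how the number of channels doubles at each new stage.
```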

8. Experimental Results

8.1. ImageNet

Classification Accuracy on ImageNet-1K

8.1.1. ImageNet-1K (Upper Part of Table)

  • ConvNeXt competes favorably with two strong ConvNet baselines (RegNet and EfficientNet) in terms of the accuracy-computation trade-off, as well as inference throughput.
  • ConvNeXt also outperforms Swin Transformers of similar complexities across the board, sometimes by a substantial margin (e.g. 0.8% for ConvNeXt-T). ConvNeXts also enjoy improved throughput compared to Swin Transformers.
  • A highlight from the results: ConvNeXt-B at 384² outperforms Swin-B by 0.6% (85.1% vs. 84.5%), with 12.5% higher inference throughput (95.7 vs. 85.1 images/s).
  • An improved result of 85.5% is obtained when further scaling to ConvNeXt-L.

8.1.2. Pre-training on ImageNet-22K, Fine-tuning on ImageNet-1K (Lower Part of Table)

  • ConvNeXts still perform on par with or better than similarly-sized Swin Transformers, with slightly higher throughput.
  • Additionally, the ConvNeXt-XL model achieves an accuracy of 87.8%, a decent improvement over ConvNeXt-L at 384², demonstrating that ConvNeXts are scalable architectures.
  • With ImageNet-22K pre-training, ConvNeXt is able to outperform EfficientNetV2, further demonstrating the importance of large-scale training.

8.2. Isotropic ConvNeXt vs. ViT

Comparing isotropic ConvNeXt and ViT
  • This part examines whether the ConvNeXt block design is generalizable to ViT-style isotropic architectures, which have no downsampling layers and keep the same feature resolution (e.g. 14×14) at all depths.

ConvNeXt performs generally on par with ViT, showing that the ConvNeXt block design is competitive when used in non-hierarchical models.

8.3. Object Detection and Segmentation on COCO

COCO object detection and segmentation results using Mask R-CNN and Cascade Mask R-CNN
  • Mask R-CNN and Cascade Mask R-CNN are fine-tuned on the COCO dataset with ConvNeXt backbones. Multi-scale training, the AdamW optimizer, and a 3× schedule are used.
  • Training with the ConvNeXt-B backbone consumes 17.4GB of peak memory with a per-GPU batch size of 2, while the reference number for Swin-B is 18.5GB.

Across different model complexities, ConvNeXt achieves on-par or better performance than Swin Transformer.

  • When scaled up to bigger models (ConvNeXt-B/L/XL) pre-trained on ImageNet-22K, in many cases ConvNeXt is significantly better (e.g. +1.0 AP) than Swin Transformers in terms of box and mask AP.

8.4. Semantic Segmentation on ADE20K

ADE20K validation results using UperNet
  • ConvNeXt backbones are evaluated on the semantic segmentation task with UperNet.

ConvNeXt models achieve competitive performance across different model capacities, further validating the effectiveness of the architecture design.

(While Transformers in computer vision were the mainstream research direction in 2021, ConvNeXt now outperforms them, which is awesome!)
