Review — ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

ConvNeXt V2, Improved Architecture & Pretraining Process

Sik-Ho Tsang
6 min read · Jul 14, 2024
ConvNeXt V2 model scaling performs significantly better than the previous ConvNeXt V1

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
ConvNeXt V2, by KAIST, Meta AI, New York University
2023 CVPR, Over 310 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] 2024 [FasterViT]
==== My Other Paper Readings Are Also Over Here ====

  • While SSL pre-training would normally be expected to bring a substantial performance improvement, it is found that simply combining masked autoencoder (MAE) self-supervised learning with ConvNeXt V1 leads to subpar performance.
  • In this paper, ConvNeXt V2 is proposed, in which a fully convolutional masked autoencoder (FCMAE) framework and a new Global Response Normalization (GRN) layer are introduced to enhance inter-channel feature competition for better performance.

Outline

  1. ConvNeXt V2: FCMAE
  2. ConvNeXt V2: Global Response Normalization
  3. Results

1. ConvNeXt V2: FCMAE

FCMAE With Sparse Convolution
  • As shown above, the approach is conceptually simple and runs in a fully convolutional manner.
  • The learning signals are generated by randomly masking the raw input visuals with a high masking ratio and letting the model predict the missing parts given the remaining context.

1.1. Masking

Since the features are downsampled at different stages, the mask is generated at the last stage and then upsampled recursively up to the finest resolution.

  • A masking ratio of 0.6 is used: 60% of the 32 × 32 patches are randomly removed from the original input image. Minimal data augmentation is used, i.e. only random resized cropping. (A sketch of this mask generation is given below.)
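As a rough illustration of this masking scheme, the sketch below generates a random 60% patch mask at the last-stage resolution and upsamples it to a finer stage. The function names and the nearest-neighbour upsampling are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def generate_mask(batch_size, image_size=224, patch_size=32, mask_ratio=0.6):
    """Randomly mask 60% of the 32x32 patches (1 = masked, 0 = visible)."""
    num_patches = (image_size // patch_size) ** 2      # e.g. 7 * 7 = 49
    num_masked = int(mask_ratio * num_patches)
    noise = torch.rand(batch_size, num_patches)        # random score per patch
    ids = noise.argsort(dim=1)                          # shuffled patch indices
    mask = torch.zeros(batch_size, num_patches)
    mask.scatter_(1, ids[:, :num_masked], 1.0)          # first `num_masked` become masked
    side = image_size // patch_size
    return mask.reshape(batch_size, 1, side, side)      # (B, 1, 7, 7)

def upsample_mask(mask, scale):
    """Nearest-neighbour upsampling so the mask matches a finer feature resolution."""
    return mask.repeat_interleave(scale, dim=2).repeat_interleave(scale, dim=3)

# Example: mask generated at the last-stage resolution (7x7 for a 224 input),
# then upsampled to the stem resolution (56x56).
mask = generate_mask(batch_size=2)            # (2, 1, 7, 7)
mask_stage1 = upsample_mask(mask, scale=8)    # (2, 1, 56, 56)
```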

1.2. Encoder Design with Sparse Convolution

A key observation is that the masked image can be represented as a 2D sparse array of pixels.

  • Based on this insight, it is natural to incorporate sparse convolution into the framework to facilitate pre-training of the masked autoencoder.

The standard convolution layers in the encoder are converted to submanifold sparse convolutions, which enables the model to operate only on the visible data points [5, 11, 12].

  • As an alternative, it is also possible to apply a binary masking operation before and after a dense convolution operation (see the sketch below). This has numerically the same effect as sparse convolution, but can be more computationally friendly.
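The sketch below illustrates this masking-before-and-after idea around a dense depthwise convolution. The layer sizes and the mask convention (1 = masked, 0 = visible) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def masked_conv(x, conv, mask):
    """
    x:    (B, C, H, W) feature map
    conv: a dense nn.Conv2d layer
    mask: (B, 1, H, W) binary mask, 1 = masked, 0 = visible
    """
    visible = 1.0 - mask
    x = x * visible   # zero out masked locations before the convolution
    x = conv(x)
    x = x * visible   # re-zero locations touched by the conv's receptive field
    return x

# Example with a ConvNeXt-style 7x7 depthwise convolution at the stem resolution.
conv = nn.Conv2d(96, 96, kernel_size=7, padding=3, groups=96)
x = torch.randn(2, 96, 56, 56)
mask = torch.zeros(2, 1, 56, 56)   # e.g. an upsampled patch mask from above
out = masked_conv(x, conv, mask)
```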

1.3. Decoder Design

A lightweight, plain ConvNeXt block is used as the decoder, forming an asymmetric encoder-decoder architecture overall.

  • As shown above, other more complicated decoders are also tried, but the lightweight one already works well.
  • The decoder dimension is set to 512.

1.4. Reconstruction Target

The mean squared error (MSE) between the reconstructed and target images is computed.
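A minimal sketch of such a reconstruction loss is given below. Restricting the average to the masked region follows common MAE-style practice and is an assumption about the exact implementation, as are the tensor shapes.

```python
import torch

def reconstruction_loss(pred, target, mask):
    """
    pred, target: (B, C, H, W) reconstructed and original images
    mask:         (B, 1, H, W) binary mask, 1 = masked, 0 = visible
    """
    per_pixel = (pred - target) ** 2                  # element-wise squared error
    per_pixel = per_pixel.mean(dim=1, keepdim=True)   # average over channels
    # Assumption: MAE-style averaging over the masked region only.
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1)
```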

1.5. Fine-Tuning

Conventional convolution is used for fine-tuning
  • After pre-training, the sparse convolution layers can be converted back to conventional convolutions, which are then used for fine-tuning.

1.6. Intermediate Results

Sparse Convolution
  • With sparse convolution, the model is pre-trained and fine-tuned on the ImageNet-1K (IN-1K) dataset for 800 and 100 epochs, respectively, and the top-1 IN-1K validation accuracy for a single 224×224 center crop is reported.

Using sparse convolution on the visible patches yields a clear improvement over applying dense convolution directly to the masked input.

FCMAE Self-Supervised Learning

However, although FCMAE pre-training provides a better initialization than the random baseline (82.7 → 83.7), the gain is smaller than what SSL usually brings.

2. ConvNeXt V2: Global Response Normalization

2.1. Feature Collapse in ConvNeXt V1

Feature Activation Visualization
  • It is found that ConvNeXt V1 exhibits a feature collapse behaviour: there are many dead (dark blue) or saturated (bright yellow) feature maps, and the activations become redundant across channels.
  • Global Response Normalization is introduced to promote feature diversity.

2.2. Global Response Normalization

Global Response Normalization
Global Response Normalization Pseudo Code

Given an input feature X, the proposed GRN unit consists of three steps: 1) global feature aggregation, 2) feature normalization, and 3) feature calibration, which together aim to increase the contrast and selectivity of channels.

  • Using global average pooling for aggregation does not perform well.
  • 1) Norm-based feature aggregation, specifically the L2-norm, works better: G(X) = gx = {||X₁||, ||X₂||, …, ||X_C||}, where ||Xᵢ|| is the L2-norm of the i-th channel over the spatial dimensions.
  • 2) Next, a standard divisive normalization is applied: N(||Xᵢ||) = ||Xᵢ|| / Σⱼ ||Xⱼ||.
  • 3) Finally, the original input responses are calibrated using the computed feature normalization scores: Xᵢ = Xᵢ · N(G(X)ᵢ).
  • To ease optimization, two additional learnable parameters, γ and β, are added and initialized to zero, similar to BN.
  • A residual connection is also added between the input and output of the GRN layer; a minimal sketch of the full layer is given after this list.
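Following the three steps and the pseudo code above, a minimal PyTorch sketch of the GRN layer (for a channels-last input (N, H, W, C), as used inside ConvNeXt blocks) could look as follows. Note that dividing by the mean of the channel norms, as in the pseudo code, differs from dividing by their sum only by a constant factor.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization over a channels-last input (N, H, W, C)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        # Learnable affine parameters, initialized to zero so GRN starts close to identity.
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # 1) Global feature aggregation: per-channel L2 norm over spatial positions.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # (N, 1, 1, C)
        # 2) Divisive normalization across channels (mean instead of sum: constant factor).
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)   # (N, 1, 1, C)
        # 3) Calibration of the input responses, plus the residual connection.
        return self.gamma * (x * nx) + self.beta + x
```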
Feature Diversity
  • The average pair-wise cosine distance across the channels is estimated. A higher distance value indicates more diverse features, while a lower value indicates feature redundancy.

With GRN, ConvNeXt V2 with FCMAE obtains the highest feature diversity (a sketch of the diversity metric is given below).
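A rough sketch of how such a per-layer diversity score can be computed is given below; the exact normalization used by the authors may differ slightly.

```python
import torch
import torch.nn.functional as F

def channel_diversity(feat):
    """
    feat: (C, H, W) activation map from one layer.
    Returns the average pair-wise cosine distance across channels
    (higher = more diverse, lower = more redundant).
    """
    c = feat.shape[0]
    flat = F.normalize(feat.reshape(c, -1), dim=1)   # unit-norm channel vectors
    cos = flat @ flat.t()                            # (C, C) cosine similarities
    dist = (1.0 - cos) / 2.0                         # map similarity to a [0, 1] distance
    return dist.mean()                               # average over all channel pairs
```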

Global Response Normalization

The above ablation shows that the proposed GRN obtains the best results.

2.3. ConvNeXt V2 Block

ConvNeXt V1 Block vs ConvNeXt V2 Block
  • With GRN added, the ConvNeXt V2 Block is proposed as above (right); a minimal sketch is given below.
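A minimal sketch of the resulting block is shown below, with GRN inserted after the GELU activation in the MLP part; LayerScale (used in V1) is omitted here, as are stochastic depth and initialization details. It reuses the GRN module sketched in Section 2.2.

```python
import torch
import torch.nn as nn

class ConvNeXtV2Block(nn.Module):
    """ConvNeXt V2 block: depthwise conv -> LayerNorm -> MLP with GELU and GRN."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim, eps=1e-6)   # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # 1x1 conv expressed as Linear (channels-last)
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)                   # GRN module sketched in Section 2.2
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                         # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                 # to channels-last (N, H, W, C)
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)                 # back to channels-first
        return shortcut + x                       # residual connection
```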
ConvNeXt V2 + FCMAE

And now, with GRN, ConvNeXt V2 with FCMAE obtains even better results.

2.4. Model Scaling

  • A range of 8 models with different sizes is designed, from a low-capacity 3.7M-parameter Atto model to a high-capacity 650M-parameter Huge model.

3. Results

3.1. ImageNet

Co-design matters.

The combination of the two, i.e. the GRN-equipped architecture and FCMAE pre-training, results in a significant improvement in fine-tuning performance.

Compared to the plain ViT pre-trained with MAE, the proposed approach performs similarly up to the Large model regime, despite using far fewer parameters (198M vs 307M).

  • However, in the Huge model regime, the proposed approach slightly lags behind. This might be because a huge ViT model benefits more from self-supervised pre-training.

Using a convolution-based architecture, the proposed approach sets a new state-of-the-art accuracy using publicly available data only (i.e. ImageNet-1K and ImageNet-22K).

3.2. Transfer Learning

COCO & ADE20K

On COCO, the final proposal, ConvNeXt V2 pre-trained with FCMAE, outperforms the Swin Transformer counterparts across all model sizes, with the largest gap achieved in the huge model regime.

On ADE20K semantic segmentation, the final model significantly improves over the V1 supervised counterparts. It also performs on par with the Swin Transformer in the base and large model regimes but outperforms Swin in the huge model regime.
