Review — ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
ConvNeXt V2, Improved Architecture & Pretraining Process
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
ConvNeXt V2, by KAIST, Meta AI, New York University
2023 CVPR, Over 310 Citations (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2023 [Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] 2024 [FasterViT]
==== My Other Paper Readings Are Also Over Here ====
- While self-supervised pre-training normally brings a substantial performance improvement, it is found that simply combining masked autoencoder (MAE) self-supervised learning with ConvNeXt V1 leads to subpar performance.
- In this paper, ConvNeXt V2 is proposed: a fully convolutional masked autoencoder (FCMAE) framework and a new Global Response Normalization (GRN) layer, which enhances inter-channel feature competition, are introduced for better performance.
Outline
- ConvNeXt V2: FCMAE
- ConvNeXt V2: Global Response Normalization
- Results
1. ConvNeXt V2: FCMAE
- As shown above, the approach is conceptually simple and runs in a fully convolutional manner.
- The learning signals are generated by randomly masking the raw input visuals with a high masking ratio and letting the model predict the missing parts given the remaining context.
1.1. Masking
Since the features are downsampled at different stages, the mask is generated at the last (coarsest) stage and upsampled recursively up to the finest resolution.
- A masking ratio of 0.6 is used: 60% of the 32×32 patches are randomly removed from the original input image. Minimal data augmentation is used, i.e. only random resized cropping.
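A minimal PyTorch sketch of this masking scheme (the function names and exact tensor shapes are my own illustration, not the official implementation):

```python
import torch
import torch.nn.functional as F

def random_patch_mask(batch_size, image_size=224, patch_size=32, mask_ratio=0.6):
    """Randomly mask 60% of the 32x32 patches (1 = masked, 0 = visible).
    The mask is generated at the resolution of the last stage (image_size / 32)."""
    side = image_size // patch_size              # 7 for a 224x224 input
    num_patches = side * side                    # 49 patches
    num_masked = int(mask_ratio * num_patches)   # 29 patches removed
    noise = torch.rand(batch_size, num_patches)  # one random score per patch
    ranks = noise.argsort(dim=1).argsort(dim=1)  # rank of each patch's score
    mask = (ranks < num_masked).float()          # mask the lowest-ranked patches
    return mask.view(batch_size, 1, side, side)

def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling of the coarse mask to a finer stage resolution."""
    return F.interpolate(mask, scale_factor=float(factor), mode="nearest")
```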
1.2. Encoder Design with Sparse Convolution
A key observation is that the masked image can be represented as a 2D sparse array of pixels.
- Based on this insight, it is natural to incorporate sparse convolution into the framework to facilitate pre-training of the masked autoencoder.
The standard convolution layers in the encoder are converted to sub-manifold sparse convolutions, which enables the model to operate only on the visible data points [5, 11, 12].
- As an alternative, it is also possible to apply a binary masking operation before and after the dense convolution operation. This operation has numerically the same effect as sparse convolution, but can be more computationally friendly.
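A rough sketch of this binary-masking alternative, emulating a sub-manifold sparse 7×7 depthwise convolution with a dense one (the class name is mine):

```python
import torch
import torch.nn as nn

class MaskedDepthwiseConv2d(nn.Module):
    """Masked positions are zeroed before the convolution (so they contribute
    nothing to visible outputs) and again after it (so no features are
    produced at masked locations)."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x, mask):
        # x: (B, C, H, W); mask: (B, 1, H, W) with 1 = masked, 0 = visible
        visible = 1.0 - mask
        x = self.conv(x * visible)
        return x * visible
```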
1.3. Decoder Design
A lightweight, plain ConvNeXt block is used as the decoder, forming an asymmetric encoder-decoder architecture overall.
- As shown above, other more complex decoders are tried, but the lightweight single-block decoder already performs well.
- A decoder dimension of 512 is used.
1.4. Reconstruction Target
The mean squared error (MSE) between the reconstructed and target images is computed.
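A sketch of how such a loss can be computed. Following the MAE recipe, the target here is patch-wise normalized and the loss is averaged over the masked patches only; the exact tensor layout is an assumption for illustration:

```python
import torch

def fcmae_loss(pred, target_img, mask, patch_size=32):
    """MSE reconstruction loss sketch.
    pred:       (B, patch_size**2 * 3, h, w) decoder output
    target_img: (B, 3, H, W) original image, with H = h * patch_size
    mask:       (B, 1, h, w) with 1 = masked patch, 0 = visible
    """
    B, C, H, W = target_img.shape
    h, w = H // patch_size, W // patch_size
    # patchify the target into (B, h*w, patch_size**2 * 3)
    target = target_img.reshape(B, C, h, patch_size, w, patch_size)
    target = target.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, -1)
    # per-patch normalization of the target (as in MAE)
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + 1e-6).sqrt()
    pred = pred.flatten(2).transpose(1, 2)        # (B, h*w, patch_size**2 * 3)
    loss = (pred - target).pow(2).mean(dim=-1)    # per-patch MSE
    mask = mask.flatten(1)                        # (B, h*w)
    return (loss * mask).sum() / mask.sum()       # average over masked patches only
```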
1.5. Fine-Tuning
- After pre-training, the sparse convolutions are no longer needed; conventional convolutions are used for fine-tuning.
1.6. Intermediate Results
- With sparse convolution, the model is pre-trained and fine-tuned on the ImageNet-1K (IN-1K) dataset for 800 and 100 epochs, respectively, and the top-1 IN-1K validation accuracy for a single 224×224 center crop is reported.
FCMAE pre-training clearly improves over training from scratch: it provides a better initialization than the random baseline (82.7 → 83.7).
However, the gain is not as large as self-supervised learning usually brings, so the result is not yet impressive.
2. ConvNeXt V2: Global Response Normalization
2.1. Feature Collapse in ConvNeXt V1
- It is found that ConvNeXt V1 exhibits a feature-collapse behaviour: there are many dead (dark blue) or saturated (bright yellow) feature maps, and the activations become redundant across channels.
- Global Response Normalization is used to promote feature diversity.
2.2. Global Response Normalization
Given an input feature, X, the proposed GRN unit consists of three steps: 1) global feature aggregation, 2) feature normalization, and 3) feature calibration, which aims to increase the contrast and selectivity of channels.
- Using global average pooling for aggregation does not perform well.
- 1) Norm-based feature aggregation, specifically the channel-wise L2-norm, results in better performance: G(X) = gx = {||X_1||, ||X_2||, …, ||X_C||}, i.e. one aggregated value per channel.
- 2) Next, a standard divisive normalization is applied to the aggregated values: N(||X_i||) = ||X_i|| / Σ_j ||X_j||.
- 3) Finally, the original input responses are calibrated using the computed feature normalization scores: X_i = X_i · N(G(X)_i).
- To ease optimization, two additional learnable parameters, γ and β, are added and initialized to zero, similar to BN.
- A residual connection is also added between the input and output of the GRN layer, giving the final form X_i = γ · X_i · N(G(X)_i) + β + X_i (a code sketch is given below).
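A compact PyTorch sketch of a GRN layer consistent with the steps above. It assumes channels-last features (B, H, W, C); here the aggregated norms are divided by their channel-wise mean rather than the sum, a constant factor that γ can absorb:

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization: L2 aggregation, divisive normalization,
    and calibration, with learnable gamma/beta initialized to zero and a
    residual connection."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):                                     # x: (B, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)     # per-channel L2 norm
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)  # divisive normalization
        return self.gamma * (x * nx) + self.beta + x          # calibration + residual
```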
- The average pair-wise cosine distance across the channels is computed (a small sketch is given at the end of this subsection). A higher distance value indicates more diverse features, while a lower value indicates feature redundancy.
With GRN, ConvNeXt V2 FCMAE obtains the highest feature diversity.
The above ablation shows that the proposed GRN obtains the best results.
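A sketch of the diversity measure mentioned above (the function name is mine):

```python
import torch
import torch.nn.functional as F

def feature_diversity(feat):
    """Average pair-wise cosine distance across channels of one activation map.
    feat: (C, H, W); returns a scalar in [0, 1], higher = more diverse channels."""
    C = feat.shape[0]
    x = F.normalize(feat.reshape(C, -1), dim=1)  # unit-norm per channel
    cos = x @ x.t()                              # (C, C) pairwise cosine similarity
    return ((1.0 - cos) / 2.0).mean()            # average cosine distance
```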
2.3. ConvNeXt V2 Block
- With GRN, ConvNeXt V2 Block is proposed as above (right).
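A sketch of the V2 block, reusing the GRN layer sketched above: the 7×7 depthwise convolution, LayerNorm, and the two pointwise layers follow the V1 design, GRN is inserted after the GELU, and LayerScale is dropped (drop path is omitted here for brevity):

```python
import torch
import torch.nn as nn

class ConvNeXtV2Block(nn.Module):
    """7x7 depthwise conv -> LayerNorm -> 1x1 expansion -> GELU -> GRN ->
    1x1 projection -> residual. GRN is the layer sketched above."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise convs done as Linear
        self.act = nn.GELU()
        self.grn = GRN(4 * dim)                  # applied on the expanded features
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LN / Linear / GRN
        x = self.pwconv2(self.grn(self.act(self.pwconv1(self.norm(x)))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return shortcut + x
```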
And now, with GRN, ConvNeXt V2 with FCMAE obtains even better results.
2.4. Model Scaling
- A family of 8 models of different sizes is designed, ranging from a low-capacity 3.7M-parameter Atto model to a high-capacity 650M-parameter Huge model.
3. Results
3.1. ImageNet
The combination of the two (FCMAE pre-training and the GRN-equipped architecture) results in a significant improvement in fine-tuning performance.
Compared to the plain ViT pre-trained with MAE, the proposed approach performs similarly up to the Large model regime, despite using far fewer parameters (198M vs 307M).
- However, in the huge model regime, the proposed approach slightly lags behind. This might be because a huge ViT model can benefit more from self-supervised pre-training.
Using a convolution-based architecture, the proposed approach sets a new state-of-the-art accuracy using publicly available data only (i.e. ImageNet-1K and ImageNet-22K).
3.2. Transfer Learning
On COCO, the final model, ConvNeXt V2 pre-trained with FCMAE, outperforms the Swin Transformer counterparts across all model sizes, with the largest gap achieved in the huge model regime.
On ADE20K semantic segmentation, the final model significantly improves over the V1 supervised counterparts. It also performs on par with the Swin Transformer in the base and large model regimes but outperforms Swin in the huge model regime.