Review — DeiT III: Revenge of the ViT
DeiT III: Revenge of the ViT,
DeiT III, by Meta AI and Sorbonne University
2022 ECCV, Over 60 Citations (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====
Outline
- Revisit Training & Pre-Training for Vision Transformers
- Results
1. Revisit Training & Pre-Training for Vision Transformers
- The proposed training recipe is based on ResNet Strikes Back and DeiT. The different ingredients are listed in the table above.
- The most important ingredients are described below.
1.1. 3-Augment
- RandAugment is widely employed for ViTs, yet its policy was originally learned for convnets. Since the architectural priors and biases of these two families are quite different, the augmentation policy may not be well adapted, and may even be overfitted given the large number of choices involved in its selection.
3-Augment is proposed: a simple data augmentation scheme inspired by what is used in self-supervised learning (SSL), as shown above. One of the following three transformations is applied to each image (a minimal code sketch is given at the end of this subsection):
- Grayscale: This favors color invariance and puts more focus on shapes.
- Solarization: This adds strong noise to the colors, making the model more robust to variations in color intensity and hence more focused on shape.
- Gaussian Blur: This slightly alters the details in the image.
- The common color-jitter and horizontal flip are still used.
- The ablation of the different data-augmentation components is shown above.
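As an illustration, below is a minimal PyTorch/torchvision sketch of a 3-Augment-style pipeline. This is not the official DeiT III implementation: the crop choice, color-jitter strength, blur kernel and solarization threshold are assumptions.

```python
import random
from torchvision import transforms

class ThreeAugment:
    """Apply exactly one of {grayscale, solarization, Gaussian blur} per image,
    in the spirit of 3-Augment (DeiT III). Parameter values are assumptions."""
    def __init__(self):
        self.choices = [
            transforms.Grayscale(num_output_channels=3),              # color invariance, focus on shape
            transforms.RandomSolarize(threshold=128, p=1.0),          # strong color noise
            transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)), # slightly alter details
        ]

    def __call__(self, img):
        return random.choice(self.choices)(img)

# Training pipeline: crop + flip, then 3-Augment, then the usual color jitter
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # or SRC, see Section 1.2
    transforms.RandomHorizontalFlip(),
    ThreeAugment(),
    transforms.ColorJitter(0.3, 0.3, 0.3),
    transforms.ToTensor(),
])
```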
1.2. Simple Random Crop (SRC)
- Left: Random Resized Crop (RRC) is commonly used. This cropping strategy, however, introduces some discrepancy between training and test images, as noted in FixRes.
- RRC provides a lot of diversity and crops of very different sizes.
- Right: Simple Random Crop (SRC) is a much simpler way to extract crops (a code sketch is given at the end of this subsection).
The image is resized so that its smallest side matches the training resolution. Then a reflect padding of 4 pixels is applied on all sides, and finally a square crop of the training size is taken at a position randomly selected along the x-axis of the image.
SRC covers a much larger fraction of the image overall and preserves the aspect ratio, but offers less diversity.
- SRC crops overlap significantly. As a result, when training on ImageNet-1k, performance is better with the commonly used RRC.
However, in the case of ImageNet-21k (×10 bigger than ImageNet-1k), there is less risk of overfitting, and the increased regularisation and diversity offered by RRC are less important. In this context, SRC offers the advantage of reducing the discrepancy in apparent size and aspect ratio.
- RRC is relatively aggressive in terms of cropping: in many cases, as shown in the figure above, the labelled object is not even present in the crop.
- For instance, with RRC, one crop in the left example contains no zebra, and three of the crops in the middle example contain no train.
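Below is a minimal torchvision sketch of SRC as described above. It is an illustration rather than the reference implementation: torchvision's RandomCrop with padding also jitters the crop slightly along the y-axis, whereas the paper selects the crop position along the x-axis only.

```python
from torchvision import transforms

def simple_random_crop(train_size: int = 224):
    """Simple Random Crop (SRC) as described in DeiT III: resize the smallest side
    to the training resolution, reflect-pad by 4 pixels on all sides, then take a
    random square crop of the training size."""
    return transforms.Compose([
        transforms.Resize(train_size),                # smallest side -> train_size
        transforms.RandomCrop(train_size, padding=4,  # 4-pixel reflect padding
                              padding_mode='reflect'),
        transforms.RandomHorizontalFlip(),
    ])
```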
2. Results
2.1. Impact of Training Duration
ViT models trained with the proposed recipe do not saturate as rapidly as those trained with the DeiT procedure when the training duration increases.
2.2. Data Augmentation
With the ViT architecture, the proposed 3-Augment data augmentation is the most effective while being simpler than the other approaches.
More settings are tested, with similar observations.
Under the ImageNet-21k pre-training setting, the proposed training recipe, using SRC and 3-Augment (3A) and trained at 224² resolution, obtains the best results.
2.3. Impact of Training Resolution
The proposed training recipe still benefits from the FixRes effect. By training at resolution 192×192 (or 160×160), a slight fine-tuning at 224×224 gives better performance than training from scratch at 224×224.
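As a sketch of what changing the resolution involves for a ViT, the absolute position embeddings can be resized by bicubic interpolation before the short fine-tuning at the larger resolution. This is a generic sketch, not the authors' code; the grid sizes below assume a ViT-B/16 moving from 192×192 (12×12 patches) to 224×224 (14×14 patches).

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=12, new_grid=14, num_prefix_tokens=1):
    """Bicubic interpolation of ViT position embeddings when changing the input
    resolution. `pos_embed` has shape (1, num_prefix_tokens + old_grid**2, dim)."""
    prefix = pos_embed[:, :num_prefix_tokens]   # class-token embedding(s), kept as-is
    grid = pos_embed[:, num_prefix_tokens:]     # patch position embeddings
    dim = grid.shape[-1]
    grid = grid.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode='bicubic', align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([prefix, grid], dim=1)

# Example: ViT-B/16 pre-trained at 192x192, fine-tuned at 224x224
pos_embed_192 = torch.randn(1, 1 + 12 * 12, 768)
pos_embed_224 = resize_pos_embed(pos_embed_192)  # shape (1, 1 + 14*14, 768)
```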
2.4. SOTA Comparison on Image Classification
By looking at ImageNet-val top-1 accuracy vs ImageNet-V2 top-1 accuracy, the restricted choice of hyperparameters and variants in the proposed recipe does not lead to (too much) overfitting (the dots above the straight line).
The proposed approach gives comparable or better performance on both ImageNet-1k and ImageNet-V2.
2.5. SOTA Comparison on Self-Supervised Pretraining
For an equivalent number of epochs, the proposed approach gives comparable performance on ImageNet-1k and better on ImageNet-V2.
2.6. Downstream Performance
Left: Whether to fine-tune the full model or only the attention layers has a significant impact on performance on iNaturalist. For transfer learning on Flowers, fine-tuning only the attention layers improves performance by 0.2% (a minimal sketch of attention-only fine-tuning follows below).
Right: Vanilla ViTs trained with the proposed training recipes have a better FLOPs-accuracy trade-off than recent architectures like XCiT or Swin.
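Below is a minimal sketch of attention-only fine-tuning, assuming a timm ViT whose attention sub-modules are named `attn` and whose classifier is `head`; the module names, learning rate and weight decay are assumptions, not the authors' exact setup.

```python
import torch
import timm

# Hypothetical transfer-learning setup: Flowers-102 has 102 classes
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=102)

for name, param in model.named_parameters():
    # Train only the attention blocks and the (new) classification head; freeze the rest
    param.requires_grad = ('attn' in name) or name.startswith('head')

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.05)
```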
2.7. Training With Other Architectures
- For some architectures, like PiT or CaiT, the proposed training method improves performance.
- For others, like TNT, the proposed approach is neutral, and for architectures like Swin, it decreases performance.
- This is consistent with the findings of ResNet Strikes Back and illustrates the need to improve the training procedure in conjunction with the architecture to obtain robust conclusions.