Review — DeiT III: Revenge of the ViT

ViT Training Revisited: a Much Better ViT, DeiT III, Is Proposed

Sik-Ho Tsang
5 min read · Apr 19, 2023

DeiT III: Revenge of the ViT
DeiT III, by Meta AI and Sorbonne University
2022 ECCV, Over 60 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====

  • The supervised training of ViTs is revisited.
  • The procedure builds upon and simplifies a recipe introduced for training ResNet-50.
  • It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning.


  1. Revisit Training & Pre-Training for Vision Transformers
  2. Results

1. Revisit Training & Pre-Training for Vision Transformers

Summary of the proposed training procedures with ImageNet-1k and ImageNet-21k.
  • The proposed training recipe is based on ResNet Strikes Back and DeiT. The different ingredients are listed in the table above.
  • The most important ingredients are described below.

1.1. 3-Augment

Illustration of the 3 types of data-augmentations used in 3-Augment.
  • RandAugment is widely employed for ViTs, yet its policy was originally learned for convnets. Since the architectural priors and biases of these architectures differ substantially, the augmentation policy may not be well adapted, and may even be overfitted given the large number of choices involved in its selection.

3-Augment is proposed, which is a simple data augmentation inspired by what is used in self-supervised learning (SSL), as above.

Grayscale: This favors color invariance and gives more focus to shapes.

Solarization: This adds strong color noise, making the model more robust to variations in color intensity and thus more focused on shape.

Gaussian Blur: This slightly alters the details in the image.

  • The common color-jitter and horizontal flip are still used.
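The policy above can be sketched in plain Python: one of the three augmentations is chosen uniformly per image (the per-channel operations below are illustrative; a real pipeline would use library transforms such as torchvision's, and the blur stand-in is a placeholder since blurring needs spatial context).

```python
import random

def grayscale(pixel):
    # ITU-R BT.601 luminance; favors shape over color.
    r, g, b = pixel
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)

def solarize(pixel, threshold=128):
    # Invert channels at or above the threshold: strong color noise.
    return tuple(255 - c if c >= threshold else c for c in pixel)

def blur_placeholder(pixel):
    # Stand-in for Gaussian blur, which needs neighbouring pixels;
    # a real pipeline would use e.g. a GaussianBlur transform.
    return pixel

def three_augment(pixels, rng=random):
    # Pick exactly one of the 3 augmentations per image, uniformly.
    # Color jitter and horizontal flip are still applied separately.
    op = rng.choice([grayscale, solarize, blur_placeholder])
    return [op(p) for p in pixels], op.__name__
```

The key design point, as in the self-supervised recipes that inspired 3-Augment, is the uniform choice over a tiny set of transforms rather than a large learned policy like RandAugment.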
Ablation of the components of the proposed data-augmentation strategy with ViT-B on ImageNet-1k.
  • The ablation on the proposed different data-augmentation components is shown above.

1.2. Simple Random Crop (SRC)

Example of crops selected by two strategies: Resized Crop and Simple Random Crop.
  • Left: Random Resized Crop (RRC) is commonly used. This cropping strategy however introduces some discrepancy between train and test images, as mentioned in FixRes.
  • RRC provides a lot of diversity and very different sizes for crops.

Right: Simple Random Crop (SRC) is a much simpler way to extract crops.

The image is resized such that the smallest side matches the training resolution. Then a reflect padding of 4 pixels is applied on all sides, and finally a square crop of the training size is applied, randomly positioned along the x-axis of the image.

SRC covers a much larger fraction of the image overall and preserves the aspect ratio, but offers less diversity.

  • SRC crops overlap significantly. As a result, when training on ImageNet-1k, performance is better with the commonly used RRC.

However, in the case of ImageNet-21k (×10 bigger than ImageNet-1k), there is less risk of overfitting, so the increased regularisation and diversity offered by RRC are less important. In this context, SRC offers the advantage of reducing the discrepancy in apparent size and aspect ratio.
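The SRC geometry described above can be sketched as a pure coordinate computation (the vertical centring is an assumption, since the paper only specifies a random offset along the x-axis; applying the box to real images is left to an image library):

```python
import random

def src_crop_box(width, height, train_res, rng=random):
    # Simple Random Crop (SRC) geometry:
    # 1) resize so the smallest side equals the training resolution,
    #    preserving the aspect ratio;
    scale = train_res / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # 2) reflect-pad 4 pixels on every side;
    pad_w, pad_h = new_w + 8, new_h + 8
    # 3) square crop of the training size, x-offset chosen at random
    #    (vertical centring is an assumption, not from the paper).
    x = rng.randint(0, pad_w - train_res)
    y = (pad_h - train_res) // 2
    return (x, y, x + train_res, y + train_res)
```

Because the crop side always equals the training resolution, consecutive SRC crops of the same image overlap heavily, which is exactly the reduced-diversity trade-off discussed above.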

Illustration of Random Resized Crop (RRC) and Simple Random Crop (SRC).
  • RRC is relatively aggressive in terms of cropping, and in many cases the labelled object is not even present in the crop, as shown in the figure above.
  • For instance, with RRC, one crop in the left example contains no zebra, and three of the crops in the middle example contain no train.

2. Results

2.1. Impact of Training Duration

Top-1 accuracy on ImageNet-1k only at resolution 224×224 with our training recipes and a different number of epochs.

ViT models using the proposed training recipe do not saturate as rapidly as the DeiT training procedure.

2.2. Data Augmentation

Comparison of some existing data-augmentation methods with the proposed simple 3-Augment proposal inspired by data-augmentation used with self-supervised learning.

With the ViT architecture, the proposed 3-Augment data augmentation is the most effective while being simpler than the other approaches.

Ablation on different training component with training at resolution 224 × 224 on ImageNet-1k.

More training components are ablated, with similar observations.

Ablation path: augmentation and regularization with ImageNet-21k pre-training (at resolution 224×224) and ImageNet-1k fine-tuning.

Under ImageNet-21k pre-training condition, the proposed training recipe, using SRC and 3-Augment (3A), trained using 224² resolution, obtains the best results.

2.3. Impact of Training Resolution

ViT architectures pre-trained on ImageNet-1k only with different training resolution followed by a fine-tuning at resolution 224 × 224.

The proposed training recipe still benefits from the FixRes effect: by training at resolution 192×192 (or 160×160), we obtain better performance at 224×224 after a slight fine-tuning than when training from scratch at 224×224.
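This low-resolution pre-training followed by a brief high-resolution fine-tuning can be summarized as a two-stage schedule (the epoch counts below are illustrative placeholders, not the paper's exact values):

```python
def fixres_schedule(train_res=192, test_res=224,
                    pretrain_epochs=400, finetune_epochs=20):
    # FixRes-style recipe: train most epochs at a lower resolution,
    # then fine-tune briefly at the test resolution. Epoch counts are
    # placeholders for illustration only.
    return [
        {"stage": "pretrain", "resolution": train_res,
         "epochs": pretrain_epochs},
        {"stage": "finetune", "resolution": test_res,
         "epochs": finetune_epochs},
    ]
```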

2.4. SOTA Comparison on Image Classification

Generalization experiment: top-1 accuracy on ImageNet1k-val versus ImageNet-V2 for the models in the two tables below.

By looking at ImageNet-val top-1 accuracy vs ImageNet-V2 top-1 accuracy, the restricted choice of hyperparameters and variants in the proposed recipe does not lead to (too much) overfitting (the dots lie above the straight line).

Left: Classification with Imagenet1k training. Right: Classification with Imagenet-21k pre-training.

The proposed approach gives comparable or better performance on both ImageNet-1k and ImageNet-V2.

2.5. SOTA Comparison on Self-Supervised Pretraining

Comparison of self-supervised pre-training with the proposed approach

For an equivalent number of epochs, the proposed approach gives comparable performance on ImageNet-1k and better on ImageNet-V2.

Transfer learning performance on 6 datasets with different test-time crop ratio. ViT-B pre-trained on ImageNet-1k at resolution 224.

2.6. Downstream Performance

Left: Different transfer learning tasks with ImageNet-1k pre-training. Right: ADE20K semantic segmentation performance using UperNet.

Left: On iNaturalist, the test-time crop ratio has a significant impact on performance. Fine-tuning only the attention layers in the transfer learning experiments on Flowers improves performance by 0.2%.

Right: Vanilla ViTs trained with the proposed training recipes have a better FLOPs-accuracy trade-off than recent architectures like XCiT or Swin.

2.7. Training With Other Architectures

The performance reached with the proposed training recipe with 400 epochs at resolution 224 × 224 for other Transformers architectures.
  • For some architectures, like PiT or CaiT, the proposed training method improves performance.
  • For others, like TNT, the proposed approach is neutral, while for architectures like Swin, it decreases performance.
  • This is consistent with the findings of ResNet Strikes Back and illustrates the need to improve the training procedure in conjunction with the architecture to obtain robust conclusions.


