Review: Data Efficient Image Transformer (DeiT)

DeiT: Train Vision Transformer (ViT) Without External Data

DeiT (Data-Efficient Image Transformers)
  • The earlier Vision Transformer (ViT) needs to be pre-trained on hundreds of millions of images from external data, and it does not generalize well when trained on insufficient amounts of data.
  • The Data-Efficient Image Transformer, DeiT, is proposed. While its architecture is mostly the same as ViT, it is trained on ImageNet only, using a single computer in less than 3 days, with no external data.
  • A teacher-student strategy is introduced with a distillation token.


  1. Distillation Through Attention
  2. Experimental Results

1. Distillation Through Attention

  • DeiT has the same architecture as ViT, except that the input sequence has an additional distillation token. Since ViT does not perform well when trained on insufficient amounts of data, distillation is used as a way to train it.
  • Two things are considered: hard vs soft distillation, and classical distillation vs distillation token.

1.1. Soft Distillation

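  • Soft distillation minimizes the Kullback-Leibler (KL) divergence between the softmax of the teacher and the softmax of the student model.
  • Let Z_t be the logits of the teacher, Z_s the logits of the student, ψ the softmax function, and y the ground-truth label.
  • The distillation objective is:
    $$\mathcal{L}_{global} = (1-\lambda)\,\mathcal{L}_{CE}(\psi(Z_s), y) + \lambda\,\tau^2\,\mathrm{KL}(\psi(Z_s/\tau), \psi(Z_t/\tau))$$
  • τ is the temperature (Distillation) and λ is the coefficient balancing the KL divergence and the cross-entropy loss L_CE.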
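The soft distillation objective can be sketched in a few lines of plain Python. This is only an illustration for a single sample, not the authors' implementation: the function names are made up for this review, the default values of `tau` and `lam` are illustrative, and the KL term treats the teacher distribution as the target, which is the usual convention and an assumption here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross entropy of predicted probabilities against an integer class label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def soft_distillation_loss(z_s, z_t, y, tau=3.0, lam=0.1):
    """(1 - lam) * CE(student, y) + lam * tau^2 * KL between tempered softmaxes.

    z_s, z_t: student and teacher logits for one sample; y: true class index.
    Assumption: the teacher distribution is the KL target.
    """
    ce = cross_entropy(softmax(z_s), y)
    kl = kl_divergence(softmax(z_t, tau), softmax(z_s, tau))
    return (1.0 - lam) * ce + lam * tau ** 2 * kl
```

When the student and teacher logits are identical, the KL term vanishes and only the scaled cross-entropy term remains.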

1.2. Hard Label Distillation

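  • The hard decision of the teacher is used as a true label:
    $$y_t = \arg\max_c Z_t(c)$$
  • The objective associated with this hard-label distillation is:
    $$\mathcal{L}_{global}^{hardDistill} = \tfrac{1}{2}\,\mathcal{L}_{CE}(\psi(Z_s), y) + \tfrac{1}{2}\,\mathcal{L}_{CE}(\psi(Z_s), y_t)$$
    where Z_s and Z_t are the student and teacher logits, ψ is the softmax function, and y is the ground-truth label.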
  • For a given image, the hard label associated with the teacher may change depending on the specific data augmentation.
  • Hard labels can still be converted to soft labels using label smoothing (Inception-v3). In this paper, however, the authors do not smooth the pseudo-labels provided by the teacher (e.g., in hard distillation).
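Hard-label distillation is even simpler to write down. The sketch below, again for a single sample and with hypothetical function names, takes the teacher's argmax as a second target and averages the two cross-entropy terms with equal weight. No label smoothing is applied to the teacher's pseudo-label, in line with the point above.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]

def cross_entropy(logits, label):
    """Cross entropy of softmax(logits) against an integer class label."""
    return -log_softmax(logits)[label]

def hard_distillation_loss(z_s, z_t, y):
    """0.5 * CE(student, true label) + 0.5 * CE(student, teacher's hard label)."""
    # Hard decision of the teacher: index of its largest logit.
    y_t = max(range(len(z_t)), key=lambda c: z_t[c])
    return 0.5 * cross_entropy(z_s, y) + 0.5 * cross_entropy(z_s, y_t)
```

If the teacher agrees with the true label, the loss reduces to the ordinary cross entropy of the student.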

1.3. Distillation Token

Distillation Process
  • A new distillation token is included. It interacts with the class and patch tokens through the self-attention layers.
  • This distillation token is used in a similar fashion to the class token, except that at the output of the network its objective is to reproduce the (hard) label predicted by the teacher instead of the true label.
  • Both the class and distillation tokens input to the transformers are learned by back-propagation.
  • The distillation embedding allows the model to learn from the output of the teacher, as in a regular distillation.
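To make the token bookkeeping concrete, here is a minimal sketch of how the input sequence could be assembled and how the two output embeddings are read back. The helper names are hypothetical, the token order (class, patches, distillation) is an assumption of this sketch, and the transformer blocks themselves are left out. The 16×16 patch size is the one used by ViT-B/16.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2

def build_input_sequence(patch_tokens, class_token, distillation_token):
    """Token sequence fed to the transformer blocks.

    Assumed order for this sketch: class token, patch tokens, distillation token.
    Both extra tokens are learned vectors, like any other parameter.
    """
    return [class_token] + list(patch_tokens) + [distillation_token]

def split_output_sequence(output_tokens):
    """Pick the class and distillation embeddings from the output sequence."""
    class_embedding = output_tokens[0]          # supervised by the true label
    distillation_embedding = output_tokens[-1]  # supervised by the teacher's label
    return class_embedding, distillation_embedding
```

For a 224×224 image with 16×16 patches this gives 196 patch tokens, so the sequence has 198 tokens once the class and distillation tokens are added.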

1.4. Models & Inference

Variants of DeiT architecture
  • DeiT does not use an MLP head for pre-training, only a linear classifier.
  • DeiT-B is the ViT-B architecture trained with the proposed training scheme and distillation token. The parameters of ViT-B (and therefore of DeiT-B) are fixed as D=768, h=12 and d=D/h=64.
  • Two smaller models, DeiT-S and DeiT-Ti, are also introduced, for which the number of heads is changed while keeping d fixed.
  • When fine-tuning, a larger resolution is used (FixRes). The notation DeiT-B↑384 means that 384 is the final operating resolution.
  • At test time, both the class and the distillation embeddings produced by the transformer are associated with linear classifiers, and each is able to infer the image label.
  • The reference method is the late fusion of these two separate heads: the softmax outputs of the two classifiers are added to make the prediction.
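The late fusion of the two heads amounts to adding two probability vectors and taking the argmax. A minimal sketch for one image, assuming the logits of the class head and the distillation head are already available (the function names are made up for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion_predict(class_logits, distillation_logits):
    """Add the softmax outputs of the two linear heads and return the top class."""
    p_class = softmax(class_logits)
    p_dist = softmax(distillation_logits)
    fused = [a + b for a, b in zip(p_class, p_dist)]
    return max(range(len(fused)), key=lambda c: fused[c])
```

A confident distillation head can overturn a narrow decision of the class head: with class logits `[2.0, 1.9, 0.0]` and distillation logits `[0.0, 3.0, 0.0]`, the fused prediction is class 1 even though the class head alone would pick class 0.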

2. Experimental Results

2.1. ConvNet Teacher

ImageNet-1k top-1 accuracy of the student as a function of the teacher model used for distillation
  • Using a convnet teacher gives better performance than using a transformer.
  • That the convnet is a better teacher is probably due to the inductive bias inherited by the transformers through distillation.
  • In all of the subsequent distillation experiments, the default teacher is a RegNetY-16GF with 84M parameters, trained with the same data and the same data augmentation as DeiT. This teacher reaches 82.9% top-1 accuracy on ImageNet.

2.2. Distillation Methods

Distillation experiments on ImageNet-1k with DeiT, 300 epochs of pre-training
  • Hard distillation significantly outperforms soft distillation. Hard distillation reaches 83.0% at resolution 224×224, compared to the soft distillation accuracy of 81.8%.
  • The two tokens provide complementary information useful for classification. The classifier on the two tokens is significantly better than the independent class and distillation classifiers, which by themselves already outperform the distillation baseline.
Disagreement analysis between convnet, image transformers and distilled transformers
  • The fraction of samples classified differently is reported for all classifier pairs, i.e., the rate of different decisions.
Distillation on ImageNet-1k with DeiT-B: top-1 accuracy as a function of the training epochs
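The disagreement rate between two classifiers is simple to compute from their predicted labels on the same samples. A minimal sketch with a hypothetical helper name:

```python
def disagreement_rate(predictions_a, predictions_b):
    """Fraction of samples on which two classifiers predict different labels."""
    if len(predictions_a) != len(predictions_b):
        raise ValueError("both classifiers must be evaluated on the same samples")
    if not predictions_a:
        raise ValueError("at least one sample is required")
    differing = sum(1 for a, b in zip(predictions_a, predictions_b) if a != b)
    return differing / len(predictions_a)
```

For example, predictions `[0, 1, 2, 3]` and `[0, 1, 1, 0]` differ on two of four samples, so the rate is 0.5. Note that this measures only how often two models decide differently, not which of them is correct.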
  • With 300 epochs, the distilled network DeiT-B⚗ is already better than DeiT-B.

2.3. Efficiency vs Accuracy: A Comparison to Convnets

DeiT outperforms Vision Transformer (ViT); ⚗: with distillation (throughput measured on one 16GB V100 GPU)
  • DeiT is slightly below EfficientNet, which shows that DeiT has almost closed the gap between Vision Transformers and convnets when training on ImageNet only.
  • These results are a major improvement (+6.3% top-1 in a comparable setting) over previous ViT models trained on ImageNet-1k only.
  • It also outperforms by 1% (top-1 acc.) the ViT-B model pre-trained on JFT-300M at resolution 384 (85.2% vs 84.15%), while being significantly faster to train.
Throughput (images/s) vs accuracy on ImageNet
  • The best model on ImageNet-1k reaches 85.2% top-1 accuracy, outperforming the best ViT-B model pre-trained on JFT-300M and fine-tuned on ImageNet-1k at resolution 384 (84.15%).
  • Results on ImageNet V2 and ImageNet Real are also reported.
  • Note that the current state of the art of 88.55%, achieved with extra training data, is the ViT-H model (632M parameters) trained on JFT-300M and fine-tuned at resolution 512.
  • DeiT-B⚗ and DeiT-B⚗↑384 outperform, by some margin, the state of the art on the trade-off between accuracy and inference time on GPU.

2.4. Transfer Learning to Downstream Tasks

Transformers based models on different transfer learning task with ImageNet pre-training

2.5. Training Details & Ablation

Ablation study on training methods on ImageNet (top-1 acc.)

2.6. Training Time

  • A typical training of 300 epochs takes 37 hours with 2 nodes or 53 hours on a single 8-GPU node for DeiT-B.
  • As a comparison point, a similar training with a RegNetY-16GF (Radosavovic et al., 2020) (84M parameters) is 20% slower.
  • DeiT-S and DeiT-Ti are trained in less than 3 days on 4 GPUs.
  • Then, optionally, the model is fine-tuned at a larger resolution. Fine-tuning a DeiT-B model at resolution 384×384 takes 20 hours on 8 GPUs, which corresponds to 25 epochs.
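The figures above can be turned into rough GPU-hour and per-epoch costs. This is back-of-the-envelope arithmetic only: the helper names are hypothetical, and the 2-node run is assumed to use 8 GPUs per node (16 in total), which is not stated explicitly above.

```python
def gpu_hours(wall_clock_hours, num_gpus):
    """Total GPU-hours consumed by a run."""
    return wall_clock_hours * num_gpus

def hours_per_epoch(wall_clock_hours, epochs):
    """Average wall-clock time per epoch."""
    return wall_clock_hours / epochs

# DeiT-B pre-training, 300 epochs.
single_node = gpu_hours(53, 8)   # one 8-GPU node
two_nodes = gpu_hours(37, 16)    # assumption: 2 nodes x 8 GPUs each

# DeiT-B fine-tuning at 384x384, 25 epochs on 8 GPUs.
finetune = gpu_hours(20, 8)

print(single_node, two_nodes, finetune)
print(hours_per_epoch(53, 300), hours_per_epoch(20, 25))
```

Under these assumptions, the 2-node run finishes sooner but spends more total GPU-hours than the single-node run, and a fine-tuning epoch at 384×384 takes several times longer than a pre-training epoch on the same 8 GPUs.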


