Review: Vision Transformer (ViT)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Sik-Ho Tsang
6 min read · Feb 4, 2022

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Vision Transformer, ViT, by Google Research, Brain Team
2021 ICLR, Over 2400 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Transformer, Vision Transformer

  • The Transformer architecture has become the de facto standard for natural language processing tasks.
  • Vision Transformer (ViT), a pure Transformer applied directly to sequences of image patches, can perform very well on image classification tasks; a CNN is not necessary.


  1. Vision Transformer (ViT)
  2. Hybrid Architecture
  3. Some Training Details
  4. Experimental Results

1. Vision Transformer (ViT)

Vision Transformer (ViT) Network Architecture
  • To handle 2D images, the image x is reshaped from H×W×C into a sequence of flattened 2D patches x_p with shape N×(P²×C), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N=HW/P² is the resulting number of patches.
  • Eq. 1: The Transformer uses a constant latent vector size D through all of its layers, so the patches are flattened and mapped to D dimensions with a trainable linear projection. The output of this projection is referred to as the patch embeddings.
  • Similar to BERT’s [class] token, a learnable embedding is prepended to the sequence of embedded patches (z_0^0 = x_class).
  • Eq. 4: The state of this token at the output of the Transformer encoder (z_L^0) serves as the image representation y.
  • Both during pre-training and fine-tuning, a classification head is attached to z_L^0. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
  • Position embeddings are added to the patch embeddings to retain positional information. Standard learnable 1D position embeddings are used.
  • Eq. 2, 3: The Transformer encoder consists of alternating layers of multiheaded self-attention (MSA) and MLP blocks.
  • Layernorm (LN) is applied before every block, and residual connections after every block. The MLP contains two layers with a GELU non-linearity.
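For reference, the four equations cited in the bullets above can be written out as below. This is a transcription in the ViT paper's notation, where E is the patch projection, E_pos the position embeddings, and L the number of encoder layers:

```latex
\begin{align}
z_0 &= [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E] + E_{pos},
  \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\;
  E_{pos} \in \mathbb{R}^{(N+1) \times D} \tag{1} \\
z'_\ell &= \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1},
  \qquad \ell = 1 \ldots L \tag{2} \\
z_\ell &= \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell,
  \qquad \ell = 1 \ldots L \tag{3} \\
y &= \mathrm{LN}(z_L^0) \tag{4}
\end{align}
```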
Details of Vision Transformer model variants
  • The “Base” and “Large” models are directly adopted from BERT and the larger “Huge” model is added.
  • ViT-L/16 means the “Large” variant with 16×16 input patch size. Note that the Transformer’s sequence length is inversely proportional to the square of the patch size, and models with smaller patch size are computationally more expensive.
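The patch reshaping and the sequence-length relation N=HW/P² can be sketched in a few lines of plain Python. This is only an illustration: the function names `patchify` and `num_patches` are made up here, images are nested lists rather than tensors, and the 224×224 resolution in the usage note is an assumed example value.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into N flattened patches.

    Returns N = (H // P) * (W // P) patches in raster order, each a flat
    list of length P * P * C.
    """
    height, width = len(image), len(image[0])
    p = patch_size
    if height % p or width % p:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, p):
        for left in range(0, width, p):
            flat = []
            for row in range(top, top + p):
                for col in range(left, left + p):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


def num_patches(height, width, patch_size):
    """Sequence length N = H * W / P^2 (not counting the [class] token)."""
    return (height // patch_size) * (width // patch_size)


# Toy 4x4 image with 3 channels; each pixel stores its (row, col) position.
toy_image = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, 2)  # 4 patches, each of length 2*2*3 = 12
```

For an assumed 224×224 input, `num_patches(224, 224, 32)` gives 49 tokens and `num_patches(224, 224, 16)` gives 196, so halving the patch side quadruples the sequence length.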

Inductive Bias

  • Vision Transformer has much less image-specific inductive bias than CNNs.
  • (Please feel free to read Transformer and BERT if interested.)

2. Hybrid Architecture

  • As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN.
  • Eq. 1: In this hybrid model, the patch embedding projection E is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1×1.
  • The classification input embedding and position embeddings are added as described above.
  • The representation learning capabilities of ResNet, Vision Transformer (ViT), and the hybrid are to be evaluated.
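As a rough illustration of the 1×1 special case, each spatial location of the CNN feature map simply becomes one token, which is then projected by E as usual. The sketch below uses nested lists and a hypothetical helper name, `feature_map_to_tokens`; the feature map values are made up.

```python
def feature_map_to_tokens(feature_map):
    """Flatten an h x w x c feature map into h*w tokens of dimension c.

    With 1x1 patches, every spatial location of the feature map becomes
    one input token (before the projection E is applied).
    """
    return [list(cell) for row in feature_map for cell in row]


# Hypothetical 3x3 feature map with 2 channels.
toy_feature_map = [[[10 * r + c, -(10 * r + c)] for c in range(3)] for r in range(3)]
tokens = feature_map_to_tokens(toy_feature_map)  # 9 tokens of dimension 2
```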

3. Some Training Details

  • Typically, ViT is pre-trained on large datasets, and fine-tuned to (smaller) downstream tasks. For this, the pre-trained prediction head is removed and a zero-initialized D×K feedforward layer is attached, where K is the number of downstream classes.
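One consequence of zero-initializing the new D×K layer is that fine-tuning starts from uniform class predictions, whatever the pre-trained features are. A minimal sketch with toy sizes follows; the helper names are made up, and the zero bias is an assumption not stated above.

```python
import math


def linear_head(features, weights, bias):
    """Compute logits = features @ weights + bias for one D-dim feature vector."""
    return [
        sum(f * weights[d][k] for d, f in enumerate(features)) + bias[k]
        for k in range(len(bias))
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


D, K = 4, 3  # toy feature dimension and number of downstream classes
zero_weights = [[0.0] * K for _ in range(D)]  # zero-initialized D x K layer
zero_bias = [0.0] * K  # assumption: bias also starts at zero
probs = softmax(linear_head([0.5, -1.2, 3.0, 0.1], zero_weights, zero_bias))
```

Here every entry of `probs` is 1/K, since all logits are zero before the first gradient step.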
  • It is often beneficial to fine-tune at higher resolution than pre-training, as in FixRes. When feeding images of higher resolution, the patch size is kept the same, which results in a larger effective sequence length.
  • The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints). However, the pre-trained position embeddings may no longer be meaningful. Therefore, 2D interpolation of the pre-trained position embeddings is performed, according to their location in the original image. (This resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.)
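The position-embedding resizing can be sketched as follows. The text above only says "2D interpolation", so the bilinear scheme, the corner-aligned coordinate mapping, and the unchanged copy of the [class] embedding are all assumptions of this sketch, as is the function name.

```python
def resize_position_embeddings(pos_embed, old_grid, new_grid):
    """Bilinearly resize patch position embeddings to a new grid size.

    pos_embed: list of 1 + old_grid**2 vectors; index 0 is the [class] token.
    Returns 1 + new_grid**2 vectors. The [class] embedding is copied as-is;
    patch embeddings are treated as an old_grid x old_grid 2D map.
    """
    cls_embed, patch_embed = pos_embed[0], pos_embed[1:]
    if len(patch_embed) != old_grid * old_grid:
        raise ValueError("expected 1 + old_grid**2 position embeddings")
    dim = len(cls_embed)
    # Map new grid indices onto old grid coordinates (corners aligned).
    scale = (old_grid - 1) / (new_grid - 1) if new_grid > 1 else 0.0
    resized = [list(cls_embed)]
    for i in range(new_grid):
        y = i * scale
        y0 = min(int(y), old_grid - 1)
        y1 = min(y0 + 1, old_grid - 1)
        wy = y - y0
        for j in range(new_grid):
            x = j * scale
            x0 = min(int(x), old_grid - 1)
            x1 = min(x0 + 1, old_grid - 1)
            wx = x - x0
            top_left = patch_embed[y0 * old_grid + x0]
            top_right = patch_embed[y0 * old_grid + x1]
            bottom_left = patch_embed[y1 * old_grid + x0]
            bottom_right = patch_embed[y1 * old_grid + x1]
            resized.append([
                (1 - wy) * ((1 - wx) * top_left[d] + wx * top_right[d])
                + wy * ((1 - wx) * bottom_left[d] + wx * bottom_right[d])
                for d in range(dim)
            ])
    return resized


# Toy example: a 2x2 grid whose embeddings encode (row, col), resized to 3x3.
toy_pos = [[9.0, 9.0]] + [[float(r), float(c)] for r in range(2) for c in range(2)]
toy_resized = resize_position_embeddings(toy_pos, 2, 3)
```

In the toy example the new center patch lands halfway between the four original patches, so its embedding is their average.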

4. Experimental Results

4.1. SOTA Comparison

Comparison with state of the art on popular image classification benchmarks. The number of TPUv3-core-days is the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.
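The TPUv3-core-days definition is simple arithmetic, sketched below. The chip count and training duration in the example are made-up numbers, not figures from the paper, and the function name is hypothetical.

```python
CORES_PER_TPUV3_CHIP = 2  # as stated in the caption above


def tpu_v3_core_days(num_chips, training_days):
    """Number of TPU v3 cores used for training times training time in days."""
    return num_chips * CORES_PER_TPUV3_CHIP * training_days


# Hypothetical run: 64 chips (128 cores) for 5 days.
example_core_days = tpu_v3_core_days(64, 5)
```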

The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L on all tasks, while requiring substantially fewer computational resources to train.

  • The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets.
  • ViT models still took substantially less compute to pre-train.
Breakdown of VTAB performance in Natural, Specialized, and Structured task groups.
  • The compared methods are BiT, VIVI (a ResNet co-trained on ImageNet and YouTube), and S4L (supervised plus semi-supervised learning on ImageNet).

ViT-H/14 outperforms BiT-R152×4, and other methods, on the Natural and Structured tasks.

On the Specialized tasks, the performance of the top two models is similar.

4.2. Pretraining Data Requirement

Left: Transfer to ImageNet, Right: Linear few-shot evaluation on ImageNet versus pre-training size (ViT-b is ViT-B with all hidden dimensions halved.)
  • Left: When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization.
  • With ImageNet-21k pre-training, their performances are similar.
  • Only with JFT-300M do we see the full benefit of larger models.
  • The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.
  • Right: Random subsets of 9M, 30M, and 90M, as well as the full JFT-300M dataset, are used for pre-training.
  • ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets.
  • The same is true for ResNet152×2 and ViT-L/16.

This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.

4.3. Scaling Study

Performance versus pre-training compute for different architectures
  • The number of pre-training epochs ranges from 7 to 14, so that data size does not bottleneck the models’ performance.
  1. Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2-4× less compute to attain the same performance (average over 5 datasets).
  2. Hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size.
  3. Vision Transformers appear not to saturate within the range tried.

4.4. Inspecting Vision Transformer

Left: Filters of the initial linear embedding of RGB values of ViT-L/32. Center: Similarity of position embeddings of ViT-L/32. Right: Size of attended area by head and network depth.
  • Left: The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
  • Center: The model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings.
  • Right: The average distance in image space across which information is integrated, based on the attention weights. This “attention distance” is analogous to receptive field size in CNNs.
  • Some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model.
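The "attention distance" described above can be sketched as an attention-weighted average of distances between patch positions. This is a simplified illustration for a single head: the function name is made up, the [class] token is assumed to be excluded, and distances are measured between patch grid positions scaled by the patch size.

```python
import math


def mean_attention_distance(attn_weights, grid_size, patch_size):
    """Attention-weighted mean distance (in pixels) for one head.

    attn_weights: N x N matrix (N = grid_size**2) over patch tokens only,
    where row q holds the attention of query patch q and sums to 1.
    """
    n = grid_size * grid_size
    total = 0.0
    for q in range(n):
        q_row, q_col = divmod(q, grid_size)
        for k in range(n):
            k_row, k_col = divmod(k, grid_size)
            dist = math.hypot(q_row - k_row, q_col - k_col) * patch_size
            total += attn_weights[q][k] * dist
    return total / n


# Toy 2x2 grid of 16-pixel patches: a purely local head vs. a uniform (global) head.
local_head = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
global_head = [[0.25] * 4 for _ in range(4)]
local_distance = mean_attention_distance(local_head, 2, 16)
global_distance = mean_attention_distance(global_head, 2, 16)
```

A head that attends only to its own patch gets a distance of zero, while the uniform head gets a larger value, mirroring the small versus large "receptive field" reading of the plot.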
Representative examples of attention from the output token to the input space.
Further example attention maps

The model attends to image regions that are semantically relevant for classification.

4.5. Self-Supervision

  • Masked patch prediction for self-supervision is explored in a preliminary study, mimicking the masked language modeling task used in BERT.

With self-supervised pre-training, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.
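The masking step of masked patch prediction can be sketched roughly as below. This is not the paper's exact corruption recipe: the `mask_ratio` parameter, the single shared mask embedding, and both function names are hypothetical choices for illustration.

```python
import random


def choose_masked_patches(num_patches, mask_ratio, seed=0):
    """Pick a random subset of patch indices to mask (sorted, no repeats)."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def apply_mask(patch_embeddings, masked_indices, mask_embedding):
    """Replace the selected patch embeddings with a shared mask embedding."""
    masked = set(masked_indices)
    return [
        list(mask_embedding) if i in masked else list(vec)
        for i, vec in enumerate(patch_embeddings)
    ]


# Toy example: 8 patches with 2-dim embeddings, half of them masked.
toy_embeddings = [[float(i), float(i)] for i in range(8)]
masked_ids = choose_masked_patches(len(toy_embeddings), mask_ratio=0.5)
corrupted = apply_mask(toy_embeddings, masked_ids, mask_embedding=[-1.0, -1.0])
```

The model would then be trained to predict a property of each masked patch from the surrounding unmasked ones.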


