Review — BEiT: BERT Pre-Training of Image Transformers

BEiT, Pretraining ViT, Using Masked Image Modeling (MIM)

Sik-Ho Tsang
7 min read · Sep 1, 2022

BEiT: BERT Pre-Training of Image Transformers, by Microsoft Research
2022 ICLR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, BERT, Transformer, Vision Transformer, ViT, DALL·E

  • Bidirectional Encoder representation from Image Transformers (BEiT) is proposed, where a masked image modeling (MIM) task is used to pretrain Vision Transformers.
  • BEiT first “tokenizes” the original image into visual tokens. Then some image patches are randomly masked and fed into the backbone Transformer.
  • The pre-training objective is to recover the original visual tokens based on the corrupted image.


  1. BEiT Architecture
  2. BEiT Pretraining: Masked Image Modeling (MIM)
  3. Experimental Results
  4. Further Results Using LayerScale in CaiT and Relative Position in Shaw NAACL’18 (Paper Appendix)

1. BEiT Architecture

Overview of BEiT pre-training

1.1. Overall Approach

  • Inspired by BERT, a pre-training task is proposed, namely, masked image modeling (MIM).
  • MIM uses two views for each image, i.e., image patches and visual tokens.
  • The image is split into a grid of patches that are the input representation of backbone Transformer.
  • The image is “tokenized” to discrete visual tokens by the latent codes of discrete VAE, where discrete VAE is from DALL·E.

During pre-training, some proportion of image patches are randomly masked, and the corrupted input is fed to the Transformer. The model learns to recover the visual tokens of the original image, instead of the raw pixels of the masked patches.

1.2. Image Representation

  • Each image has two views of representation, namely, image patches and visual tokens. The two types serve as input and output representations during pre-training, respectively.

1.2.1. Image Patches

Image Patches (Cut from the first figure)
  • The 2D image of size H×W×C is split into a sequence of N = HW/P² patches x_i^p (i = 1, …, N), where (P, P) is the resolution of each patch, so each flattened patch has P²·C values.
  • The image patches x_i^p are flattened into vectors and linearly projected, similar to word embeddings in BERT.

Particularly, BEiT splits each 224×224 image into a 14×14 grid of image patches, where each patch is 16×16.
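
As a rough illustration of the patch split described above, here is a minimal NumPy sketch. The function name `patchify` and the all-zero placeholder image are assumptions made for this example, not code from the paper.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    h, w = H // P, W // P  # grid height and width
    # (h, P, w, P, C) -> (h, w, P, P, C) -> (h*w, P*P*C), row-major over the grid
    patches = image.reshape(h, P, w, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(h * w, P * P * C)

image = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder image
patches = patchify(image)
print(patches.shape)  # (196, 768): a 14x14 grid of 16x16x3 patches
```

With H = W = 224 and P = 16, this gives N = 224·224/16² = 196 patches, matching the 14×14 grid.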

1.2.2. Visual Tokens

Visual Tokens (Cut from the first figure)
  • The image is represented as a sequence of discrete tokens obtained by an “image tokenizer”, instead of raw pixels.

Specifically, the image of size H×W×C is tokenized into z = [z_1, …, z_N], where the vocabulary V = {1, …, |V|} contains discrete token indices.

  • The image tokenizer from DALL·E, learned as a discrete variational autoencoder (dVAE), is used directly.
  • There are two modules during visual token learning, namely, tokenizer and decoder.
  • The tokenizer q(z|x) maps image pixels x into discrete tokens z according to a visual codebook (i.e., vocabulary).
  • The decoder p(x|z) learns to reconstruct the input image x based on the visual tokens z.
  • The vocabulary size is set to |V| = 8192.
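
To make the "two views" concrete, the sketch below shows what the token view looks like in terms of shapes. It assumes, purely for illustration, that the tokenizer produces one score per codebook entry at each of the 14×14 grid positions and that the most likely entry is taken as the token. The random `logits` array is a stand-in for a real tokenizer output, which would come from the pretrained dVAE.

```python
import numpy as np

VOCAB_SIZE = 8192  # |V| from the text
GRID = 14          # 14 x 14 token grid, one visual token per image patch

rng = np.random.default_rng(0)
# Hypothetical stand-in for the tokenizer output: one score per codebook entry
# at every grid position.
logits = rng.normal(size=(GRID, GRID, VOCAB_SIZE))

# Take the most likely codebook entry at each position and flatten the grid.
tokens = logits.argmax(axis=-1).reshape(-1)  # z = [z_1, ..., z_N]

print(tokens.shape)  # (196,)
```

Note that the indices here run from 0 to |V| − 1, whereas the text numbers the vocabulary from 1 to |V|.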

1.3. ViT Backbone

  • Following ViT, a standard Transformer is used as the backbone network.
  • The input of the Transformer is a sequence of image patches x_i^p (i = 1, …, N).
  • The patches are linearly projected to obtain patch embeddings E x_i^p, and a special token [S] is prepended to the sequence.
  • The standard learnable 1D position embeddings E_pos are added to the patch embeddings:

H_0 = [e_[S], E x_1^p, …, E x_N^p] + E_pos

  • The output vectors of the last layer are:

H^L = [h_[S]^L, h_1^L, …, h_N^L]

which are used as the encoded representations for the image patches, where h_i^L is the vector of the i-th image patch.

  • ViT-Base is used, which is a 12-layer Transformer with a hidden size of 768 and 12 attention heads. The intermediate size of the feed-forward networks is 3072.
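
The construction of the Transformer input can be sketched with plain arrays as follows. All arrays are random placeholders for parameters that are learned in practice. The extra row `e_S` stands for the special token [S] that the paper prepends to the patch sequence, and the variable names are chosen for this example only.

```python
import numpy as np

N, PATCH_DIM, D = 196, 16 * 16 * 3, 768  # patches, flattened patch length, hidden size

rng = np.random.default_rng(0)
patches = rng.normal(size=(N, PATCH_DIM))        # placeholder for flattened patches x_i^p
E = rng.normal(scale=0.02, size=(PATCH_DIM, D))  # linear projection (learned in practice)
e_S = np.zeros((1, D))                           # embedding of the special token [S]
E_pos = rng.normal(scale=0.02, size=(N + 1, D))  # learnable 1D position embeddings

# Project the patches, prepend the special token, then add position embeddings.
H0 = np.concatenate([e_S, patches @ E], axis=0) + E_pos
print(H0.shape)  # (197, 768)
```

The sequence length is 197 because one special token is added to the 196 patches; the hidden size 768 matches ViT-Base.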

2. BEiT Pretraining: Masked Image Modeling (MIM)

2.1. Masked Image Modeling (MIM)

BEiT Masked Image Modeling (MIM) (Cut from the first figure)
  • After splitting the image into patches as described above, approximately 40% of the image patches are randomly masked, where the masked positions are denoted as M. The masked patches are replaced with a learnable embedding e_[M]. In BEiT, at most 75 patches are masked.
  • Then, the corrupted sequence of unmasked and masked patches, denoted x^M, is input into the L-layer Transformer.
  • For each masked position i ∈ M, a softmax classifier on the final hidden vector h_i^L predicts the corresponding visual token:

p_MIM(z′ | x^M) = softmax_z′(W_c h_i^L + b_c)

The pre-training objective is to maximize the log-likelihood of the correct visual tokens z_i given the corrupted image:

max Σ_{x∈D} E_M [ Σ_{i∈M} log p_MIM(z_i | x^M) ]

where D is the training corpus and M is the set of randomly masked positions.
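
The objective above amounts to a cross-entropy loss over the masked positions only. A minimal NumPy sketch follows; the function name `mim_loss`, the small hidden size, and the choice to average rather than sum over masked positions are all assumptions made for illustration.

```python
import numpy as np

def mim_loss(hidden, tokens, masked, W_c, b_c):
    """Mean negative log-likelihood of the true visual tokens at masked positions.

    hidden: (N, D) last-layer vectors, tokens: (N,) token ids,
    masked: (N,) boolean mask, W_c: (D, V) and b_c: (V,) classifier parameters.
    """
    logits = hidden[masked] @ W_c + b_c                  # (|M|, V)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(int(masked.sum())), tokens[masked]]
    return -picked.mean()

rng = np.random.default_rng(0)
N, D, V = 196, 32, 8192            # D kept small here to keep the example light
hidden = rng.normal(size=(N, D))
tokens = rng.integers(0, V, size=N)
masked = np.zeros(N, dtype=bool)
masked[:75] = True                 # 75 masked positions, as in the text

loss = mim_loss(hidden, tokens, masked, np.zeros((D, V)), np.zeros(V))
print(loss)  # all-zero weights give a uniform prediction, so the loss equals log(8192)
```

Minimizing this loss is equivalent to maximizing the log-likelihood of the correct tokens; unmasked positions do not contribute.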

  • BEiT is pretrained on the training set of ImageNet-1K.
  • The pre-training runs for about 500k steps (i.e., 800 epochs) with 2k batch size. The 500k training steps take about five days using 16 Nvidia Tesla V100 32GB GPU cards.
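
As a quick sanity check on these numbers, the step count and batch size can be converted to epochs. The sketch assumes "2k" means a batch size of 2048 and uses the standard ImageNet-1K training-set size of 1,281,167 images, neither of which is stated in the text.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # standard ImageNet-1K training-set size (assumption)
steps = 500_000
batch_size = 2048                     # "2k" read as 2048 (assumption)

epochs = steps * batch_size / IMAGENET_1K_TRAIN_IMAGES
gpu_days = 16 * 5                     # 16 GPUs for about five days

print(f"{epochs:.1f} epochs, about {gpu_days} GPU-days")
```

The result is just under 800 epochs, consistent with the figure quoted above.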

2.2. Blockwise Masking

Blockwise Masking
  • Blocks of patches are masked randomly as shown in the figure and algorithm above, instead of masking each patch individually in a random manner.
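
A possible reading of the blockwise masking procedure is sketched below in plain Python: rectangular blocks with a random size and aspect ratio are added until more than 40% of the grid is covered. The minimum block size of 16 patches and the aspect-ratio range of 0.3 to 1/0.3 follow the paper's algorithm as far as I recall it and should be treated as assumptions. The function name is made up for this example, and the cap of 75 masked patches mentioned earlier is not enforced here.

```python
import math
import random

def blockwise_mask(h=14, w=14, ratio=0.4, min_block=16, seed=None):
    """Return a set of masked (row, col) grid positions built from rectangular blocks."""
    rng = random.Random(seed)
    target = ratio * h * w
    masked = set()
    while len(masked) <= target:
        # Block size in patches: at least min_block, at most what is still needed.
        s = rng.uniform(min_block, max(min_block, target - len(masked)))
        r = rng.uniform(0.3, 1 / 0.3)                # aspect ratio of the block
        a = min(h, max(1, round(math.sqrt(s * r))))  # block height, clipped to the grid
        b = min(w, max(1, round(math.sqrt(s / r))))  # block width, clipped to the grid
        top = rng.randint(0, h - a)
        left = rng.randint(0, w - b)
        masked |= {(i, j) for i in range(top, top + a) for j in range(left, left + b)}
    return masked

mask = blockwise_mask(seed=0)
for i in range(14):  # print the grid: '#' marks a masked patch
    print("".join("#" if (i, j) in mask else "." for j in range(14)))
```

Because whole rectangles are masked, the model cannot rely on immediately neighboring visible patches as much as it could under independent per-patch masking.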

2.3. From VAE Perspective

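  • The BEiT pre-training can be viewed as variational autoencoder training. With x the original image, x̃ the masked image, and z the visual tokens, the evidence lower bound (ELBO) of log p(x | x̃) is:

Σ_{(x_i, x̃_i)∈D} log p(x_i | x̃_i) ≥ Σ_{(x_i, x̃_i)∈D} ( E_{z_i∼q_φ(z|x_i)} [log p_ψ(x_i | z_i)] − D_KL[q_φ(z | x_i), p_θ(z | x̃_i)] )

where q_φ(z | x) is the image tokenizer, p_ψ(x | z) is the decoder, and p_θ(z | x̃) recovers the visual tokens from the masked image.

  • In the first stage, the image tokenizer is obtained as a discrete variational autoencoder. Specifically, the first stage minimizes the reconstruction loss −E_{z_i∼q_φ(z|x_i)} [log p_ψ(x_i | z_i)] with a uniform prior.
  • In the second stage, the prior p_θ is learnt while keeping q_φ and p_ψ fixed. Here q_φ(z | x_i) is simplified to a one-point distribution on the most likely visual tokens ẑ_i = argmax_z q_φ(z | x_i).
  • Thus, the above equation is re-written as:

Σ_{(x_i, x̃_i)∈D} ( E_{z_i∼q_φ(z|x_i)} [log p_ψ(x_i | z_i)] + log p_θ(ẑ_i | x̃_i) )

where the second term is the proposed BEiT pre-training objective.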

3. Experimental Results

3.1. ImageNet-1K & ImageNet-22K Pretraining, Image Classification on ImageNet-1K

Top-1 accuracy on ImageNet-1K using full fine-tuning
  • A simple linear classifier is employed as the task layer. Average pooling is used to aggregate the patch representations, and the resulting global representation is fed to a softmax classifier.
  • Pre-trained BEiT significantly improves performance on both datasets.
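
The classification task layer described above is small enough to sketch directly. The example below uses random placeholder arrays in place of the encoder outputs and classifier weights, and the function name `classify` is an assumption for illustration.

```python
import numpy as np

def classify(patch_vectors, W, b):
    """Average-pool the patch representations, then apply a linear softmax classifier."""
    pooled = patch_vectors.mean(axis=0)  # global representation, shape (D,)
    logits = pooled @ W + b              # shape (num_classes,)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
probs = classify(
    rng.normal(size=(196, 768)),               # placeholder encoder outputs
    rng.normal(scale=0.02, size=(768, 1000)),  # placeholder weights for 1000 classes
    np.zeros(1000),
)
print(probs.shape, int(probs.argmax()))
```

During fine-tuning, both the classifier weights and the pretrained backbone are updated.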

BEiT improves the performance on ImageNet, which shows the effectiveness under the rich-resource setting.

  • Higher resolution improves the BEiT results by 1+ points on ImageNet.

More importantly, BEiT384 pretrained on ImageNet-1K even outperforms the supervised pre-trained ViT384 that uses ImageNet-22K, when both use the same input resolution.

Convergence curves of training DeiT from scratch and fine-tuning BEiT on ImageNet-1K

Fine-tuning BEiT not only achieves better performance, but also converges much faster than training DeiT from scratch.

3.2. Semantic Segmentation on ADE20K

Results of semantic segmentation on ADE20K
  • The task layer from SETR-PUP (Zheng et al., 2020) is used.
  • To be specific, the pretrained BEiT is used as the backbone encoder, and several deconvolution layers are added as the decoder to produce the segmentation.

BEiT achieves better performance than supervised pretraining, although BEiT does not require manual annotations for pre-training.

  • Intermediate fine-tuning on ImageNet is also tried: the pretrained BEiT is first fine-tuned on ImageNet, and the resulting model is then fine-tuned on ADE20K.

Intermediate fine-tuning further improves BEiT on semantic segmentation.

3.3. Ablation Study

Ablation studies for BEiT pre-training on image classification and semantic segmentation
  • Blockwise masking is beneficial on both tasks, especially on semantic segmentation.
  • The proposed masked image modeling (MIM) task significantly outperforms naïve pixel-level auto-encoding. The results indicate that the prediction of visual tokens is the key ingredient of BEiT.
  • Recovering all the visual tokens harms performance on downstream tasks.
  • Pre-training the model longer (800 epochs) can further improve performance on downstream tasks.

3.4. Analysis on Self-Attention Map

Self-attention map for different reference points
  • The self-attention mechanism in BEiT can separate objects.

After pre-training, BEiT learns to distinguish semantic regions using self-attention heads, without any task-specific supervision. Such knowledge acquired by BEiT potentially improves the generalization ability of fine-tuned models, especially on small-scale datasets.
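
One way to produce such a map is to take the patch at the reference point as the query and compute its attention weights over all patches. The sketch below does this for a single attention head with random placeholder query and key arrays. The function name and the head dimension of 64 are assumptions, and a real map would use the query and key projections from a pretrained model.

```python
import numpy as np

def attention_map(queries, keys, ref_index, grid=14):
    """Softmax attention of one reference patch over all patches, as a (grid, grid) map."""
    d = queries.shape[-1]
    scores = queries[ref_index] @ keys.T / np.sqrt(d)  # scaled dot-product scores
    weights = np.exp(scores - scores.max())            # stable softmax
    return (weights / weights.sum()).reshape(grid, grid)

rng = np.random.default_rng(0)
q = rng.normal(size=(196, 64))  # placeholder per-patch queries for one head
k = rng.normal(size=(196, 64))  # placeholder per-patch keys for one head
amap = attention_map(q, k, ref_index=0)
print(amap.shape)  # (14, 14)
```

Plotting `amap` as a heat map over the image shows which patches the reference patch attends to.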

4. Further Results Using LayerScale in CaiT and Relative Position in Shaw NAACL’18 (Paper Appendix)

4.1. Effects of LayerScale in CaiT & Relative Position in Shaw NAACL’18

Ablation studies of architecture variants on image classification and semantic segmentation

LayerScale in CaiT, and relative position bias in Shaw NAACL’18, improve performance on ImageNet classification and ADE20K semantic segmentation.
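
For readers unfamiliar with LayerScale, the idea is to multiply the output of each residual branch by a learnable per-channel vector before adding it back. A minimal sketch under stated assumptions: the initial value of 0.1 is illustrative only, and a toy function stands in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layerscale_residual(x, sublayer, gamma):
    """Residual connection whose branch is scaled per channel: x + gamma * sublayer(x)."""
    return x + gamma * sublayer(x)

D = 768
gamma = np.full(D, 0.1)  # learnable in practice; 0.1 is an illustrative initial value
x = np.ones((197, D))

# Toy sub-layer standing in for self-attention or the feed-forward network.
y = layerscale_residual(x, lambda t: 2.0 * t, gamma)
print(y.shape)  # (197, 768)
```

With a small initial `gamma`, each block starts close to the identity mapping, which is the property that tends to stabilize the training of deeper Transformers.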

4.2. ImageNet

Top-1 accuracy on ImageNet-1K fine-tuning

BEiT-L fine-tuned on ImageNet-22K achieves comparable performance with ViT-L trained on Google JFT-3B.

4.3. ADE20K

Performance comparison on the ADE20K semantic segmentation

The BEiT-L model obtains state-of-the-art performance on ADE20K, outperforming Swin Transformer.

DINO applies self-supervised learning to ViT using an idea similar to BYOL. BEiT instead uses the BERT pretraining concept for self-supervised learning on ViT.


[2022 ICLR] [BEiT]
BEiT: BERT Pre-Training of Image Transformers
