Review — BEiT: BERT Pre-Training of Image Transformers

BEiT, Pretraining ViT, Using Masked Image Modeling (MIM)

7 min read · Sep 1, 2022

BEiT: BERT Pre-Training of Image Transformers
BEiT, by Microsoft Research
2022 ICLR, Over 300 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, BERT, Transformer, Vision Transformer, ViT, DALL·E

Bidirectional Encoder representation from Image Transformers (BEiT) is proposed, where a masked image modeling (MIM) task is used to pretrain Vision Transformers.
BEiT first “tokenizes” the original image into visual tokens. Then some image patches are randomly masked and fed into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image.

Outline

BEiT Architecture
BEiT Pretraining: Masked Image Modeling (MIM)
Experimental Results
Further Results Using LayerScale in CaiT and Relative Position in Shaw NAACL’18 (Paper Appendix)

1. BEiT Architecture

1.1. Overall Approach

Inspired by BERT, a pre-training task is proposed, namely, masked image modeling (MIM).
MIM uses two views for each image, i.e., image patches and visual tokens.
The image is split into a grid of patches that are the input representation of the backbone Transformer.
The image is “tokenized” into discrete visual tokens using the latent codes of the discrete VAE from DALL·E.

During pre-training, some proportion of the image patches are randomly masked, and the corrupted input is fed to the Transformer. The model learns to recover the visual tokens of the original image, instead of the raw pixels of the masked patches.

1.2. Image Representation

Each image has two views of representation, namely, image patches and visual tokens. The two types serve as input and output representations during pre-training, respectively.

1.2.1. Image Patches

**Image Patches** (Cut from the first figure)

The 2D image of size H×W×C is split into a sequence of N = HW/P² patches $x_i^p$ (i = 1, …, N), each of resolution P×P.
The image patches $x_i^p$ are flattened into vectors and linearly projected, similar to word embeddings in BERT.

Particularly, BEiT splits each 224×224 image into a 14×14 grid of image patches, where each patch is 16×16.
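The patch splitting above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming a channels-last image array; the function name `patchify` is made up for this example and is not from the paper.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into N = HW / P^2 flattened patches of length P*P*C."""
    h, w, c = image.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    grid = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * c)

# A dummy 224x224 RGB image gives a 14x14 grid, i.e. 196 patches of 16*16*3 = 768 values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = patchify(image)
print(patches.shape)  # (196, 768)
```

Patches are ordered row by row, so patch 0 is the top-left 16×16 block and patch 1 is the block to its right.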

1.2.2. Visual Tokens

**Visual Tokens** (Cut from the first figure)

The image is represented as a sequence of discrete tokens obtained by an “image tokenizer”, instead of raw pixels.

Specifically, the image of size H×W×C is tokenized into $z = [z_1, \ldots, z_N]$, where the vocabulary $V = \{1, \ldots, |V|\}$ contains discrete token indices.

The image tokenizer learned by the discrete variational autoencoder (dVAE) of DALL·E is used directly.
There are two modules during visual token learning, namely, tokenizer and decoder.
The tokenizer q(z|x) maps image pixels x into discrete tokens z according to a visual codebook (i.e., vocabulary).
The decoder p(x|z) learns to reconstruct the input image x based on the visual tokens z.
The vocabulary size is set to |V| = 8192.
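To make the shape of the tokenizer output concrete, here is a toy sketch. It assumes the tokenizer produces one score per codebook entry at each position of a 14×14 grid and keeps the highest-scoring index. Random numbers stand in for the real dVAE encoder, which is not reproduced here. Indices are 0-based in the code, while the text writes the vocabulary as 1 to |V|.

```python
import numpy as np

VOCAB_SIZE = 8192  # |V| from the text
GRID = 14          # assumption: one visual token per 16x16 patch of a 224x224 image

def tokenize_from_logits(logits: np.ndarray) -> np.ndarray:
    """Pick the most likely codebook index at every grid position and flatten to length N."""
    assert logits.shape == (GRID, GRID, VOCAB_SIZE)
    return logits.argmax(axis=-1).reshape(-1)

rng = np.random.default_rng(0)
# Stand-in for the tokenizer scores q(z|x); a real dVAE encoder would compute these from pixels.
fake_logits = rng.normal(size=(GRID, GRID, VOCAB_SIZE))
z = tokenize_from_logits(fake_logits)
print(z.shape, int(z.min()) >= 0, int(z.max()) < VOCAB_SIZE)
```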

1.3. ViT Backbone

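Following ViT, a standard Transformer is used as the backbone network.
The input of the Transformer is a sequence of image patches $x_i^p$ (i = 1, …, N).
The patches are then linearly projected to obtain patch embeddings $E x_i^p$, and a special token [S] is prepended to the sequence.
The standard learnable 1D position embeddings $E_{pos}$ are added to the patch embeddings:

$$H_0 = [e_{[S]}, E x_1^p, \ldots, E x_N^p] + E_{pos}$$

The encoder contains L layers of Transformer blocks:

$$H^l = \mathrm{Transformer}(H^{l-1}), \quad l = 1, \ldots, L$$

The output vectors of the last layer are:

$$H^L = [h^L_{[S]}, h^L_1, \ldots, h^L_N]$$

which are used as the encoded representations for the image patches, where $h^L_i$ is the vector of the i-th image patch.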
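The construction of the Transformer input can be sketched as below. This is a rough NumPy illustration with randomly initialized parameters. The prepended special-token embedding `e_S` and the use of N + 1 position embeddings (one extra slot for that token) are assumptions that follow common ViT practice; all variable names are illustrative.

```python
import numpy as np

N, PATCH_DIM, D = 196, 768, 768  # number of patches, P*P*C, hidden size
rng = np.random.default_rng(0)

E = rng.normal(scale=0.02, size=(PATCH_DIM, D))  # linear patch projection
e_S = rng.normal(scale=0.02, size=(1, D))        # embedding of the special token (assumed)
E_pos = rng.normal(scale=0.02, size=(N + 1, D))  # learnable 1D position embeddings

def embed(patches: np.ndarray) -> np.ndarray:
    """Project flattened patches, prepend the special token, and add position embeddings."""
    tokens = patches @ E                              # (N, D) patch embeddings
    sequence = np.concatenate([e_S, tokens], axis=0)  # (N + 1, D)
    return sequence + E_pos

patches = rng.normal(size=(N, PATCH_DIM))
H0 = embed(patches)
print(H0.shape)  # (197, 768)
```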

ViT-Base is used, which is a 12-layer Transformer with a hidden size of 768 and 12 attention heads. The intermediate size of the feed-forward networks is 3072.
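As a sanity check on these sizes, the parameter count of the encoder blocks can be estimated with simple arithmetic. The sketch assumes standard Transformer blocks (four attention projections with biases, a two-layer feed-forward network, two LayerNorms) and leaves out the patch and position embeddings.

```python
LAYERS, HIDDEN, FFN = 12, 768, 3072

# Self-attention: query, key, value and output projections (weights + biases).
attention = 4 * (HIDDEN * HIDDEN + HIDDEN)
# Feed-forward network: HIDDEN -> FFN -> HIDDEN (weights + biases).
feed_forward = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)
# Two LayerNorms per block, each with a scale and a shift vector.
layer_norms = 2 * 2 * HIDDEN

per_layer = attention + feed_forward + layer_norms
total = LAYERS * per_layer
print(per_layer, total)  # 7087872 85054464
```

So the 12 blocks alone account for roughly 85M parameters under these assumptions.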

2. BEiT Pretraining: Masked Image Modeling (MIM)

2.1. Masked Image Modeling (MIM)

**BEiT Masked Image Modeling (MIM)** (Cut from the first figure)

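After splitting the image into patches as described above, approximately 40% of the image patches are randomly masked, where the masked positions are denoted as $\mathcal{M}$. The masked patches are replaced with a learnable embedding $e_{[M]}$. In BEiT, at most 75 patches are masked.
Then, the corrupted sequence of unmasked and masked patches is input into the L-layer Transformer.
A softmax classifier is used to predict the corresponding visual token at each masked position:

$$p_{\mathrm{MIM}}(z' \mid x^{\mathcal{M}}) = \mathrm{softmax}_{z'}(W_c h_i^L + b_c)$$

where $x^{\mathcal{M}}$ is the corrupted image, and $W_c$ and $b_c$ are the classifier weight and bias.
The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_i$ given the corrupted image:

$$\max \sum_{x \in \mathcal{D}} \mathbb{E}_{\mathcal{M}} \left[ \sum_{i \in \mathcal{M}} \log p_{\mathrm{MIM}}(z_i \mid x^{\mathcal{M}}) \right]$$

where $\mathcal{D}$ is the training corpus.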
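A minimal NumPy sketch of this objective is given below. It computes the negative log-likelihood of the true visual tokens at the masked positions only, so minimizing it is the same as maximizing the log-likelihood. The names `W_c`, `b_c` and `mim_loss` are chosen for illustration, and random hidden vectors stand in for the Transformer output.

```python
import numpy as np

def mim_loss(hidden, W_c, b_c, target_tokens, masked_positions):
    """Mean negative log-likelihood of the true visual tokens at the masked positions."""
    h = hidden[masked_positions]                            # (|M|, D)
    logits = h @ W_c.T + b_c                                # (|M|, |V|)
    logits = logits - logits.max(axis=-1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    rows = np.arange(len(masked_positions))
    return -log_probs[rows, target_tokens[masked_positions]].mean()

rng = np.random.default_rng(0)
N, D, V = 196, 768, 8192
hidden = rng.normal(size=(N, D))                # stand-in for the last-layer vectors
targets = rng.integers(0, V, size=N)            # stand-in for the tokenizer output
masked = rng.choice(N, size=75, replace=False)  # 75 masked positions

# With an all-zero classifier every token is equally likely, so the loss equals log|V|.
loss = mim_loss(hidden, np.zeros((V, D)), np.zeros(V), targets, masked)
print(loss, np.log(V))
```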

BEiT is pretrained on the training set of ImageNet-1K.
The pre-training runs for about 500k steps (i.e., 800 epochs) with 2k batch size. The 500k training steps take about five days using 16 Nvidia Tesla V100 32GB GPU cards.
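The step count can be checked with a quick calculation. The ImageNet-1K training set has 1,281,167 images. Reading “2k” as a batch size of 2048 is an assumption.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167
EPOCHS = 800
BATCH_SIZE = 2048  # assumption: "2k" read as 2048

steps_per_epoch = IMAGENET_1K_TRAIN_IMAGES / BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS  # roughly 500k, matching the text

gpu_days = 5 * 16  # five days on 16 GPUs
print(round(steps_per_epoch), round(total_steps), gpu_days)
```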

2.2. Blockwise Masking

Blocks of patches are masked randomly as shown in the figure and algorithm above, instead of masking each patch individually in a random manner.
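Blockwise masking can be sketched as follows: random rectangular blocks are drawn repeatedly until more than the target ratio of patches is covered. The minimum block size of 16 patches and the aspect-ratio range of 0.3 to 1/0.3 are constants recalled from the paper's algorithm and should be treated as assumptions, as should the clamping of block sides to the grid. Because it stops only after the ratio is exceeded, this sketch can mask slightly more patches than a hard cap would allow.

```python
import math
import random

def blockwise_mask(h=14, w=14, mask_ratio=0.4, min_block=16, seed=None):
    """Return a set of (row, col) patch positions masked in rectangular blocks."""
    rng = random.Random(seed)
    target = mask_ratio * h * w
    masked = set()
    while len(masked) <= target:
        remaining = target - len(masked)
        size = rng.uniform(min_block, max(min_block, remaining))  # block area in patches
        ratio = rng.uniform(0.3, 1 / 0.3)                         # block aspect ratio
        a = min(h, max(1, round(math.sqrt(size * ratio))))        # block height
        b = min(w, max(1, round(math.sqrt(size / ratio))))        # block width
        top = rng.randint(0, h - a)
        left = rng.randint(0, w - b)
        masked.update((i, j) for i in range(top, top + a) for j in range(left, left + b))
    return masked

mask = blockwise_mask(seed=0)
print(len(mask), "of", 14 * 14, "patches masked")
```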

2.3. From VAE Perspective

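The BEiT pre-training can be viewed as variational autoencoder training. Let $x$ be the original image, $\tilde{x}$ the masked image, and $z$ the visual tokens. The evidence lower bound of the log-likelihood $p(x \mid \tilde{x})$ is:

$$\sum_{(x_i, \tilde{x}_i) \in \mathcal{D}} \log p(x_i \mid \tilde{x}_i) \ge \sum_{(x_i, \tilde{x}_i) \in \mathcal{D}} \left( \mathbb{E}_{z_i \sim q_\phi(z \mid x_i)} \left[ \log p_\psi(x_i \mid z_i) \right] - D_{\mathrm{KL}}\left[ q_\phi(z \mid x_i), p_\theta(z \mid \tilde{x}_i) \right] \right)$$

In the first stage, the image tokenizer is obtained as a discrete variational autoencoder. Specifically, the first stage minimizes the reconstruction loss with a uniform prior.
In the second stage, the prior $p_\theta$ is learnt while keeping $q_\phi$ and $p_\psi$ fixed.
Thus, the above equation is re-written as:

$$\sum_{(x_i, \tilde{x}_i) \in \mathcal{D}} \left( \mathbb{E}_{z_i \sim q_\phi(z \mid x_i)} \left[ \log p_\psi(x_i \mid z_i) \right] + \log p_\theta(\hat{z}_i \mid \tilde{x}_i) \right)$$

where $\hat{z}_i = \arg\max_z q_\phi(z \mid x_i)$, and the second term is the proposed BEiT pre-training objective.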

3. Experimental Results

3.1. ImageNet-1K & ImageNet-22K Pretraining, Image Classification on ImageNet-1K

**Top-1 accuracy on ImageNet-1K using full fine-tuning**

A simple linear classifier is employed as the task layer. Average pooling is used to aggregate the patch representations, and the global representation is fed to a softmax classifier.
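The task layer described above is small enough to sketch directly. This is a minimal NumPy version with random weights, assuming 196 patch vectors of size 768 and 1000 ImageNet-1K classes; the names are illustrative.

```python
import numpy as np

def classify(patch_representations: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Average-pool the patch vectors, then apply a linear layer followed by softmax."""
    pooled = patch_representations.mean(axis=0)  # (D,) global representation
    logits = W @ pooled + b                      # (num_classes,)
    logits = logits - logits.max()               # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(0)
reps = rng.normal(size=(196, 768))               # stand-in for the encoder output
W = rng.normal(scale=0.02, size=(1000, 768))
b = np.zeros(1000)
probs = classify(reps, W, b)
print(probs.shape, probs.sum())
```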
Pre-trained BEiT significantly improves performance on both datasets.

BEiT improves the performance on ImageNet, which shows the effectiveness under the rich-resource setting.

Higher resolution improves the BEiT results by 1+ points on ImageNet.

More importantly, BEiT384 pretrained on ImageNet-1K even outperforms the supervised pre-trained ViT384 that uses ImageNet-22K, when both use the same input resolution.

**Convergence curves of training** **DeiT** **from scratch and fine-tuning BEiT on ImageNet-1K**

Fine-tuning BEiT not only achieves better performance, but also converges much faster than training DeiT from scratch.

3.2. Semantic Segmentation on ADE20K

**Results of semantic segmentation on** **ADE20K**

The task layer of SETR-PUP (Zheng et al., 2020) is used.
To be specific, the pretrained BEiT is used as the backbone encoder, and several deconvolution layers are incorporated as the decoder to produce the segmentation.

BEiT achieves better performance than supervised pretraining, although BEiT does not require manual annotations for pre-training.

Intermediate fine-tuning is performed for BEiT on ImageNet, i.e., the pretrained BEiT is first fine-tuned on ImageNet, and then the model is fine-tuned on ADE20K.

Intermediate fine-tuning further improves BEiT on semantic segmentation.

3.3. Ablation Study

**Ablation studies for BEiT pre-training on image classification and semantic segmentation**

Blockwise masking is beneficial on both tasks, especially on semantic segmentation.
The proposed masked image modeling (MIM) task significantly outperforms naïve pixel-level auto-encoding. The results indicate that the prediction of visual tokens is the key ingredient of BEiT.
Recovering all the visual tokens harms performance on downstream tasks.
Pre-training the model longer (800 epochs) can further improve performance on downstream tasks.

3.4. Analysis on Self-Attention Map

**Self-attention map for different reference points**

The self-attention mechanism in BEiT can separate objects.

After pre-training, BEiT learns to distinguish semantic regions using self-attention heads, without any task-specific supervision. Such knowledge acquired by BEiT potentially improves the generalization ability of fine-tuned models, especially on small-scale datasets.
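An attention map of this kind can be computed from the queries and keys of a single attention head. The sketch below assumes scaled dot-product attention over 196 patches with a head dimension of 64 (768 / 12). Random vectors stand in for the real projections, so the resulting map carries no semantics. It only shows the computation.

```python
import numpy as np

def attention_map(queries: np.ndarray, keys: np.ndarray, ref_index: int, grid: int = 14) -> np.ndarray:
    """Attention weights from one reference patch to all patches, reshaped to the patch grid."""
    d = queries.shape[-1]
    scores = keys @ queries[ref_index] / np.sqrt(d)  # (N,) scaled dot products
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights.reshape(grid, grid)

rng = np.random.default_rng(0)
q = rng.normal(size=(196, 64))  # per-patch queries of one head (stand-in)
k = rng.normal(size=(196, 64))  # per-patch keys of one head (stand-in)
amap = attention_map(q, k, ref_index=100)
print(amap.shape, amap.sum())
```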

4. Further Results Using LayerScale in CaiT and Relative Position in Shaw NAACL’18 (Paper Appendix)

4.1. Effects of LayerScale in CaiT & Relative Position in Shaw NAACL’18

**Ablation studies of architecture variants on image classification and semantic segmentation**

LayerScale in CaiT, and relative position bias in Shaw NAACL’18, improve performance on ImageNet classification and ADE20K semantic segmentation.
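For reference, the idea of LayerScale is a learnable per-channel scale on the output of each residual branch. Below is a minimal sketch with a toy sublayer standing in for attention or the feed-forward network. The initial value of 0.1 for the scale is an assumption, and relative position is not covered here.

```python
import numpy as np

def layerscale_residual(x, sublayer, gamma):
    """Residual connection whose branch output is scaled per channel: x + gamma * sublayer(x)."""
    return x + gamma * sublayer(x)

def toy_sublayer(t):
    """Stand-in for a self-attention or feed-forward sublayer."""
    return np.tanh(t)

D = 768
rng = np.random.default_rng(0)
x = rng.normal(size=(197, D))
gamma = np.full(D, 0.1)  # learnable in practice; this init value is an assumption
y = layerscale_residual(x, toy_sublayer, gamma)
print(y.shape)
```

With a scale of zero the block reduces to the identity, which is why a small initial scale keeps early training close to the residual path.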

4.2. ImageNet

**Top-1 accuracy on ImageNet-1K fine-tuning**

BEiT-L fine-tuned on ImageNet-22K achieves comparable performance with ViT-L trained on Google JFT-3B.

4.3. ADE20K

**Performance comparison on the** **ADE20K** **semantic segmentation**

The BEiT-L model obtains state-of-the-art performance on ADE20K, outperforming Swin Transformer.

DINO applies self-supervised learning to ViT using an idea similar to BYOL. BEiT instead borrows the BERT pretraining concept for self-supervised learning on ViT.

Reference

[2022 ICLR] [BEiT]
BEiT: BERT Pre-Training of Image Transformers

