Brief Review — iBOT: Image BERT Pre-Training with Online Tokenizer

Masked prediction with an online tokenizer

4 min read · Apr 14, 2024

**Masked image modeling**. I denotes an image and Tok. denotes a visual tokenizer.

iBOT: Image BERT Pre-Training with Online Tokenizer
iBOT, by ByteDance, Johns Hopkins University, Shanghai Jiao Tong University, UC Santa Cruz
2022 ICLR, Over 570 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM] [LDBM] [data2vec] [SEER 10B, RG-10B]

iBOT is proposed, which performs masked prediction with an online tokenizer.
Specifically, self-distillation is performed on masked patch tokens and the teacher network is taken as the online tokenizer.
The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pretrained beforehand.

Outline

1. iBOT (image BERT pre-training with Online Tokenizer)
2. Results

1. iBOT (image BERT pre-training with Online Tokenizer)

1.1. iBOT Framework

First, blockwise masking is performed on the two augmented views u and v to obtain their masked views û and v̂.
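
The masking step can be sketched in plain Python. This is a simplified illustration, not the paper's exact procedure: the function name `blockwise_mask`, the 14×14 patch grid, the 30% target ratio, and the block size and aspect-ratio ranges are all assumptions chosen for the example.

```python
import random

def blockwise_mask(grid=14, mask_ratio=0.3, min_block=4, seed=0):
    """Return a grid x grid list of 0/1 flags; 1 marks a masked patch.

    Simplified sketch: rectangular blocks are dropped at random positions
    until at least mask_ratio of the patches are covered. All defaults are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    mask = [[0] * grid for _ in range(grid)]
    target = int(mask_ratio * grid * grid)
    masked = 0
    while masked < target:
        # Sample a block area, an aspect ratio, then a top-left corner.
        area = rng.randint(min_block, max(min_block, target - masked))
        aspect = rng.uniform(0.3, 1 / 0.3)
        h = max(1, min(grid, round((area * aspect) ** 0.5)))
        w = max(1, min(grid, round((area / aspect) ** 0.5)))
        top = rng.randint(0, grid - h)
        left = rng.randint(0, grid - w)
        for r in range(top, top + h):
            for c in range(left, left + w):
                if mask[r][c] == 0:  # count each patch only once
                    mask[r][c] = 1
                    masked += 1
    return mask
```

The last block can overshoot the target slightly, so the returned mask covers at least the requested fraction of patches rather than exactly that fraction.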
Taking û as an example for simplicity, the student network takes the masked view û and outputs projections of its patch tokens, û_s^patch = P_θ^patch(û).

The teacher network takes the non-masked view u and outputs projections of its patch tokens, u_t^patch = P_θ'^patch(u).

The training objective of masked image modeling (MIM) in iBOT becomes:
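
Written out in the notation of the iBOT paper (a reconstruction, since the symbols are not defined elsewhere in this review): P_θ^patch is the student's output distribution for a patch token, P_θ'^patch is the teacher's, m_i ∈ {0, 1} flags whether patch i is masked, and N is the number of patches.

```latex
\mathcal{L}_{\mathrm{MIM}}
  = -\sum_{i=1}^{N} m_i \cdot
    P_{\theta'}^{\mathrm{patch}}(u_i)^{\mathsf{T}}
    \log P_{\theta}^{\mathrm{patch}}(\hat{u}_i)
```

Only masked positions (m_i = 1) contribute, and the teacher's distribution on the unmasked patch serves as the soft target for the student's prediction on the masked one.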

The loss is symmetrized by averaging with another cross-entropy (CE) term between v̂_s^patch and v_t^patch.
The backbone together with the projection head of the teacher network is, therefore, a visual tokenizer that generates online token distributions for each masked patch token.
The tokenizer used in iBOT is learned jointly with the MIM objective and does not need to be pre-trained in an extra stage. A bonus is that its domain knowledge is distilled from the current dataset rather than fixed to the dataset on which a separate tokenizer was pre-trained.
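
A minimal sketch of this masked-patch objective in plain Python follows. It assumes the head outputs are given as per-patch logit lists; the function names, the temperature values, and the choice to average over masked patches (rather than sum) are assumptions made for the example, not details taken from this review.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def mim_loss(student_logits, teacher_logits, mask, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between teacher and student token distributions.

    student_logits: per-patch head outputs for the masked view
    teacher_logits: per-patch head outputs for the non-masked view
    mask:           1 where the patch was masked for the student
    Temperatures are illustrative assumptions.
    """
    loss = 0.0
    for s, t, m in zip(student_logits, teacher_logits, mask):
        if not m:
            continue  # only masked patches contribute
        # The teacher acts as the online tokenizer: its soft distribution is the target.
        p_teacher = [math.exp(v) for v in log_softmax(t, t_teacher)]
        log_p_student = log_softmax(s, t_student)
        loss -= sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))
    return loss / max(1, sum(mask))  # average over masked patches (assumption)
```

Under these assumptions, the symmetrized loss would be `0.5 * (mim_loss(u_hat_s, u_t, mask_u) + mim_loss(v_hat_s, v_t, mask_v))`, where the four logit arguments are hypothetical names for the student and teacher outputs on each view.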
To ensure that the online tokenizer is semantically meaningful, self-distillation is performed on the [CLS] tokens of cross-view images.
In practice, iBOT works with the L_[CLS] loss proposed in DINO:
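
In the same notation as the patch-level loss, and reconstructed from the DINO and iBOT papers rather than from this review, the [CLS] term is a cross-entropy between the teacher's [CLS] distribution on one view and the student's on the other. P_θ^[CLS] and P_θ'^[CLS] denote the student and teacher output distributions for the [CLS] token.

```latex
\mathcal{L}_{[\mathrm{CLS}]}
  = -P_{\theta'}^{[\mathrm{CLS}]}(v)^{\mathsf{T}}
    \log P_{\theta}^{[\mathrm{CLS}]}(u)
```

As with the patch-level loss, this term is symmetrized over the two views. In iBOT the student is fed the masked view, so û takes the place of u on the student side.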

L_[CLS] and L_MIM are summed without scaling.
The parameters of the projection heads for the [CLS] token and the patch tokens are shared, which is found to work better than using separate heads.
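
The shared head and the unscaled sum can be illustrated with a small sketch. A single linear map stands in for the shared projection head, temperatures are omitted, and every name here (`shared_head`, `total_loss`, the weight arguments) is hypothetical.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def shared_head(token, weight):
    """One linear map reused for [CLS] and patch tokens (stand-in for the shared MLP head)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]

def cross_entropy(teacher_logits, student_logits):
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits)]
    log_p_student = log_softmax(student_logits)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))

def total_loss(cls_s, cls_t, patches_s, patches_t, mask, w_student, w_teacher):
    """L_[CLS] + L_MIM with one head per network applied to both token types."""
    l_cls = cross_entropy(shared_head(cls_t, w_teacher), shared_head(cls_s, w_student))
    masked = [i for i, m in enumerate(mask) if m]
    l_mim = sum(
        cross_entropy(shared_head(patches_t[i], w_teacher),
                      shared_head(patches_s[i], w_student))
        for i in masked
    ) / max(1, len(masked))
    return l_cls + l_mim  # summed without any weighting factor
```

Because the same weights produce both the [CLS] and the patch distributions, the semantics learned through the [CLS] self-distillation are carried over to the patch-level token distributions.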

1.2. Models

Vision Transformers and Swin Transformers with different numbers of parameters (ViT-S/16, ViT-B/16, ViT-L/16, and Swin-T/{7,14}) are used as the backbone f.
The projection head h is a 3-layer MLP with an l2-normalized bottleneck, following DINO.
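
A rough sketch of such a head in plain Python: three linear layers with GELU in between, l2 normalization of the bottleneck output, then a final linear layer that produces the token logits. The layer sizes are tiny placeholders, the initialization is arbitrary, and the final layer is a plain linear map here, all of which are simplifying assumptions rather than the configuration used in the paper.

```python
import math
import random

def linear(x, weight, bias):
    """y = W x + b for one input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def l2_normalize(x, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]

def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

class ProjectionHead:
    """3-layer MLP -> l2-normalized bottleneck -> linear layer to output logits."""

    def __init__(self, in_dim=8, hidden_dim=16, bottleneck_dim=4, out_dim=32, seed=0):
        rng = random.Random(seed)
        self.mlp = [init_layer(in_dim, hidden_dim, rng),
                    init_layer(hidden_dim, hidden_dim, rng),
                    init_layer(hidden_dim, bottleneck_dim, rng)]
        self.last = init_layer(bottleneck_dim, out_dim, rng)

    def bottleneck(self, token):
        h = token
        for i, (w, b) in enumerate(self.mlp):
            h = linear(h, w, b)
            if i < len(self.mlp) - 1:  # no activation after the bottleneck layer
                h = [gelu(v) for v in h]
        return l2_normalize(h)

    def __call__(self, token):
        w, b = self.last
        return linear(self.bottleneck(token), w, b)
```

Calling `ProjectionHead()(token)` on an 8-dimensional token returns 32 logits, which a softmax would turn into the token distribution used by the losses.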
ImageNet-1K or ImageNet-22K is used for self-supervised pretraining.

2. Results

2.1. Classification on ImageNet-1K

Table 1: The linear probing accuracy of 79.5% with ViT-B/16 is comparable to the 79.8% of SimCLRv2 but with 10× fewer parameters. The gain over DINO grows with model size (0.9% with ViT-S versus 1.3% with ViT-B), suggesting that iBOT scales better to larger models.
Table 2: iBOT achieves 82.3%, 84.0%, and 84.8% top-1 accuracy with ViT-S/16, ViT-B/16, and ViT-L/16, respectively.
Table 3: iBOT pre-trained with ImageNet-22K achieves 84.4% and 86.6% top-1 accuracy with ViT-B/16 and ViT-L/16, respectively, outperforming ImageNet-22K pre-trained BEiT by 0.7% and 0.6%.

**Semi-Supervised, Self-Supervised, Downstream Tasks**

Table 4 (Semi-Supervised): iBOT improves on DINO by 1.6% and 0.8% when using 1% and 10% of the labeled data, respectively, suggesting higher label efficiency.
Table 5 (Self-Supervised): iBOT achieves a 32.8% NMI, outperforming the previous state of the art by 1.8%.
Table 6 (Downstream Tasks): iBOT improves ViT-S’s APb from 46.2 to 49.4 and APm from 40.1 to 42.6, surpassing both supervised Swin-T and its self-supervised counterpart by a nontrivial margin. With ViT-B/16, iBOT achieves an APb of 51.2 and an APm of 44.2, surpassing previous best results by a large margin.


Written by Sik-Ho Tsang