Brief Review — iBOT: Image BERT Pre-Training with Online Tokenizer

Masked prediction with an online tokenizer

Sik-Ho Tsang
Apr 14, 2024
Masked image modeling. I denotes an image and Tok. denotes a visual tokenizer.

iBOT: Image BERT Pre-Training with Online Tokenizer, by ByteDance, Johns Hopkins University, Shanghai Jiao Tong University, UC Santa Cruz
2022 ICLR, Over 570 Citations (Sik-Ho Tsang @ Medium)


  • iBOT is proposed, which performs masked prediction with an online tokenizer.
  • Specifically, self-distillation is performed on masked patch tokens and the teacher network is taken as the online tokenizer.
  • The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pretrained beforehand.


  1. iBOT (image BERT pre-training with Online Tokenizer)
  2. Results

1. iBOT (image BERT pre-training with Online Tokenizer)

1.1. iBOT Framework

iBOT Framework
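  • First, blockwise masking is performed on the two augmented views $u$ and $v$ to obtain their masked views $\hat{u}$ and $\hat{v}$.
  • Taking $\hat{u}$ as an example for simplicity, the student network (parameters $\theta$) outputs, for the masked view $\hat{u}$, the projections of its patch tokens: $\hat{u}_s^{patch} = P_{\theta}^{patch}(\hat{u})$.
  • The teacher network (parameters $\theta'$) outputs, for the non-masked view $u$, the projections of its patch tokens: $u_t^{patch} = P_{\theta'}^{patch}(u)$.
  • The training objective of masked image modeling (MIM) in iBOT becomes: $\mathcal{L}_{MIM} = -\sum_{i=1}^{N} m_i \cdot P_{\theta'}^{patch}(u_i)^{T} \log P_{\theta}^{patch}(\hat{u}_i)$, where $N$ is the number of patch tokens and $m_i \in \{0, 1\}$ indicates whether patch $i$ is masked.
  • The loss is symmetrized by averaging with another cross-entropy (CE) term between $\hat{v}_s^{patch}$ and $v_t^{patch}$.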
  • The backbone together with the projection head of the teacher network is, therefore, a visual tokenizer that generates online token distributions for each masked patch token.
  • The tokenizer used in iBOT is learned jointly with the MIM objective, so it does not need to be pre-trained in an extra stage. A bonus is that its domain knowledge is distilled from the current dataset rather than fixed to a pre-specified one.
  • To ensure that the online tokenizer is semantically meaningful, self-distillation is performed on the [CLS] token of cross-view images.
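  • In practice, iBOT works with the $\mathcal{L}_{[CLS]}$ proposed in DINO, a cross-entropy between the teacher's [CLS] distribution for one view and the student's [CLS] distribution for the other: $\mathcal{L}_{[CLS]} = -P_{\theta'}^{[CLS]}(v)^{T} \log P_{\theta}^{[CLS]}(u)$, except that in iBOT the student receives the masked view $\hat{u}$ in place of $u$.
  • $\mathcal{L}_{[CLS]}$ and $\mathcal{L}_{MIM}$ are summed up without scaling.
  • The parameters of the projection heads for the [CLS] token and the patch tokens are shared. This is found to work better than using separate heads.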
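To make the two loss terms concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the authors' implementation: the function names and the dictionary layout of each view are hypothetical, the logits are toy numbers, the [CLS] term is symmetrized across views by assumption, and details used in practice (for example temperature scaling of the softmax) are left out.

```python
import math

def softmax(logits):
    """Convert one token's K logits into a distribution over K visual tokens."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(teacher_logits, student_logits):
    """-sum_k t_k * log(s_k) between teacher and student distributions."""
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return -sum(tk * math.log(sk) for tk, sk in zip(t, s))

def mim_loss(teacher_patches, student_patches, mask):
    """Sum the CE over masked positions only (mask[i] == 1)."""
    return sum(
        m * cross_entropy(t, s)
        for t, s, m in zip(teacher_patches, student_patches, mask)
    )

def ibot_loss(view_u, view_v):
    """Total loss for two views; the dict layout is a hypothetical convention."""
    # Symmetrized MIM: average the per-view terms.
    l_mim = 0.5 * (
        mim_loss(view_u["teacher_patches"], view_u["student_patches"], view_u["mask"])
        + mim_loss(view_v["teacher_patches"], view_v["student_patches"], view_v["mask"])
    )
    # Cross-view [CLS] self-distillation (symmetrization is an assumption here):
    # the teacher sees one view, the student sees the other (masked) view.
    l_cls = 0.5 * (
        cross_entropy(view_v["teacher_cls"], view_u["student_cls"])
        + cross_entropy(view_u["teacher_cls"], view_v["student_cls"])
    )
    return l_cls + l_mim  # summed without scaling

# Toy example: 3 patches per view, K = 2 output dimensions.
view_u = {
    "teacher_patches": [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "student_patches": [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]],
    "mask": [1, 0, 1],
    "teacher_cls": [1.0, 0.0],
    "student_cls": [0.5, 0.0],
}
view_v = {
    "teacher_patches": [[0.0, 2.0], [1.0, 0.0], [0.0, 0.0]],
    "student_patches": [[0.0, 1.0], [1.0, 1.0], [0.5, 0.0]],
    "mask": [0, 1, 1],
    "teacher_cls": [0.8, 0.2],
    "student_cls": [0.3, 0.1],
}
total = ibot_loss(view_u, view_v)
```

Only masked positions contribute to the MIM term, so changing the student's output at an unmasked patch leaves the loss unchanged, while the [CLS] term ties the two views together.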

1.2. Models

  • Vision Transformers and Swin Transformers with different numbers of parameters (ViT-S/16, ViT-B/16, ViT-L/16, and Swin-T/{7,14}) are used as the backbone f.
  • The projection head h is a 3-layer MLP with an l2-normalized bottleneck, following DINO.
  • ImageNet-1K or ImageNet-22K is used for self-supervised pretraining.
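As a rough illustration of the projection head h, the sketch below builds a 3-layer MLP, l2-normalizes its bottleneck output, and maps it to K output dimensions with a final linear layer. This reflects my reading of the DINO-style head and should be treated as an assumption: the class name, the tiny dimensions, the GELU activations, and the random initialization are all illustrative, and refinements such as weight normalization of the last layer are omitted. The same head instance is applied to the [CLS] token and to every patch token, mirroring the shared-head design described above.

```python
import math
import random

def linear(x, weight, bias=None):
    """y = W x (+ b), with W stored as a list of rows."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

class ProjectionHead:
    """Hypothetical head: 3-layer MLP -> l2-normalized bottleneck -> K logits."""

    def __init__(self, in_dim, hidden_dim, bottleneck_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.mlp = [
            (random_matrix(hidden_dim, in_dim, rng), [0.0] * hidden_dim),
            (random_matrix(hidden_dim, hidden_dim, rng), [0.0] * hidden_dim),
            (random_matrix(bottleneck_dim, hidden_dim, rng), [0.0] * bottleneck_dim),
        ]
        self.last = random_matrix(out_dim, bottleneck_dim, rng)

    def bottleneck(self, token):
        x = token
        for i, (w, b) in enumerate(self.mlp):
            x = linear(x, w, b)
            if i < len(self.mlp) - 1:  # no activation after the third layer
                x = [gelu(v) for v in x]
        return l2_normalize(x)

    def __call__(self, token):
        return linear(self.bottleneck(token), self.last)

# One shared head for the [CLS] token and all patch tokens (toy sizes).
rng = random.Random(1)
head = ProjectionHead(in_dim=8, hidden_dim=16, bottleneck_dim=4, out_dim=10)
cls_token = [rng.uniform(-1.0, 1.0) for _ in range(8)]
patch_tokens = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(3)]
cls_logits = head(cls_token)
patch_logits = [head(p) for p in patch_tokens]
```

A softmax over each 10-dimensional output would then give the token distribution that the teacher provides as a target and the student tries to match.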

2. Results

2.1. Classification on ImageNet-1K

Classification on ImageNet-1K
  • Table 1: The linear probing accuracy of 79.5% with ViT-B/16 is comparable to the 79.8% of SimCLRv2 but with 10× fewer parameters. The performance gain over DINO grows with model size (0.9% with ViT-S versus 1.3% with ViT-B), suggesting iBOT is more scalable to larger models.
  • Table 2: iBOT achieves 82.3%, 84.0%, and 84.8% top-1 accuracy with ViT-S/16, ViT-B/16, and ViT-L/16, respectively.
  • Table 3: iBOT pre-trained with ImageNet-22K achieves 84.4% and 86.6% top-1 accuracy with ViT-B/16 and ViT-L/16, respectively, outperforming ImageNet-22K pre-trained BEiT by 0.7% and 0.6%.
Semi-Supervised, Self-Supervised, Downstream Tasks
  • Table 4 (Semi-Supervised): iBOT advances DINO by 1.6% and 0.8% using 1% and 10% data, respectively, suggesting a higher label efficiency.
  • Table 5 (Self-Supervised): iBOT achieves a 32.8% NMI, outperforming the previous state of the art by 1.8%.
  • Table 6 (Downstream Tasks): iBOT improves ViT-S’s AP^b (box AP) from 46.2 to 49.4 and AP^m (mask AP) from 40.1 to 42.6, surpassing both supervised Swin-T and its self-supervised counterpart by a nontrivial margin. With ViT-B/16, iBOT achieves an AP^b of 51.2 and an AP^m of 44.2, surpassing previous best results by a large margin.
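For readers unfamiliar with the NMI metric quoted for Table 5, the following is a small sketch of normalized mutual information between two labelings, for example ground-truth classes and cluster assignments. It assumes the common arithmetic-mean normalization, NMI = 2·I(A;B) / (H(A) + H(B)); which normalization the paper's evaluation uses is not stated here, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def nmi(a, b):
    """NMI with arithmetic-mean normalization; 1.0 if both labelings are constant."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0
    return 2.0 * mutual_information(a, b) / (h_a + h_b)

# Cluster IDs are arbitrary, so a relabeled but identical partition scores 1.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that carries no information about the classes scores 0.
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on how items are grouped, not on the cluster IDs themselves, it suits evaluating clusters produced without labels.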


