Brief Review — iBOT: Image BERT Pre-Training with Online Tokenizer
Masked prediction with an online tokenizer
4 min read · Apr 14, 2024
iBOT: Image BERT Pre-Training with Online Tokenizer
iBOT, by ByteDance, Johns Hopkins University, Shanghai Jiao Tong University, UC Santa Cruz
2022 ICLR, Over 570 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM] [LDBM] [data2vec] [SEER 10B, RG-10B]
==== My Other Paper Readings Are Also Over Here ====
- iBOT is proposed, which performs masked prediction with an online tokenizer.
- Specifically, self-distillation is performed on masked patch tokens and the teacher network is taken as the online tokenizer.
- The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pretrained beforehand.
Outline
- iBOT (image BERT pre-training with Online Tokenizer)
- Results
1. iBOT (image BERT pre-training with Online Tokenizer)
1.1. iBOT Framework
- First, blockwise masking is performed on the two augmented views u and v to obtain their masked views û and v̂.
- Taking û as an example for simplicity, the student network projects the patch tokens of the masked view û into û_s^patch = P_θ^patch(û).
- The teacher network projects the patch tokens of the non-masked view u into u_t^patch = P_θ'^patch(u).
- The training objective of masked image modeling (MIM) in iBOT becomes the cross-entropy between the two distributions over the masked positions: L_MIM = −Σ_{i=1}^{N} m_i · P_θ'^patch(u_i)^T log P_θ^patch(û_i), where m_i ∈ {0, 1} indicates whether patch i is masked.
- The loss is symmetrized by averaging with another CE term between v̂^patch and v^patch.
- The backbone together with the projection head of the teacher network therefore acts as a visual tokenizer that generates online token distributions for each masked patch token.
- The tokenizer used in iBOT is jointly learnable with the MIM objective, without needing to be pre-trained in an extra stage; as a bonus, its domain knowledge is distilled from the current dataset rather than being fixed to a specific dataset.
- To ensure that the online tokenizer is semantically meaningful, self-distillation is performed on the [CLS] tokens of cross-view images.
- In practice, iBOT works with the L_[CLS] loss proposed in DINO: L_[CLS] = −P_θ'^[CLS](v)^T log P_θ^[CLS](û), symmetrized across the two views.
- L_[CLS] and L_MIM are summed without scaling (see the sketch after this list).
- The parameters of the projection heads for the [CLS] token and the patch tokens are shared, which is found to be better than using separate heads.
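As a rough illustration of how these pieces fit together, below is a minimal PyTorch-style sketch of the two objectives. The `student` and `teacher` callables, their `mask=` argument, and the blockwise mask tensors are assumptions for illustration, not the official implementation; DINO details such as teacher centering and the EMA teacher update are omitted.

```python
import torch
import torch.nn.functional as F

def ibot_loss(student, teacher, u, v, mask_u, mask_v, tau_s=0.1, tau_t=0.04):
    """Sketch of L_[CLS] + L_MIM (hypothetical interfaces).

    student/teacher: backbone + shared projection head, returning
    ([CLS] logits of shape [B, K], patch logits of shape [B, N, K]).
    mask_u/mask_v: boolean blockwise masks of shape [B, N].
    """
    # Student sees the masked views (masked patches assumed to be replaced by a
    # learnable [MASK] token inside the model); teacher sees the full views.
    s_cls_u, s_patch_u = student(u, mask=mask_u)   # from û
    s_cls_v, s_patch_v = student(v, mask=mask_v)   # from v̂
    with torch.no_grad():
        t_cls_u, t_patch_u = teacher(u)
        t_cls_v, t_patch_v = teacher(v)

    def ce(t_logits, s_logits):
        # Cross-entropy between the sharpened teacher and student distributions.
        t = F.softmax(t_logits / tau_t, dim=-1)
        return -(t * F.log_softmax(s_logits / tau_s, dim=-1)).sum(-1)

    # L_MIM: only masked patch positions contribute; symmetrized over both views.
    l_mim_u = (ce(t_patch_u, s_patch_u) * mask_u).sum() / mask_u.sum()
    l_mim_v = (ce(t_patch_v, s_patch_v) * mask_v).sum() / mask_v.sum()
    l_mim = 0.5 * (l_mim_u + l_mim_v)

    # L_[CLS]: DINO-style cross-view self-distillation on the [CLS] token.
    l_cls = 0.5 * (ce(t_cls_v, s_cls_u).mean() + ce(t_cls_u, s_cls_v).mean())

    # The two losses are summed without scaling.
    return l_cls + l_mim
```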
1.2. Models
- Vision Transformers and Swin Transformers with different numbers of parameters, i.e., ViT-S/16, ViT-B/16, ViT-L/16, and Swin-T/{7,14}, are used as the backbone f.
- The projection head h is a 3-layer MLP with an l2-normalized bottleneck, following DINO (a sketch follows this list).
- ImageNet-1K or ImageNet-22K is used for self-supervised pretraining.
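For concreteness, here is a minimal sketch of such a head in PyTorch; the hidden and bottleneck dimensions (2048/256) are illustrative assumptions, not values taken from the post.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """DINO-style head sketch: 3-layer MLP, l2-normalized bottleneck,
    then a weight-normalized linear layer producing the token logits."""

    def __init__(self, in_dim, out_dim, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False)
        )

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, dim=-1, p=2)   # l2-normalized bottleneck
        return self.last_layer(x)         # logits over the online "visual vocabulary"
```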
2. Results
2.1. Classification on ImageNet-1K
- Table 1: The linear probing accuracy of 79.5% with ViT-B/16 is comparable to the 79.8% of SimCLRv2, but with 10× fewer parameters. The performance gain over DINO grows with model size (0.9% w/ ViT-S versus 1.3% w/ ViT-B), suggesting iBOT is more scalable to larger models.
- Table 2: iBOT achieves an 82.3%, 84.0%, and 84.8% top-1 accuracy with ViT-S/16, ViT-B/16, and ViT-L/16, respectively.
- Table 3: iBOT pre-trained with ImageNet-22K achieves 84.4% and 86.6% top-1 accuracy with ViT-B/16 and ViT-L/16, respectively, outperforming ImageNet-22K pre-trained BEiT by 0.7% and 0.6%.
- Table 4 (Semi-Supervised): iBOT outperforms DINO by 1.6% and 0.8% using 1% and 10% of the data, respectively, suggesting higher label efficiency.
- Table 5 (Self-Supervised): iBOT achieves a 32.8% NMI, outperforming the previous state of the art by 1.8%.
- Table 6 (Downstream Tasks): iBOT improves ViT-S's box AP (APb) from 46.2 to 49.4 and mask AP (APm) from 40.1 to 42.6, surpassing both supervised Swin-T and its self-supervised counterpart by a nontrivial margin. With ViT-B/16, iBOT achieves an APb of 51.2 and an APm of 44.2, surpassing the previous best results by a large margin.