Review — SimMIM: A Simple Framework for Masked Image Modeling
SimMIM, Directly Predicting Masked Pixels
SimMIM: A Simple Framework for Masked Image Modeling,
SimMIM, by Tsinghua University, Microsoft Research Asia, and Xi’an Jiaotong University,
2022 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)
- A self-supervised learning approach, SimMIM, is proposed where:
- Random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pre-text task.
- Predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs.
- The prediction head can be as light as a linear layer, with no worse performance than heavier ones.
Outline
- SimMIM
- Results
1. SimMIM
1.1. Masking Strategy
- A patch-aligned random masking strategy is used, since image patches are the basic processing units of vision Transformers: it is convenient to operate masking at the patch level, so that a patch is either fully visible or fully masked.
- For Swin Transformer, masked patch sizes equal to the patch sizes of its different resolution stages (4×4 to 32×32) are considered, and 32×32, the patch size of the last stage, is adopted by default.
- For ViT, 32×32 is adopted as the default masked patch size.
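A minimal sketch of this patch-aligned random masking, assuming a 192×192 input and a 32×32 masked patch size (the function name and details are illustrative, not the official implementation):

```python
import torch

def random_patch_mask(img_size=192, mask_patch_size=32, mask_ratio=0.6):
    """Patch-aligned random masking: every mask_patch_size x mask_patch_size
    block is either fully visible (0) or fully masked (1)."""
    grid = img_size // mask_patch_size           # e.g. 192 // 32 = 6
    num_patches = grid * grid
    num_masked = int(num_patches * mask_ratio)

    # pick which patches to mask uniformly at random
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches)
    mask[perm[:num_masked]] = 1.0

    # expand the patch-level mask to pixel resolution
    mask = mask.reshape(grid, grid)
    mask = mask.repeat_interleave(mask_patch_size, dim=0)
    mask = mask.repeat_interleave(mask_patch_size, dim=1)
    return mask                                  # (img_size, img_size) of {0, 1}

mask = random_patch_mask()
print(mask.shape, mask.mean())                   # torch.Size([192, 192]), ~0.6
```

In the framework, the embeddings of masked patches are replaced by a learnable mask token before being fed to the encoder.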
1.2. Prediction Head
- Some early works follow autoencoders in employing a heavy prediction head (decoder).
- In this paper, the prediction head is made extremely lightweight, as light as a single linear layer. Heavier heads are also tried, such as a 2-layer MLP, an inverse Swin-T, and an inverse Swin-B.
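A sketch of the lightest head, assuming Swin-B's last-stage channel dimension of 1024 and a 32× output stride: the 1×1 convolution acts as the per-token linear layer, and PixelShuffle rearranges its output back to pixel resolution.

```python
import torch
import torch.nn as nn

encoder_dim, stride = 1024, 32               # assumed: Swin-B last stage, stride 32
head = nn.Sequential(
    nn.Conv2d(encoder_dim, stride * stride * 3, kernel_size=1),  # per-token linear layer
    nn.PixelShuffle(stride),                                     # rearrange to pixel resolution
)

feats = torch.randn(2, encoder_dim, 6, 6)    # encoder output for 192x192 inputs
pred = head(feats)                           # (2, 3, 192, 192) predicted RGB values
print(pred.shape)
```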
1.3. Prediction Targets
- The prediction target is the raw RGB values of the masked pixels, regressed directly with an ℓ1 loss (see the ablation in Section 2.1).
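A minimal sketch of the resulting training objective, assuming the loss is averaged over masked pixels only (the function name and the normalization constant are illustrative):

```python
import torch
import torch.nn.functional as F

def simmim_loss(pred, target, mask):
    """l1 regression on raw RGB values, computed on masked pixels only.

    pred, target: (B, 3, H, W) predicted and original images
    mask:         (B, 1, H, W), 1 for masked pixels, 0 for visible ones
    """
    loss = F.l1_loss(pred, target, reduction="none")
    return (loss * mask).sum() / (mask.sum() * 3 + 1e-8)   # mean over masked pixels and channels

pred, target = torch.randn(2, 3, 192, 192), torch.randn(2, 3, 192, 192)
mask = (torch.rand(2, 1, 192, 192) > 0.4).float()
print(simmim_loss(pred, target, mask))
```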
2. Results
2.1. Ablation Study
- Swin-B is used as the default backbone.
- The best accuracy of 83.0% is obtained with the proposed simple random masking strategy, which is +0.3% higher than the best of the other strategies.
- When a large masked patch size of 32 is adopted, this simple strategy performs stably well on a broad range of masking ratios (10%-70%).
- (a) AvgDist, the averaged Euclidean distance from masked pixels to the nearest visible pixels (sketched in the code below), increases smoothly with the masking ratio for all masking strategies.
- When the masked patch size is small, e.g., 4 or 8, AvgDist is relatively low and grows slowly with increasing masking ratios.
- When the patch size is large, e.g., 64, even a very small masking ratio (e.g., 10%) results in a relatively large AvgDist.
- (b) A moderate prediction distance, neither too large nor too small, is preferred in masked image modeling.
- A masking ratio of 0.6 with a masked patch size of 32 is used by default, due to its stable performance.
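A hedged sketch of how AvgDist could be computed for a given mask, assuming it is the averaged Euclidean distance from each masked pixel to its nearest visible pixel:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def avg_dist(mask):
    """mask: 2-D array with 1 = masked pixel, 0 = visible pixel."""
    # for every nonzero (masked) pixel, distance_transform_edt returns
    # the Euclidean distance to the nearest zero (visible) pixel
    dist = distance_transform_edt(mask)
    return dist[mask.astype(bool)].mean()

# e.g. a 192x192 mask built from 32x32 masked patches at ~60% masking ratio
mask = np.kron(np.random.rand(6, 6) < 0.6, np.ones((32, 32)))
print(avg_dist(mask))
```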
- The training cost of using an inverse Swin-B head is 2.3× that of a linear layer.
- Under the fine-tuning metric, a single linear-layer head shows competitive or even the best transfer performance.
- If the aim is to learn good features for fine-tuning, the extensive head designs explored in contrastive learning approaches may not be necessary for masked image modeling.
- The transfer performance drops only at a low target resolution of 6², probably because this option discards too much information.
- A target resolution of 192² is adopted by default, since it achieves equally good transfer accuracy with negligible computation overhead.
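A small illustrative sketch of building a lower-resolution regression target by downsampling the original image (average pooling is assumed here purely for illustration, not the paper's exact downsampling choice):

```python
import torch
import torch.nn.functional as F

img = torch.randn(2, 3, 192, 192)            # original 192x192 images
target_96 = F.adaptive_avg_pool2d(img, 96)   # 96x96 regression target
target_6 = F.adaptive_avg_pool2d(img, 6)     # 6x6 target: most detail is discarded
print(target_96.shape, target_6.shape)
```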
- The three losses, ℓ1, smooth-ℓ1, and ℓ2, perform similarly well; ℓ1 is adopted by default, aligning the approach with the continuous nature of visual signals.
- Predicting only the masked area performs significantly better than recovering all image pixels: 82.8% vs. 81.7%.
2.2. SOTA Comparisons
- SimMIM using ViT-B obtains the best results.
- The training of SimMIM is 2.0×, 1.8×, ∼4.0×, and 1.5× more efficient than that of DINO, MoCo v3, ViT, and BEiT, respectively.
2.3. Scaling Experiments with Swin Transformer
- With SimMIM pre-training, all of Swin-B, Swin-L, and SwinV2-H achieve significantly higher accuracy than their supervised counterparts.
- A 3B-parameter SwinV2-G model is trained using ∼40× less data than JFT-3B and obtains 90.2% accuracy.
2.4. Visualizations
- By randomly masking moderate parts of the major object, both the shape and texture of the masked parts can be well recovered, as shown by the penguin, the mountain, the sailboat, and the persons.
- By masking most of a major object (more than 90%), the model can still infer the existence of the object from the few remaining clues.
- Reconstructing the whole image area looks better; however, model capacity is probably wasted on recovering the unmasked area, which may not be useful for fine-tuning.
- Details are recovered much better when the masked patch size is smaller; however, the learnt representations transfer worse, probably because with a smaller patch size the prediction task can be easily accomplished from nearby pixels or textures.