Review — SimMIM: a Simple Framework for Masked Image Modeling

SimMIM, Directly Predicting Masked Pixels

Sik-Ho Tsang
Apr 3, 2023

SimMIM: A Simple Framework for Masked Image Modeling,
SimMIM, by Tsinghua University, Microsoft Research Asia, and Xi’an Jiaotong University,
2022 CVPR, Over 300 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT]

  • A self-supervised learning approach, SimMIM, is proposed, in which:
  1. Random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pretext task.
  2. Predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs.
  3. The prediction head can be as light as a linear layer, with no worse performance than heavier ones.


  1. SimMIM
  2. Results

1. SimMIM

An illustration of the simple framework for masked image modeling, named SimMIM.

1.1. Masking Strategy

Illustration of the masked areas generated by different masking strategies, using the same mask ratio of 0.6 with different patch sizes (e.g., 4, 8, 16 and 32).
  • A patch-aligned random masking strategy is used. Image patches are the basic processing units of vision Transformers, so it is convenient to mask at the patch level, where each patch is either fully visible or fully masked.
  • For Swin Transformer, the equivalent patch sizes of the different resolution stages, 4×4∼32×32, are considered, and 32×32, the patch size of the last stage, is adopted by default.
  • For ViT, 32×32 is adopted as the default masked patch size.
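The patch-aligned masking described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `patch_aligned_random_mask`, the `seed` argument, and rounding the number of masked patches up are hypothetical choices, not the authors' code. The 192×192 example size is picked to match the 192² resolution mentioned in the ablations.

```python
import math
import random

def patch_aligned_random_mask(img_size=192, mask_patch_size=32, mask_ratio=0.6, seed=0):
    """Return an img_size x img_size grid of 0/1 flags (1 = masked pixel).

    Masking is decided per patch, so every patch is either fully
    visible or fully masked.
    """
    assert img_size % mask_patch_size == 0, "image must tile evenly into patches"
    grid = img_size // mask_patch_size            # patches per side
    num_patches = grid * grid
    # Assumption: round the masked-patch count up.
    num_masked = math.ceil(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    # Expand the patch-level decision to pixels: each pixel inherits its patch's flag.
    return [
        [
            1 if (r // mask_patch_size) * grid + (c // mask_patch_size) in masked_ids else 0
            for c in range(img_size)
        ]
        for r in range(img_size)
    ]

mask = patch_aligned_random_mask()
masked_fraction = sum(map(sum, mask)) / (192 * 192)
```

With a 192×192 image and 32×32 masked patches there are 36 patches, so a ratio of 0.6 masks 22 of them and `masked_fraction` comes out slightly above 0.6.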

1.2. Prediction Head

  • Some early works follow AutoEncoders to employ a heavy prediction head (decoder).
  • In this paper, the prediction head is made extremely lightweight, as light as a linear layer. Heavier heads, such as a 2-layer MLP, an inverse Swin-T, and an inverse Swin-B, are also tried.
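To make "as light as a linear layer" concrete, here is a minimal sketch in plain Python. It assumes that each encoder feature vector is responsible for a stride × stride block of RGB pixels, so the linear layer outputs stride·stride·3 values per vector. The function names, the tiny dimensions, and the ordering of values within a block are illustrative assumptions rather than the official implementation.

```python
import random

def linear_head(features, weight, bias):
    """Apply y = W f + b to every feature vector f in an h x w feature map."""
    return [
        [
            [sum(wk * fk for wk, fk in zip(w_row, f)) + b for w_row, b in zip(weight, bias)]
            for f in row
        ]
        for row in features
    ]

def to_pixels(outputs, stride):
    """Rearrange each (stride*stride*3)-vector into a stride x stride block of RGB tuples.

    Assumption: values are laid out row by row within the block, RGB interleaved.
    """
    h, w = len(outputs), len(outputs[0])
    img = [[None] * (w * stride) for _ in range(h * stride)]
    for i in range(h):
        for j in range(w):
            vec = outputs[i][j]
            for dy in range(stride):
                for dx in range(stride):
                    k = (dy * stride + dx) * 3
                    img[i * stride + dy][j * stride + dx] = tuple(vec[k:k + 3])
    return img

# Tiny example: a 2 x 2 feature map with 4 channels and stride 2 gives a 4 x 4 RGB image.
rng = random.Random(0)
dim, stride = 4, 2
out_dim = stride * stride * 3
weight = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(out_dim)]
bias = [0.0] * out_dim
features = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(2)] for _ in range(2)]
image = to_pixels(linear_head(features, weight, bias), stride)
print(len(image), len(image[0]), len(image[0][0]))  # 4 4 3
```

The only learned parameters in this head are `weight` and `bias`, which is what keeps it so much cheaper than a Transformer decoder.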

1.3. Prediction Targets

  • In iGPT, pixels are clustered, and the model predicts the cluster of each pixel. In BEiT, visual tokens are predicted.
  • In this paper, raw pixel value regression is performed.
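  • An ℓ1-loss is employed on the masked pixels: L = (1/Ω(x_M)) ‖y_M − x_M‖₁, where x and y are the input RGB values and the predicted values, M denotes the set of masked pixels, and Ω(·) is the number of elements.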
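As a rough sketch of an ℓ1 loss restricted to masked pixels, the plain-Python function below sums the absolute differences at masked positions and divides by the number of masked positions. The name `masked_l1_loss`, the flat-list inputs, and the zero return for an empty mask are assumptions made for illustration only.

```python
def masked_l1_loss(pred, target, mask):
    """Mean absolute error over masked positions only.

    pred, target: flat sequences of pixel values of equal length.
    mask: flat sequence of 0/1 flags (1 = masked, contributes to the loss).
    """
    num_masked = sum(mask)
    if num_masked == 0:
        return 0.0  # assumption: define the loss as zero when nothing is masked
    total = sum(abs(p - t) for p, t, m in zip(pred, target, mask) if m)
    return total / num_masked

pred = [10, 20, 30, 40]
target = [12, 20, 25, 0]
mask = [1, 0, 1, 0]
print(masked_l1_loss(pred, target, mask))  # 3.5
```

Note that the large error at the last position (40 vs. 0) does not affect the result, because that pixel is visible and therefore excluded from the loss.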

2. Results

2.1. Ablation Study

  • Swin-B is used as the default backbone.
Ablation on different masking strategies with different masked patch sizes

The best accuracy of 83.0% is obtained with the proposed simple random masking strategy, which is 0.3% higher than the best of the other strategies.

  • When a large masked patch size of 32 is adopted, this simple strategy performs stably well on a broad range of masking ratios (10%-70%).
(a) AvgDist (averaged distance of masked pixels to the nearest visible pixels) w.r.t. different masking ratios using different masking strategies and different masked patch sizes; (b) finetuning performance (top-1 accuracy) w.r.t. AvgDist.
  • (a) The AvgDist of all masking strategies increases smoothly with growing masking ratios.
  • When the masked patch size is small, e.g., 4 or 8, the AvgDist is relatively low and grows slowly with increasing masking ratios.
  • When the patch size is large, e.g., 64, even a very small masking ratio (e.g., 10%) still yields a relatively large AvgDist.

(b) The prediction distance in masked image modeling is encouraged to be moderate, neither too large nor too small.
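The AvgDist measure can be illustrated with a brute-force sketch in plain Python. It assumes Euclidean distance in pixel units, which is a guess since the metric is not stated here; `block_mask` and `avg_dist` are hypothetical helper names. The example compares a small and a larger masked block on an 8×8 image.

```python
import math

def block_mask(size, top, left, block):
    """size x size grid with one block x block masked square (1 = masked)."""
    return [
        [1 if top <= r < top + block and left <= c < left + block else 0 for c in range(size)]
        for r in range(size)
    ]

def avg_dist(mask):
    """Average distance from each masked pixel to its nearest visible pixel.

    Brute force over all (masked, visible) pairs; fine for small grids only.
    """
    h, w = len(mask), len(mask[0])
    visible = [(r, c) for r in range(h) for c in range(w) if mask[r][c] == 0]
    masked = [(r, c) for r in range(h) for c in range(w) if mask[r][c] == 1]
    if not masked or not visible:
        return 0.0  # assumption: undefined cases are reported as zero
    total = 0.0
    for r, c in masked:
        total += min(math.hypot(r - vr, c - vc) for vr, vc in visible)
    return total / len(masked)

print(avg_dist(block_mask(8, 3, 3, 2)))  # 1.0
print(avg_dist(block_mask(8, 2, 2, 4)))  # 1.25
```

In the 2×2 block every masked pixel touches a visible one, while in the 4×4 block the four inner pixels are two steps away from any visible pixel, so the larger masked patch gives the larger AvgDist.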

A masking ratio of 0.6 with a masked patch size of 32 is used by default, due to its stable performance.

Ablation on different prediction heads.
  • The training cost of using an inverse Swin-B head is 2.3× that of a linear layer.

A single linear layer head, evaluated by fine-tuning, shows competitive or even the best transfer performance.

If the aim is to learn good features for fine-tuning, the extensive exploration of head design seen in contrastive learning approaches may not be necessary for masked image modeling.

Ablation on different prediction resolutions.
  • The transfer performance drops only at a low resolution of 6², probably because this option throws too much information away.

A default target resolution of 192² is adopted in the experiments, since it ties for the best transfer accuracy and adds negligible computation overhead.

Ablation on different prediction targets.

The three losses of ℓ1, smooth-ℓ1, and ℓ2 perform similarly well. Direct regression aligns the approach with the nature of visual signals.

Ablation on different performing areas of prediction loss.

Predicting only the masked area performs significantly better than recovering all image pixels (82.8% vs. 81.7%).

2.2. SOTA Comparisons

System-level comparison using ViT-B as the encoder.

SimMIM using ViT-B obtains the best results.

SimMIM training is 2.0×, 1.8×, ∼4.0×, and 1.5× more efficient than that of DINO, MoCo v3, ViT, and BEiT, respectively.

2.3. Scaling Experiments with Swin Transformer

Scaling experiments with Swin Transformer as backbone architectures.

With SimMIM pre-training, all of Swin-B, Swin-L, and SwinV2-H achieve significantly higher accuracy than their supervised counterparts.

A 3B-parameter SwinV2-G model, trained with ∼40× less data than JFT-3B, obtains 90.2% accuracy.

2.4. Visualizations

Recovered images using three different mask types (from left to right): random masking, masking most parts of a major object, and masking the full major object.

When moderate parts of the major object are randomly masked, both the shape and texture of the masked parts can be well recovered, as shown by the penguin, the mountain, the sailboat, and the persons.

When most of a major object is masked (more than 90%), the model can still predict the existence of an object from negligible clues.

Recovered images using two different losses: predicting only the masked area, or reconstructing the whole image area. From left to right: raw image, masked image, prediction of masked patches only, and reconstruction of all patches.

Reconstructing the whole image area gives better-looking results. However, model capacity is probably wasted on recovering the unmasked area, which may not be that useful for fine-tuning.

An example of recovered images using masked patch sizes of 4, 8, 16, 32 and 64, and a fixed masking ratio of 0.6.

The details are recovered much better when the masked patch size is smaller, but the learnt representations transfer worse. With a smaller patch size, the prediction task can probably be accomplished easily from close-by pixels or textures.


