Review — Masked Autoencoders Are Scalable Vision Learners

MAE, Masking Image Patches for Self-Supervised Learning

Sik-Ho Tsang
6 min read · Dec 23, 2022
Example results on ImageNet validation images. For each triplet: the masked image (left), the MAE reconstruction (middle), and the ground truth (right). The masking ratio is 80%, leaving only 39 of the 196 patches visible.
Example results on COCO validation images, using an MAE trained on ImageNet (the same model weights as in the figure above).

Observe the reconstructions on the two right-most examples, which, although different from the ground truth, are semantically plausible.

Masked Autoencoders Are Scalable Vision Learners,
Masked Autoencoders (MAE), by Facebook AI Research (FAIR),
2022 CVPR, Over 900 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Autoencoder, BERT, Vision Transformer, ViT

  • An asymmetric encoder-decoder architecture is designed, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
  • Masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.

Outline

  1. Goal
  2. Masked Autoencoders (MAE)
  3. Experimental Results

1. Goal

  • In NLP, BERT proposes Masked Language Modeling (MLM), which masks text tokens and predicts them back; this is what makes the pre-training successful.

But, what makes masked autoencoding different between vision and language?

1.1. Until Recently, Architectures Were Different

This architectural gap, however, has been addressed with the introduction of Vision Transformers (ViT) and should no longer present an obstacle.

1.2. Information Density is Different Between Language and Vision

  • Languages are human-generated signals that are highly semantic and information-dense.
  • Images, on the contrary, are natural signals with heavy spatial redundancy.

A masking strategy with a high masking ratio largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics, as shown in the figures at the top.

1.3. The Autoencoder’s Decoder, which Maps the Latent Representation Back to the Input, Plays a Different Role between Reconstructing Text and Images.

  • In vision, the decoder reconstructs pixels, which is of a lower semantic level.
  • This is in contrast to language, where the decoder predicts missing words that contain rich semantic information.

While in BERT the decoder can be trivial (an MLP), it is found that for images, the decoder design plays a key role in determining the semantic level of the learned latent representations.

  • MAE is designed to address the above issues.

2. Masked Autoencoders (MAE)

The MAE architecture.

2.1. Masking

  • Following ViT, an image is divided into regular non-overlapping patches. Then, a subset of patches is sampled following a uniform distribution without replacement, and the remaining ones are masked (i.e., removed).
  • Random sampling with a high masking ratio (i.e., the ratio of removed patches) largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation.

Finally, the highly sparse input creates an opportunity for designing an efficient encoder, introduced next.
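
To make this concrete, here is a minimal PyTorch sketch of the random masking step (an illustration, not the official implementation). It assumes the image has already been split into patches and embedded into a tensor x of shape (batch, num_patches, dim).

    import torch

    def random_masking(x, mask_ratio=0.75):
        # x: (batch, num_patches, dim) patch embeddings
        B, L, D = x.shape
        len_keep = int(L * (1 - mask_ratio))             # e.g. 49 of 196 patches at 75%

        noise = torch.rand(B, L, device=x.device)        # one random score per patch
        ids_shuffle = torch.argsort(noise, dim=1)        # a random permutation per image
        ids_keep = ids_shuffle[:, :len_keep]             # indices of the visible patches

        # keep only the visible patches; the masked ones are simply dropped
        x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # binary mask over all patches: 0 = visible, 1 = masked (used later by the loss)
        mask = torch.ones(B, L, device=x.device)
        mask.scatter_(1, ids_keep, 0.0)
        return x_visible, mask, ids_shuffle

Taking the first len_keep entries of a per-image random permutation is exactly sampling patches uniformly without replacement.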

2.2. MAE Encoder

  • The encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks.
  • However, the encoder only operates on a small subset (e.g., 25%) of the full set. Masked patches are removed; no mask tokens are used.
  • This allows us to train very large encoders with only a fraction of compute and memory.
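
A rough sketch of such an encoder, reusing the random_masking function above and the timm library's generic ViT building blocks (PatchEmbed, Block). The class token and final LayerNorm are omitted for brevity; this is an illustration, not the official implementation.

    import torch
    import torch.nn as nn
    from timm.models.vision_transformer import PatchEmbed, Block

    class MAEEncoder(nn.Module):
        # ViT-L/16-like settings: 16x16 patches, width 1024, 24 blocks, 16 heads.
        def __init__(self, img_size=224, patch_size=16, dim=1024, depth=24, heads=16):
            super().__init__()
            self.patch_embed = PatchEmbed(img_size, patch_size, 3, dim)   # linear projection of patches
            num_patches = self.patch_embed.num_patches                    # 196 for 224 / 16
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
            self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])

        def forward(self, imgs, mask_ratio=0.75):
            x = self.patch_embed(imgs) + self.pos_embed           # embed + positional embeddings
            x, mask, ids_shuffle = random_masking(x, mask_ratio)  # drop 75% of the tokens
            for blk in self.blocks:                               # Transformer runs on ~25% of tokens
                x = blk(x)
            return x, mask, ids_shuffle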

2.3. MAE Decoder

  • The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens. Each mask token is a shared, learned vector.
  • Positional embeddings are added to all tokens in this full set.
  • The decoder has another series of Transformer blocks.
  • The MAE decoder is only used during pre-training.
  • The default lightweight decoder has <10% computation per token vs. the encoder, which significantly reduces pre-training time.
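
Continuing the sketch, a possible decoder following the description above (again using timm's Block; the default depth of 8 and width of 512 follow the paper, but the code itself is only illustrative):

    import torch
    import torch.nn as nn
    from timm.models.vision_transformer import Block

    class MAEDecoder(nn.Module):
        def __init__(self, enc_dim=1024, dim=512, depth=8, heads=16,
                     num_patches=196, patch_pixels=16 * 16 * 3):
            super().__init__()
            self.proj = nn.Linear(enc_dim, dim)                     # map encoder width to decoder width
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # one shared, learned vector
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
            self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])
            self.head = nn.Linear(dim, patch_pixels)                # predicts the pixels of one patch

        def forward(self, x_visible, ids_shuffle):
            B, num_patches = ids_shuffle.shape
            x = self.proj(x_visible)
            n_masked = num_patches - x.shape[1]
            x = torch.cat([x, self.mask_token.expand(B, n_masked, -1)], dim=1)  # full set of tokens
            ids_restore = torch.argsort(ids_shuffle, dim=1)                     # undo the random shuffle
            x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
            x = x + self.pos_embed                                  # positions for ALL tokens
            for blk in self.blocks:
                x = blk(x)
            return self.head(x)                                     # (B, num_patches, patch_pixels)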

2.4. Reconstruction Target

  • Each element in the decoder’s output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder’s output is reshaped to form a reconstructed image.
  • The loss function computes the mean squared error (MSE) between the reconstructed and original images in the pixel space. The loss is only on masked patches, similar to BERT.
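
In code, this loss is simply a masked average of per-patch squared errors. A minimal sketch, assuming target already holds the original image split into per-patch pixel vectors (the same layout as the decoder output):

    def mae_loss(pred, target, mask):
        # pred, target: (B, num_patches, patch_pixels); mask: (B, num_patches), 1 = masked
        loss = (pred - target) ** 2
        loss = loss.mean(dim=-1)                   # MSE per patch
        return (loss * mask).sum() / mask.sum()    # average over masked patches only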

2.5. Simple Implementation

  • The list of tokens is randomly shuffled, and the last portion of the list is removed according to the masking ratio. This process produces a small subset of tokens for the encoder and is equivalent to sampling patches without replacement.
  • After encoding, a list of mask tokens is appended to the list of encoded patches, and this full list is unshuffled (inverting the random shuffle operation) to align all tokens with their targets.
  • Then, the decoder is applied to this full list.
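
A toy illustration of this shuffle / unshuffle trick (not the paper's code), using 8 tokens and a 75% masking ratio:

    import torch

    L, mask_ratio = 8, 0.75
    len_keep = int(L * (1 - mask_ratio))                 # 2 tokens stay visible

    tokens = torch.arange(L)                             # stand-ins for patch tokens: 0..7
    ids_shuffle = torch.randperm(L)                      # random shuffle
    kept = tokens[ids_shuffle][:len_keep]                # the encoder sees only these
    print(kept)                                          # e.g. tensor([5, 2])

    # after encoding, pad back to full length with mask tokens (here: -1),
    # then invert the shuffle so every token returns to its original position
    full = torch.cat([kept, torch.full((L - len_keep,), -1, dtype=torch.long)])
    ids_restore = torch.argsort(ids_shuffle)
    print(full[ids_restore])                             # e.g. tensor([-1, -1,  2, -1, -1,  5, -1, -1])

Tokens 2 and 5 return to positions 2 and 5, and every other slot is now a mask token; the only bookkeeping needed is the argsort that inverts the shuffle.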

3. Experimental Results

3.1. ViT-Large (ViT-L/16)

ViT-L trained from scratch vs. fine-tuned from baseline MAE on ImageNet
  • ViT-Large (ViT-L/16) is used as the backbone. The re-implementation, using a better training recipe, obtains 82.5% Top-1 accuracy.

With MAE, 84.9% accuracy is obtained.

3.2. Ablation Study

Reconstructions of ImageNet validation images using an MAE pre-trained with a masking ratio of 75% but applied on inputs with higher masking ratios

The model infers missing patches to produce different, yet plausible, outputs.

Masking ratio.

A high masking ratio (75%) works well for both fine-tuning (top) and linear probing (bottom).

MAE ablation experiments with ViT-L/16 on ImageNet-1K.

Other ablation studies are also performed, as shown in the table above.

Mask sampling strategies determine the pretext task difficulty, influencing reconstruction quality and the learned representations (Table (f) above).

Random mask sampling is the best.

Training schedules.

A longer training schedule gives a noticeable improvement.

3.3. SOTA Comparisons

Comparisons with previous results on ImageNet-1K.
  • MAE can scale up easily and shows steady improvement with bigger models. 86.9% accuracy is obtained using ViT-H at an image size of 224. By fine-tuning at an image size of 448, MAE achieves 87.8% accuracy, using only IN1K data.

Comparing with BEiT, MAE is more accurate while being simpler and faster.

3.4. Comparisons with Supervised Pre-Training

MAE pre-training vs. supervised pre-training

MAE pre-training, using only IN1K, can generalize better. It follows a trend similar to the JFT-300M supervised pre-training.

3.5. Partial Fine-Tuning

Partial fine-tuning results of ViT-L w.r.t. the number of fine-tuned Transformer blocks
  • Notably, fine-tuning only one Transformer block boosts the accuracy significantly from 73.5% to 81.0%.

Moreover, if only “half” of the last block (i.e., its MLP sub-block) is fine-tuned, 79.1% accuracy is obtained, which is much better than linear probing. This variant is essentially fine-tuning an MLP head. Fine-tuning a few blocks (e.g., 4 or 6) can achieve accuracy close to full fine-tuning.
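
A sketch of how such partial fine-tuning can be set up in PyTorch, assuming a timm-style ViT whose Transformer blocks live in model.blocks and whose classifier is model.head (the function name is illustrative):

    def partial_finetune(model, n_blocks=1):
        for p in model.parameters():
            p.requires_grad = False                # freeze everything (linear-probing setup)
        for blk in model.blocks[-n_blocks:]:       # unfreeze only the last n_blocks Transformer blocks
            for p in blk.parameters():
                p.requires_grad = True
        for p in model.head.parameters():          # the classification head is always trained
            p.requires_grad = True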

3.6. Transfer Learning

COCO object detection and segmentation using a ViT Mask R-CNN baseline.
ADE20K semantic segmentation (mIoU) using UPerNet.
Transfer learning accuracy on classification datasets.

On COCO, ADE20K, iNaturalist, and Places365, MAE also obtains better performance.

Reference

[2022 CVPR] [Masked Autoencoders (MAE)]
Masked Autoencoders Are Scalable Vision Learners

