Review — Masked Autoencoders Are Scalable Vision Learners

MAE, Masking Image Patches for Self-Supervised Learning

Sik-Ho Tsang
6 min read · Dec 23, 2022
Example results on ImageNet validation images. For each triplet: the masked image (left), the MAE reconstruction (middle), and the ground truth (right). The masking ratio is 80%, leaving only 39 of the 196 patches visible.
Example results on COCO validation images, using an MAE trained on ImageNet (the same model weights as in the figure above).

Observe the reconstructions on the two right-most examples, which, although different from the ground truth, are semantically plausible.

Masked Autoencoders Are Scalable Vision Learners,
Masked Autoencoders (MAE), by Facebook AI Research (FAIR),
2022 CVPR, Over 900 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Autoencoder, BERT, Vision Transformer, ViT

  • An asymmetric encoder-decoder architecture is designed, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
  • Masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.

Outline

  1. Goal
  2. Masked Autoencoders (MAE)
  3. Experimental Results

1. Goal

  • In NLP, BERT proposes Masked Language Modeling (MLM), which masks text tokens and predicts them back; this is what makes the pre-training successful.

But, what makes masked autoencoding different between vision and language?

1.1. Until Recently, Architectures Were Different

This architectural gap, however, has been addressed with the introduction of Vision Transformers (ViT) and should no longer present an obstacle.

1.2. Information Density is Different Between Language and Vision

  • Languages are human-generated signals that are highly semantic and information-dense.
  • Images, on the contrary, are natural signals with heavy spatial redundancy.

A masking strategy with a high masking ratio largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics, as shown in the figures at the top.

1.3. The Autoencoder’s Decoder, which Maps the Latent Representation Back to the Input, Plays a Different Role between Reconstructing Text and Images.

  • In vision, the decoder reconstructs pixels, which is of a lower semantic level.
  • This is in contrast to language, where the decoder predicts missing words that contain rich semantic information.

While in BERT the decoder can be trivial (an MLP), it is found that for images, the decoder design plays a key role in determining the semantic level of the learned latent representations.

  • MAE is designed to address the above issues.

2. Masked Autoencoders (MAE)

The MAE architecture.

2.1. Masking

  • Following ViT, an image is divided into regular non-overlapping patches. Then, a subset of patches is sampled following a uniform distribution without replacement, and the remaining ones are masked (i.e., removed).
  • Random sampling with a high masking ratio (i.e., the ratio of removed patches) largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation.

Finally, the highly sparse input creates an opportunity for designing an efficient encoder, introduced next.
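
To make this concrete, here is a minimal PyTorch sketch of the random masking step (an illustration, not the official implementation). It assumes the image has already been split into patches and embedded into a tensor x of shape (batch, num_patches, dim).

    import torch

    def random_masking(x, mask_ratio=0.75):
        # x: (batch, num_patches, dim) patch embeddings
        B, L, D = x.shape
        len_keep = int(L * (1 - mask_ratio))             # e.g. 49 of 196 patches at 75%

        noise = torch.rand(B, L, device=x.device)        # one random score per patch
        ids_shuffle = torch.argsort(noise, dim=1)        # a random permutation per image
        ids_keep = ids_shuffle[:, :len_keep]             # indices of the visible patches

        # keep only the visible patches; the masked ones are simply dropped
        x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # binary mask over all patches: 0 = visible, 1 = masked (used later by the loss)
        mask = torch.ones(B, L, device=x.device)
        mask.scatter_(1, ids_keep, 0.0)
        return x_visible, mask, ids_shuffle

Taking the first len_keep entries of a per-image random permutation is exactly sampling patches uniformly without replacement.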

2.2. MAE Encoder

  • The encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks.
  • However, the encoder only operates on a small subset (e.g., 25%) of the full set. Masked patches are removed; no mask tokens are used.
  • This allows us to train very large encoders with only a fraction of compute and memory.
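
A rough sketch of such an encoder, reusing the random_masking function above and the timm library's generic ViT building blocks (PatchEmbed, Block). The class token and final LayerNorm are omitted for brevity; this is an illustration, not the official implementation.

    import torch
    import torch.nn as nn
    from timm.models.vision_transformer import PatchEmbed, Block

    class MAEEncoder(nn.Module):
        # ViT-L/16-like settings: 16x16 patches, width 1024, 24 blocks, 16 heads.
        def __init__(self, img_size=224, patch_size=16, dim=1024, depth=24, heads=16):
            super().__init__()
            self.patch_embed = PatchEmbed(img_size, patch_size, 3, dim)   # linear projection of patches
            num_patches = self.patch_embed.num_patches                    # 196 for 224 / 16
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
            self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])

        def forward(self, imgs, mask_ratio=0.75):
            x = self.patch_embed(imgs) + self.pos_embed           # embed + positional embeddings
            x, mask, ids_shuffle = random_masking(x, mask_ratio)  # drop 75% of the tokens
            for blk in self.blocks:                               # Transformer runs on ~25% of tokens
                x = blk(x)
            return x, mask, ids_shuffle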

2.3. MAE Decoder

  • The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens. Each mask token is a shared, learned vector.
  • Positional embeddings are added to all tokens in this full set.
  • The decoder has another series of Transformer blocks.
  • The MAE decoder is only used during pre-training.
  • The default lightweight decoder has <10% computation per token vs. the encoder, which significantly reduces pre-training time.
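
Continuing the sketch, a possible decoder following the description above (again using timm's Block; the default depth of 8 and width of 512 follow the paper, but the code itself is only illustrative):

    import torch
    import torch.nn as nn
    from timm.models.vision_transformer import Block

    class MAEDecoder(nn.Module):
        def __init__(self, enc_dim=1024, dim=512, depth=8, heads=16,
                     num_patches=196, patch_pixels=16 * 16 * 3):
            super().__init__()
            self.proj = nn.Linear(enc_dim, dim)                     # map encoder width to decoder width
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # one shared, learned vector
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
            self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])
            self.head = nn.Linear(dim, patch_pixels)                # predicts the pixels of one patch

        def forward(self, x_visible, ids_shuffle):
            B, num_patches = ids_shuffle.shape
            x = self.proj(x_visible)
            n_masked = num_patches - x.shape[1]
            x = torch.cat([x, self.mask_token.expand(B, n_masked, -1)], dim=1)  # full set of tokens
            ids_restore = torch.argsort(ids_shuffle, dim=1)                     # undo the random shuffle
            x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
            x = x + self.pos_embed                                  # positions for ALL tokens
            for blk in self.blocks:
                x = blk(x)
            return self.head(x)                                     # (B, num_patches, patch_pixels)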

2.4. Reconstruction Target

  • Each element in the decoder’s output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder’s output is reshaped to form a reconstructed image.
  • The loss function computes the mean squared error (MSE) between the reconstructed and original images in the pixel space. The loss is only on masked patches, similar to BERT.
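
In code, this loss is simply a masked average of per-patch squared errors. A minimal sketch, assuming target already holds the original image split into per-patch pixel vectors (the same layout as the decoder output):

    def mae_loss(pred, target, mask):
        # pred, target: (B, num_patches, patch_pixels); mask: (B, num_patches), 1 = masked
        loss = (pred - target) ** 2
        loss = loss.mean(dim=-1)                   # MSE per patch
        return (loss * mask).sum() / mask.sum()    # average over masked patches only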

2.5. Simple Implementation

  • The list of tokens is randomly shuffled, and the last portion of the list is removed according to the masking ratio. This process produces a small subset of tokens for the encoder and is equivalent to sampling patches without replacement.
  • After encoding, a list of mask tokens is appended to the list of encoded patches, and this full list is unshuffled (inverting the random shuffle operation) to align all tokens with their targets.
  • Then, the decoder is applied to this full list.
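
A toy illustration of this shuffle / unshuffle trick (not the paper's code), using 8 tokens and a 75% masking ratio:

    import torch

    L, mask_ratio = 8, 0.75
    len_keep = int(L * (1 - mask_ratio))                 # 2 tokens stay visible

    tokens = torch.arange(L)                             # stand-ins for patch tokens: 0..7
    ids_shuffle = torch.randperm(L)                      # random shuffle
    kept = tokens[ids_shuffle][:len_keep]                # the encoder sees only these
    print(kept)                                          # e.g. tensor([5, 2])

    # after encoding, pad back to full length with mask tokens (here: -1),
    # then invert the shuffle so every token returns to its original position
    full = torch.cat([kept, torch.full((L - len_keep,), -1, dtype=torch.long)])
    ids_restore = torch.argsort(ids_shuffle)
    print(full[ids_restore])                             # e.g. tensor([-1, -1,  2, -1, -1,  5, -1, -1])

Tokens 2 and 5 return to positions 2 and 5, and every other slot is now a mask token; the only bookkeeping needed is the argsort that inverts the shuffle.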

3. Experimental Results

3.1. ViT-Large (ViT-L/16)

ViT-L trained from scratch vs. fine-tuned from baseline MAE on ImageNet
  • ViT-Large (ViT-L/16) is used as the backbone. The re-implementation, using a better training recipe, obtains 82.5% Top-1 accuracy.

With MAE, 84.9% accuracy is obtained.

3.2. Ablation Study

Reconstructions of ImageNet validation images using an MAE pre-trained with a masking ratio of 75% but applied on inputs with higher masking ratios

The model infers missing patches to produce different, yet plausible, outputs.

Masking ratio.

A high masking ratio (75%) works well for both fine-tuning (top) and linear probing (bottom).

MAE ablation experiments with ViT-L/16 on ImageNet-1K.

Other ablation studies are also performed, as shown in the table above.

Mask sampling strategies determine the pretext task difficulty, influencing reconstruction quality and the learned representations (Table (f) above).

Random mask sampling is the best.

Training schedules.

A longer training schedule gives a noticeable improvement.

3.3. SOTA Comparisons

Comparisons with previous results on ImageNet-1K.
  • MAE can scale up easily and shows steady improvement with bigger models. 86.9% accuracy is obtained using ViT-H at an image size of 224. By fine-tuning at an image size of 448, MAE achieves 87.8% accuracy, using only IN1K data.

Comparing with BEiT, MAE is more accurate while being simpler and faster.

3.4. Comparisons with Supervised Pre-Training

MAE pre-training vs. supervised pre-training

MAE pre-training, using only IN1K, can generalize better. It follows a trend similar to the JFT-300M supervised pre-training.

3.5. Partial Fine-Tuning

Partial fine-tuning results of ViT-L w.r.t. the number of fine-tuned Transformer blocks
  • Notably, fine-tuning only one Transformer block boosts the accuracy significantly from 73.5% to 81.0%.

Moreover, if only “half” of the last block (i.e., its MLP sub-block) is fine-tuned, 79.1% accuracy is obtained, which is much better than linear probing. This variant is essentially fine-tuning an MLP head. Fine-tuning a few blocks (e.g., 4 or 6) can achieve accuracy close to full fine-tuning.
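
A sketch of how such partial fine-tuning can be set up in PyTorch, assuming a timm-style ViT whose Transformer blocks live in model.blocks and whose classifier is model.head (the function name is illustrative):

    def partial_finetune(model, n_blocks=1):
        for p in model.parameters():
            p.requires_grad = False                # freeze everything (linear-probing setup)
        for blk in model.blocks[-n_blocks:]:       # unfreeze only the last n_blocks Transformer blocks
            for p in blk.parameters():
                p.requires_grad = True
        for p in model.head.parameters():          # the classification head is always trained
            p.requires_grad = True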

3.6. Transfer Learning

COCO object detection and segmentation using a ViT Mask R-CNN baseline.
ADE20K semantic segmentation (mIoU) using UPerNet.
Transfer learning accuracy on classification datasets.

On COCO, ADE20K, iNaturalist, and Places365, MAE also obtains better performance.

Reference

[2022 CVPR] [Masked Autoencoders (MAE)]
Masked Autoencoders Are Scalable Vision Learners

