Review — Masked Autoencoders Are Scalable Vision Learners
Observe the reconstructions on the two right-most examples, which, although different from the ground truth, are semantically plausible.
Masked Autoencoders Are Scalable Vision Learners,
Masked Autoencoders (MAE), by Facebook AI Research (FAIR),
2022 CVPR, Over 900 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Autoencoder, BERT, Vision Transformer, ViT
- An asymmetric encoder-decoder architecture is designed, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
- Masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.
- Masked Autoencoders (MAE)
- Experimental Results
- In NLP, BERT proposes Masked Language Modeling (MLM), which masks text tokens and predicts them back; this makes the pre-training highly successful.
But what makes masked autoencoding different between vision and language?
1.1. Until Recently, Architectures were Different
1.2. Information Density is Different Between Language and Vision
- Languages are human-generated signals that are highly semantic and information-dense.
- Images, on the contrary, are natural signals with heavy spatial redundancy.
A high-proportion masking strategy largely reduces this redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics, as shown in the figures at the top.
1.3. The Autoencoder’s Decoder, which Maps the Latent Representation Back to the Input, Plays a Different Role between Reconstructing Text and Images.
- In vision, the decoder reconstructs pixels, which is of a lower semantic level.
- This is in contrast to language, where the decoder predicts missing words that contain rich semantic information.
While in BERT the decoder can be trivial (an MLP), it is found that for images, the decoder design plays a key role in determining the semantic level of the learned latent representations.
- MAE is designed to address the above issues.
2. Masked Autoencoders (MAE)
2.1. Masking
- Following ViT, an image is divided into regular non-overlapping patches. Then, a subset of patches is sampled following a uniform distribution without replacement, and the remaining ones are masked (i.e., removed).
- Random sampling with a high masking ratio (i.e., the ratio of removed patches) largely eliminates redundancy, thus creating a task that cannot be easily solved by extrapolation.
Finally, the highly sparse input creates an opportunity for designing an efficient encoder, introduced next.
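The patchify-and-sample step can be sketched as below. This is a minimal NumPy illustration, not the paper's code; `patchify` and `random_masking` are hypothetical names, and the tiny 32×32 image is only for demonstration.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flat patches."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)

def random_masking(patches, mask_ratio=0.75, rng=np.random.default_rng(0)):
    """Uniformly sample patches without replacement; the rest are masked."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)          # random order of patch indices
    keep_idx = perm[:n_keep]           # visible subset fed to the encoder
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False             # False = visible, True = masked
    return patches[keep_idx], keep_idx, mask

img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = patchify(img, patch=16)           # 4 patches of 16x16x3 = 768 values
visible, keep_idx, mask = random_masking(patches)
print(patches.shape, visible.shape, int(mask.sum()))  # (4, 768) (1, 768) 3
```

With the default 75% ratio, only one of the four patches stays visible; the encoder never sees the other three.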
2.2. MAE Encoder
- The encoder embeds patches by a linear projection with added positional embeddings, and then processes the resulting set via a series of Transformer blocks.
- However, the encoder only operates on a small subset (e.g., 25%) of the full set. Masked patches are removed; no mask tokens are used.
- This allows us to train very large encoders with only a fraction of compute and memory.
2.3. MAE Decoder
- The input to the MAE decoder is the full set of tokens consisting of (i) encoded visible patches, and (ii) mask tokens. Each mask token is a shared, learned vector.
- Positional embeddings are added to all tokens in this full set.
- The decoder has another series of Transformer blocks.
- The MAE decoder is only used during pre-training.
- The default lightweight decoder has <10% computation per token vs. the encoder, which significantly reduces pre-training time.
2.4. Reconstruction Target
- Each element in the decoder’s output is a vector of pixel values representing a patch. The last layer of the decoder is a linear projection whose number of output channels equals the number of pixel values in a patch. The decoder’s output is reshaped to form a reconstructed image.
- The loss function computes the mean squared error (MSE) between the reconstructed and original images in the pixel space. The loss is only on masked patches, similar to BERT.
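The masked reconstruction loss can be sketched as follows; this is a minimal NumPy version with a hypothetical `masked_mse` name, not the released implementation.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """MSE computed only on masked patches, as in BERT-style training.

    pred, target: (N, D) per-patch pixel values; mask: (N,) bool,
    True where the patch was masked (removed from the encoder input).
    """
    err = (pred - target) ** 2           # per-pixel squared error
    per_patch = err.mean(axis=1)         # mean over pixels within each patch
    return per_patch[mask].mean()        # average over masked patches only

target = np.zeros((4, 768))              # toy "original" patches
pred = np.ones((4, 768))                 # toy "reconstructed" patches
mask = np.array([True, True, True, False])
print(masked_mse(pred, target, mask))    # 1.0
```

The visible patch (the `False` entry) contributes nothing to the loss, so the model is never rewarded for merely copying its input.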
2.5. Simple Implementation
- The list of tokens is randomly shuffled and the last portion of the list is removed, according to the masking ratio. This process produces a small subset of tokens for the encoder and is equivalent to sampling patches without replacement.
- After encoding, a list of mask tokens is appended to the list of encoded patches, and this full list is unshuffled (inverting the random shuffle operation) to align all tokens with their targets.
- Then, the decoder is applied to this full list.
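The shuffle/unshuffle trick can be sketched as below, assuming NumPy and stand-in tokens in place of real encoder outputs; `np.argsort(perm)` gives the inverse permutation that restores the original patch order.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mask_ratio = 8, 4, 0.75
tokens = np.arange(n * d, dtype=np.float32).reshape(n, d)

# Shuffle, keep the first part, drop the rest (the "masked" patches).
perm = rng.permutation(n)
n_keep = int(n * (1 - mask_ratio))
encoded = tokens[perm[:n_keep]]          # stand-in for the encoder's output

# Append shared mask tokens, then unshuffle with the inverse permutation
# so every token returns to its original patch position.
mask_token = np.zeros(d)
full = np.concatenate([encoded, np.tile(mask_token, (n - n_keep, 1))])
inverse = np.argsort(perm)               # inverts the random shuffle
restored = full[inverse]

# Visible tokens land back at their original indices; all other
# positions now hold the shared (here all-zero) mask token.
print(np.allclose(restored[perm[:n_keep]], tokens[perm[:n_keep]]))  # True
```

No sparse operations or per-sample gather logic are needed; shuffling and index sorting are enough, which is what makes the implementation so simple and fast.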
3. Experimental Results
3.1. Baseline: ViT-Large
- ViT-Large (ViT-L/16) is used as the backbone. The re-implementation with a better training recipe obtains 82.5% top-1 accuracy.
With MAE, 84.9% accuracy is obtained.
3.2. Ablation Study
The model infers missing patches to produce different, yet plausible output.
A high masking ratio (75%) works well for both fine-tuning (top) and linear probing (bottom).
Other ablation studies are also performed as above.
Random mask sampling is the best.
A longer training schedule gives a noticeable improvement.
3.3. SOTA Comparisons
- MAE can scale up easily and has shown steady improvement from bigger models. 86.9% accuracy is obtained using ViT-H (224 size). By fine-tuning with a 448 size, MAE achieves 87.8% accuracy, using only IN1K data.
Comparing with BEiT, MAE is more accurate while being simpler and faster.
3.4. Comparisons with Supervised Pre-Training
MAE pre-training, using only IN1K, can generalize better. It follows a trend similar to the JFT-300M supervised pre-training.
3.5. Partial Fine-Tuning
- Notably, fine-tuning only one Transformer block boosts the accuracy significantly from 73.5% to 81.0%.
Moreover, if only “half” of the last Transformer block (i.e., its MLP sub-block) is fine-tuned, 79.1% accuracy is obtained, much better than linear probing. This variant is essentially fine-tuning an MLP head. Fine-tuning a few blocks (e.g., 4 or 6) achieves accuracy close to full fine-tuning.
3.6. Transfer Learning
On COCO, ADE20K, iNat, and Places365, MAE also obtains better performance.
[2022 CVPR] [Masked Autoencoders (MAE)]
Masked Autoencoders Are Scalable Vision Learners