Brief Review — SplitMask: Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

SplitMask, Split-Brain Auto + BEiT?

Sik-Ho Tsang
4 min read · Sep 7


SplitMask is more robust to the type and/or size of the pre-training data used.

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?, by Meta AI, Inria, and Sorbonne University
2021 arXiv v1, Over 90 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM]
==== My Other Paper Readings Are Also Over Here ====

  • A self-supervised pre-training scenario is considered that only leverages the target task data, especially small datasets, like Stanford Cars, Sketch or COCO, which are order(s) of magnitude smaller than ImageNet.
  • SplitMask, a variant of denoising autoencoders similar to BEiT, is proposed. It is more robust to the type and size of the pre-training data.


  1. SplitMask
  2. Results

1. SplitMask


SplitMask is based on three steps: split, inpaint and match.

  • Split: As in standard Vision Transformers (ViTs), an image is first broken down into patches of 16×16 pixels. Then, the patches are split into two disjoint subsets A and B, which are processed independently by a shared deep ViT encoder.

  • Inpaint: Next, using the patch representations of subset A and a shallow decoder (e.g. 2 layers), the patches of subset B are “inpainted” by solving a Masked Image Modeling (MIM) task (BEiT), and vice versa.

  • Finally, a global image descriptor is obtained by average pooling of the patch representations from the decoder output corresponding to each branch.

  • Match: The two representations x_a and x_b, corresponding to the subsets A and B of observed patches, are used to compute an InfoNCE loss (CPCv1), in which x_a and x_b from the same image form the positive pair.
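
For reference, the commonly used form of the InfoNCE objective is sketched below. This is the standard formulation, not a transcription of the paper's equation. The temperature τ and the negative set N (for example, descriptors of other images in the batch) are notation assumed here.

```latex
\mathcal{L}(x_a) = -\log \frac{\exp\left(x_a^\top x_b / \tau\right)}{\sum_{y \in \{x_b\} \cup \mathcal{N}} \exp\left(x_a^\top y / \tau\right)}
```

The loss is small when x_a is much closer to x_b than to any negative in N.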

The motivation for adding this contrastive loss is to encourage the model to produce globally coherent features that are consistent across different choices of observed subsets without relying on any hand-designed transformations.
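
The split and match steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random features stand in for the decoder outputs, the equal-size random split, the 224×224 input (196 patches), the temperature value, and the helper names `split_patches` and `info_nce` are all assumptions made for illustration.

```python
import numpy as np

def split_patches(num_patches, rng):
    """Randomly partition patch indices into two disjoint subsets A and B.
    Equal-size halves are an assumption of this sketch."""
    perm = rng.permutation(num_patches)
    half = num_patches // 2
    return np.sort(perm[:half]), np.sort(perm[half:])

def info_nce(xa, xb, temperature=0.2):
    """InfoNCE over a batch: (xa[i], xb[i]) is the positive pair,
    and xb[j] for j != i act as negatives for xa[i]."""
    xa = xa / np.linalg.norm(xa, axis=1, keepdims=True)
    xb = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    logits = xa @ xb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
num_patches, dim, batch = 196, 8, 4  # 14 x 14 patches of 16 x 16 pixels (assumed 224 x 224 input)

idx_a, idx_b = split_patches(num_patches, rng)

# Stand-in for per-patch decoder outputs (hypothetical; no ViT is run here).
feats = rng.normal(size=(batch, num_patches, dim))

# Global descriptors: average pooling over the patches of each branch.
xa = feats[:, idx_a].mean(axis=1)
xb = feats[:, idx_b].mean(axis=1)

loss = info_nce(xa, xb)
```

In a real model, `feats` would come from the shared encoder and shallow decoder, and this contrastive term would be added to the MIM inpainting loss.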

2. Results

2.1. Ablation Studies

Pretrained on ImageNet, Evaluated on iNat19

Figure 2: Peak performance is achieved using only 5% of the ImageNet samples, and adding more samples does not provide an additional boost.

Figure 3: Using the 10% ImageNet subset, models are trained for long schedules of nearly 3k epochs, which matches the total number of updates of full-ImageNet training with 300 epochs.
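
The equivalence in the number of updates is simple arithmetic: a dataset 10 times smaller needs 10 times more epochs for the same number of gradient steps. A small sketch follows, assuming the ImageNet-1k training set size of 1,281,167 images and a purely illustrative batch size of 1024 (the batch size is not taken from the paper).

```python
# Assumed numbers: ImageNet-1k train set size and an illustrative batch size.
IMAGENET_TRAIN_IMAGES = 1_281_167
BATCH_SIZE = 1024  # hypothetical, for illustration only

def num_updates(num_images, epochs, batch_size=BATCH_SIZE):
    """Number of gradient updates, dropping the last incomplete batch of each epoch."""
    return (num_images // batch_size) * epochs

full = num_updates(IMAGENET_TRAIN_IMAGES, epochs=300)            # full ImageNet, 300 epochs
subset = num_updates(IMAGENET_TRAIN_IMAGES // 10, epochs=3000)   # 10% subset, 3k epochs
ratio = subset / full  # close to 1: both schedules perform about the same number of updates
```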

Different SSLs

Pre-training with an autoencoder loss such as MIM, as in BEiT and SplitMask, is robust to the reduction in dataset size. In contrast, as with supervised pre-training, the performance of models pre-trained with DINO self-supervision degrades when training with smaller datasets.

Different Tokenizers
  • Replacing the DALL-E tokenizer by simpler choices does NOT lead to any significant degradation in accuracy.

2.2. COCO

  • DINO pre-trained in the same way on COCO images performs relatively weakly, outperforming only random initialization.

SplitMask leads to a consistent improvement compared to the BEiT baseline, such as +0.6 box AP when using a ViT-small and +0.3 mask AP for ViT-base backbones.

2.3. ADE20K


Denoising autoencoders can provide a very competitive performance on such a challenging task even when pre-trained using a relatively small sample size of 20k images.

2.4. Small Datasets

Various Small Target Datasets

SplitMask leads to further improvement in performance on multiple datasets: for example, on the iNaturalist 2018 dataset, accuracy improves by +3.0 with a ViT-base model.

2.5. ImageNet


SplitMask provides strong performance, outperforming both BEiT and MoCo v3 for all backbones.


