Brief Review — SplitMask: Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
SplitMask, by Meta AI, Inria, and Sorbonne University
2021 arXiv v1, Over 90 Citations (Sik-Ho Tsang @ Medium)
- A self-supervised pre-training scenario is considered that leverages only the target task data, in particular small datasets such as Stanford Cars, Sketch, or COCO, which are orders of magnitude smaller than ImageNet.
- SplitMask, a variant of denoising autoencoders similar to BEiT, is proposed; it is more robust to the type and size of the pre-training data.
SplitMask is based on three steps: split, inpaint and match.
- Split: As in standard Vision Transformers (ViTs), an image is first broken down into patches of 16×16 pixels. Then, the patches are split into two disjoint subsets A and B, which are processed independently by a shared deep ViT encoder.
- Inpaint: Next, using the patch representations of subset A and a shallow decoder (e.g., 2 layers), the patches of subset B are "inpainted" by solving a Masked Image Modeling (MIM) task as in BEiT, and vice versa.
- In addition, a global image descriptor is obtained by average pooling of the patch representations from the decoder output of each branch.
- Match: Two representations x_a and x_b, corresponding to the subsets A and B of observed patches, are used to compute an InfoNCE loss (CPCv1).
The motivation for adding this contrastive loss is to encourage the model to produce globally coherent features that are consistent across different choices of observed subsets without relying on any hand-designed transformations.
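The split and match steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the shared ViT encoder and the MIM decoder are stubbed out with random features, the patch count (196) assumes a 224×224 image with 16×16 patches, and the temperature value 0.1 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# 224x224 image with 16x16 patches -> 14*14 = 196 patch tokens.
# The feature dimension 64 is a toy value, not the paper's.
num_patches, dim = 196, 64

# --- Split: partition patch indices into two disjoint subsets A and B ---
perm = rng.permutation(num_patches)
idx_a, idx_b = perm[: num_patches // 2], perm[num_patches // 2:]

# Stand-in "patch representations"; in SplitMask these would come from
# a shared ViT encoder processing each subset independently.
tokens = rng.standard_normal((num_patches, dim))
feats_a, feats_b = tokens[idx_a], tokens[idx_b]

# --- Match: average-pooled descriptors compared with InfoNCE (CPC-style) ---
def info_nce(x, y, temperature=0.1):
    """InfoNCE over a batch of paired global descriptors.

    Positives sit on the diagonal of the (batch, batch) similarity matrix;
    every other image in the batch acts as a negative.
    """
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    y = y / np.linalg.norm(y, axis=-1, keepdims=True)
    logits = x @ y.T / temperature
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    diag = np.arange(len(x))
    return -log_softmax[diag, diag].mean()

# Toy batch of 4 images: each contributes one (x_a, x_b) descriptor pair.
batch = 4
x_a = rng.standard_normal((batch, dim))
x_b = x_a + 0.05 * rng.standard_normal((batch, dim))  # nearly matching views
loss = info_nce(x_a, x_b)
print(float(loss))  # small, since each x_b is close to its own x_a
```

When the two descriptors of the same image agree (and differ from other images in the batch), the loss is near zero, which is exactly the "globally coherent features across subsets" behavior the contrastive term encourages.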
2.1. Ablation Studies
Figure 2: Peak performance is achieved using only 5% of the ImageNet samples, and adding more samples does not provide an additional boost.
Figure 3: Using the 10% ImageNet subset, it can be observed that training for long schedules of nearly 3k epochs (matching the total number of updates of 300 epochs on full ImageNet) further improves performance.
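The schedule length here is simple bookkeeping: shrinking the dataset to 10% while multiplying the epoch count by 10 keeps the total number of gradient updates fixed (batch size cancels out). A quick sanity check:

```python
# Updates per run ∝ (dataset size) × (epochs); batch size cancels out.
# Express dataset size as a percentage of ImageNet to stay in integers.
updates_full = 100 * 300     # 100% of ImageNet, 300 epochs
updates_subset = 10 * 3000   # 10% subset, ~3k epochs
print(updates_subset == updates_full)  # → True
```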
Pre-training with an autoencoder loss such as MIM (as in BEiT and SplitMask) is robust to the reduction in dataset size. In contrast, as with supervised pre-training, the performance of models pre-trained with DINO self-supervision degrades when training with smaller datasets.
- Replacing the DALL-E tokenizer with simpler choices does NOT lead to any significant degradation in accuracy.
- Similar pre-training of DINO using COCO images yields relatively weak performance, only outperforming random initialization.
- Denoising autoencoders can provide very competitive performance on such a challenging task even when pre-trained on a relatively small sample of 20k images.
2.4. Small Datasets
- SplitMask leads to further improvements in performance on multiple datasets: for example, on the iNaturalist 2018 dataset, accuracy improves by +3.0 points with a ViT-Base model.