Brief Review — SEER: Self-supervised Pretraining of Visual Features in the Wild

SEER, Using SwAV, Scaling Up to Larger Model RegNet, and Larger Data Size

Sik-Ho Tsang
4 min read · Aug 28, 2023
SEER by VISSL (Figure from VISSL)

Self-supervised Pretraining of Visual Features in the Wild
SEER, by Facebook AI Research, and Inria
2021 arXiv v2, Over 200 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM]

  • The SElf-supERvised (SEER) model is proposed. It uses the SwAV Self-Supervised Learning (SSL) approach with a RegNetY model of 1.3B parameters. Trained on 1B random images with 512 GPUs, it achieves 84.2% top-1 accuracy on ImageNet, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real-world setting.
  • Later, an even larger SEER 10B model, RG-10B, was trained on an even larger dataset and released in 2022. A newer article, “Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision”, was published for this newer model.


  1. SElf-supERvised (SEER)
  2. Results

1. SElf-supERvised (SEER)

1.1. SwAV Self-Supervised Learning (SSL) Strategy

  • The SwAV SSL strategy is used, but with both the model and the data scaled up.
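As a reminder of what SwAV does, here is a minimal pure-Python sketch of its swapped-prediction idea: each of two augmented views is assigned a soft cluster code with a few Sinkhorn-Knopp iterations, and each view is then trained to predict the code of the other. This is a toy illustration, not the SEER or VISSL implementation. The function names, the tiny score matrices, and the values `epsilon=0.05`, `n_iters=3`, and `temperature=0.1` are assumptions chosen for the example.

```python
import math

def softmax(row, temperature):
    """Softmax over prototype scores for one sample."""
    scaled = [v / temperature for v in row]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn a B x K matrix of prototype scores into soft codes.

    Alternately normalizes columns (prototypes) and rows (samples) so that
    prototypes are used roughly equally. Each returned row sums to 1.
    """
    n_samples, n_protos = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Give every prototype (column) the same total mass, 1/K.
        col_sums = [sum(q[i][k] for i in range(n_samples)) for k in range(n_protos)]
        q = [[q[i][k] / (col_sums[k] * n_protos) for k in range(n_protos)]
             for i in range(n_samples)]
        # Give every sample (row) the same total mass, 1/B.
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n_samples) for v in row] for i, row in enumerate(q)]
    # Rescale so that each sample's code sums to 1.
    return [[v * n_samples for v in row] for row in q]

def swapped_loss(scores_a, scores_b, temperature=0.1):
    """Cross-entropy between the code of one view and the prediction of the other."""
    codes_a, codes_b = sinkhorn(scores_a), sinkhorn(scores_b)
    loss = 0.0
    for i in range(len(scores_a)):
        p_a = softmax(scores_a[i], temperature)
        p_b = softmax(scores_b[i], temperature)
        loss -= sum(q * math.log(p) for q, p in zip(codes_b[i], p_a))
        loss -= sum(q * math.log(p) for q, p in zip(codes_a[i], p_b))
    return loss / (2 * len(scores_a))

# Hypothetical cosine similarities between 4 samples and 3 prototypes,
# one matrix per augmented view.
view_a = [[0.9, 0.1, -0.2], [0.0, 0.8, 0.1], [-0.3, 0.2, 0.7], [0.5, 0.4, 0.1]]
view_b = [[0.8, 0.2, -0.1], [0.1, 0.7, 0.0], [-0.2, 0.1, 0.9], [0.4, 0.5, 0.0]]

codes = sinkhorn(view_a)
loss = swapped_loss(view_a, view_b)
print(codes)
print(loss)
```

In a real setup the scores come from a network and learnable prototypes, and the loss is backpropagated through the predictions only, not through the codes.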

1.2. RegNetY

  • The RegNetY-256GF architecture is used. It has 4 stages with stage depths (2, 7, 17, 1) and stage widths (528, 1056, 2904, 7392), leading to a total of 695.5M parameters.
  • A single training iteration over 8,704 images takes 6,125 ms on 512 NVIDIA V100 32GB GPUs.
  • Training this model on 1 billion images requires 114,890 training iterations at a batch size of 8,704 images, which amounts to about 8 days of training on 512 GPUs.
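The numbers above can be cross-checked with a few lines of arithmetic. The sketch below only uses the figures quoted in this section (1 billion images, batch size 8,704, 6,125 ms per iteration, 512 GPUs); the function name `training_cost` and the dictionary keys are hypothetical, and the per-GPU batch assumes the global batch is split evenly.

```python
import math

def training_cost(n_images, batch_size, seconds_per_iter, n_gpus):
    """Estimate iterations and wall-clock time for one pass over the data."""
    iterations = math.ceil(n_images / batch_size)
    days = iterations * seconds_per_iter / 86_400  # 86,400 seconds per day
    return {
        "iterations": iterations,
        "days": days,
        "gpu_days": days * n_gpus,
        # Assumes the global batch is divided evenly across GPUs.
        "images_per_gpu": batch_size / n_gpus,
    }

cost = training_cost(
    n_images=1_000_000_000,
    batch_size=8_704,
    seconds_per_iter=6.125,
    n_gpus=512,
)
print(cost)
```

This reproduces the 114,890 iterations and gives a little over 8 days of wall-clock time, with 17 images per GPU per iteration.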

1.3. Reducing Memory Consumption per GPU

  • Gradient checkpointing and mixed precision are used.
  • The O1 optimization level from the NVIDIA Apex library is used to perform operations such as GEMMs and convolutions in 16-bit floating-point precision.
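To get a feel for why gradient checkpointing helps, here is a rough back-of-the-envelope sketch. It is not the paper's or VISSL's implementation. It only counts stored activations, under the simplifying assumptions that every block produces an activation of the same size and that a checkpoint is placed every `segment` blocks. The 27 blocks come from the stage depths (2 + 7 + 17 + 1) listed above; `stored_activations` and `segment` are hypothetical names, and the square-root segment size is a common heuristic assumed here.

```python
import math

def stored_activations(n_blocks, segment=None):
    """Rough count of block activations held in memory for the backward pass.

    Assumes all blocks produce equally sized activations (a simplification).
    """
    if segment is None:
        # No checkpointing: every block's activation is kept for backward.
        return n_blocks
    # Checkpointing: keep one activation per segment, plus the activations
    # of the single segment that is being recomputed at any given time.
    return math.ceil(n_blocks / segment) + segment

n_blocks = 2 + 7 + 17 + 1              # RegNetY-256GF stage depths
segment = round(math.sqrt(n_blocks))   # assumed heuristic: about sqrt(n) blocks

baseline = stored_activations(n_blocks)
checkpointed = stored_activations(n_blocks, segment)
print(baseline, checkpointed)
```

Under these assumptions, checkpointing cuts the stored activations from 27 to 11 block-equivalents, at the price of recomputing forward passes during backward. Mixed precision attacks the other factor, the number of bytes per stored value.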

1.4. Optimizing Training Speed

  • An optimized SyncBatchNorm is used.
  • To synchronize BatchNorm layers across GPUs, process groups are created instead of performing a global sync, which is slow.
  • The dataloader pre-fetches more training batches, leading to higher data throughput than the default PyTorch dataloader.
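The process-group idea for SyncBatchNorm can be illustrated by how GPU ranks are partitioned. The sketch below is an assumption-laden illustration: the group size of 8 is hypothetical (this review does not state the value used), and `sync_bn_groups` is a made-up helper name. In PyTorch, each rank list could be passed to `torch.distributed.new_group(ranks=...)`, and the resulting group handed to `torch.nn.SyncBatchNorm` through its `process_group` argument.

```python
def sync_bn_groups(world_size, group_size):
    """Partition ranks 0..world_size-1 into consecutive groups.

    BatchNorm statistics would then be synchronized only inside each group
    instead of across all GPUs.
    """
    if world_size % group_size != 0:
        raise ValueError("world_size must be divisible by group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]

# 512 GPUs as in the article; group_size=8 is an assumed value.
groups = sync_bn_groups(world_size=512, group_size=8)
print(len(groups), groups[0], groups[-1])
```

With these assumed numbers, each BatchNorm layer exchanges statistics among 8 GPUs rather than 512, which keeps the synchronization cost roughly independent of the total number of GPUs.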

1.5. Large Scale Pretraining Data

  • For billion-scale pretraining, a dataloader that directly samples random, public, non-EU images from Instagram is used.
  • Because training is done online and in the wild, no curation or pre-processing, such as hashtag filtering or de-duplication, is applied to the images. The dataset is not static and is refreshed every 90 days; however, model performance does not degrade.

2. Results

2.1. ImageNet


SEER achieves 84.2% top-1 accuracy on ImageNet, surpassing the best existing pretrained model, from SimCLRv2, by 1%.

Low-Shot Learning on ImageNet

SEER achieves a top-1 accuracy of 77.9% with only 10% of ImageNet, which is competitive with the other methods (2% gap). On 1% of the data, i.e., 10K images, the gap increases significantly, but note that the other methods use the full ImageNet for pretraining.

Different Architectures

A larger model capacity is needed for a large dataset. Overall, RegNets surpass the other architectures.

2.2. Downstream Tasks

Downstream Classification

SEER's self-supervised features transfer better than supervised features, regardless of the pretraining data.

Downstream Detection and Segmentation

Self-supervised pretraining outperforms supervised pretraining by 1.5 to 2 AP points. However, the performance gap between different architectures is small (0.1 to 0.5 AP) compared to what is observed on ImageNet.


