Brief Review — SEER 10B: Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

RG-10B: SEER Scaled to a 10B-Parameter Model and Trained on a Larger Dataset

Sik-Ho Tsang
4 min read · Aug 30


SEER 10B (Figure from SEER GitHub)

Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision
SEER 10B, RG-10B, by Meta AI Research and Inria
2022 arXiv v2, Over 60 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM]

  • SEER is trained on 1 billion random images, a much larger dataset, without any data pre-processing or prior assumptions about what the model should learn.
  • The model is scaled to a dense 10-billion-parameter network, RG-10B, to avoid underfitting on such a large dataset.


  1. SEER 10B, RG-10B
  2. Results

1. SEER 10B, RG-10B

1.1. Dataset

Geographical and Gender Distribution
  • 1 billion public, non-EU (to conform to GDPR) Instagram (IG) images are randomly selected for training.
  • The dataset is unfiltered, but the resulting geographical and gender distribution is monitored on a randomly selected subset of 10M images.
  • 192 different countries and various genders are represented.

1.2. Model

  • SwAV is used as the self-supervised learning (SSL) strategy.
  • RegNet is scaled up to train on the above 1-billion-image dataset.
  • SwAV with a scaled-up RegNet is SEER; here, it becomes SEER 10B.
  • A large number of RegNet variants are designed and trained to find the best large RegNet.

In the end, the resolution is kept fixed and the width (and/or depth) of the base model is increased to scale it to 10 billion parameters. For better training speed, the authors ultimately kept the depth the same and increased the width.
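To see why widening adds parameters so quickly, here is a toy count. It assumes a plain stack of identical 3×3 convolutions, which is not the actual RegNet design (RegNet uses bottleneck blocks with group convolutions), and the names `conv_params` and `toy_stack_params` are made up for this sketch.

```python
def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weight count of one kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def toy_stack_params(width: int, depth: int, kernel: int = 3) -> int:
    """Parameters of `depth` identical conv layers, each mapping width -> width channels."""
    return depth * conv_params(width, width, kernel)


base = toy_stack_params(width=256, depth=10)
wider = toy_stack_params(width=512, depth=10)   # 2x width, same depth
deeper = toy_stack_params(width=256, depth=20)  # same width, 2x depth

print(wider / base)   # 4.0: parameters grow quadratically with width
print(deeper / base)  # 2.0: parameters grow linearly with depth
```

In this simplified setting, doubling the width quadruples the parameter count without adding any sequential layers, whereas doubling the depth only doubles it and lengthens the chain of layers that must run one after another.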

1.3. Some Training Setup

Training Setup
  • [Model sharding] Fully Sharded Data Parallel (FSDP) [101, 133] training is used, which shards each layer of the model across the data-parallel workers (GPUs).
  • [Activation Checkpointing Automation] Activation checkpointing [23] is the technique of trading compute for memory. Activation Checkpointing Automation uses dynamic programming to find the optimal checkpointing points, so that the memory reduction has minimal impact.
  • [Optimizing Training Speed] Mixed precision is used for training, with the forward-pass computations performed in FP16. Certain special layers, such as SyncBatchNorm, still use FP32.
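The idea of choosing checkpoint positions with dynamic programming can be sketched on a toy problem. The exact objective used by the authors is not given in this review, so the sketch below assumes a simple one: split a chain of layers into a fixed number of contiguous segments so that the largest segment (the activations that must be recomputed and held at once) is as small as possible. The function name `plan_checkpoints` and the activation sizes are hypothetical.

```python
from itertools import accumulate


def plan_checkpoints(act_mem: list[int], num_segments: int) -> tuple[int, list[int]]:
    """Split a chain of layers into contiguous segments, minimizing the largest one.

    Returns (peak segment memory, indices of layers after which a checkpoint sits).
    Assumes 1 <= num_segments <= len(act_mem).
    """
    n = len(act_mem)
    prefix = [0] + list(accumulate(act_mem))
    inf = float("inf")
    # best[k][i]: minimal peak when the first i layers form k segments
    best = [[inf] * (n + 1) for _ in range(num_segments + 1)]
    cut = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0
    for k in range(1, num_segments + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):  # last segment covers layers j .. i-1
                peak = max(best[k - 1][j], prefix[i] - prefix[j])
                if peak < best[k][i]:
                    best[k][i], cut[k][i] = peak, j
    # Walk the cut table backwards to recover the segment boundaries.
    boundaries, i = [], n
    for k in range(num_segments, 1, -1):
        i = cut[k][i]
        boundaries.append(i - 1)
    return best[num_segments][n], sorted(boundaries)


activation_mem = [4, 2, 6, 3, 5, 1, 7, 2]  # hypothetical per-layer activation sizes
peak, checkpoints = plan_checkpoints(activation_mem, num_segments=3)
print(peak, checkpoints)
```

For this toy input the smallest achievable peak is 12 units, compared with 30 units if all activations were kept. A real planner would also have to account for recomputation time, which this sketch ignores.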

2. Results

2.1. Fairness

Gender, Skintone, Age Groups Fairness
  • With a large model trained on a large dataset, fairness is expected to improve.

SEER models have the lowest disparity between different genders and skintones. As the model size increases, the disparity decreases.

Label Association Fairness

SEER models make the most Human predictions. As the SEER model size increases, the association with the Human label increases significantly.

Geographical Fairness

The improvement of SEER models over the supervised baseline is smallest for high-income households and the American / European regions. At the same time, the relative improvement in accuracy is significant for the other groups.

2.2. Hate Speech Detection

Hate Speech Detection
  • The image features are concatenated with the BERT text features, and an MLP head is trained on top.

SEER outperforms models trained with supervision on ImageNet by more than 2 points.
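The fusion step described above can be sketched in a few lines. This is only a forward pass with random weights, meant to show the data flow (concatenate, then MLP); the feature sizes, the two-layer head, and the helper names `linear`, `init_layer`, and `fusion_head` are assumptions for illustration, not the configuration used in the paper.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fusion_head(image_feat, text_feat, params):
    """Concatenate both modalities, then apply a 2-layer MLP with a sigmoid output."""
    (w1, b1), (w2, b2) = params
    fused = image_feat + text_feat                         # list concatenation
    hidden = [max(0.0, h) for h in linear(fused, w1, b1)]  # ReLU
    logit = linear(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))                  # probability of the positive class


rng = random.Random(0)
IMAGE_DIM, TEXT_DIM, HIDDEN = 8, 6, 4  # hypothetical toy sizes
image_feat = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]  # stand-in for SEER image features
text_feat = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]    # stand-in for BERT text features
params = (init_layer(rng, IMAGE_DIM + TEXT_DIM, HIDDEN), init_layer(rng, HIDDEN, 1))

prob = fusion_head(image_feat, text_feat, params)
print(f"{prob:.3f}")
```

In practice the head's weights would be learned from labeled examples; only the concatenation and the shape of the head are shown here.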

2.3. Linear Probe

Linear Probe
  • For the same model size (RG-128Gf), the SEER model trained on random images in the wild outperforms self-supervised models trained on ImageNet on 17 out of 25 tasks.

The best SEER model (RG-10B) also surpasses the best state-of-the-art self-supervised models (of any size, approach, data, and architecture) on 17 out of 25 tasks and is competitive (within 1% accuracy) on 5 of the remaining 8 tasks.
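For readers new to the term, a linear probe trains only a linear classifier on top of features from a frozen backbone. Below is a minimal sketch using plain logistic regression with gradient descent. The two Gaussian blobs are stand-ins for backbone features, and the learning rate, epoch count, and function names are illustrative assumptions, not the evaluation protocol of the paper.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression classifier on fixed features with gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return class 1 if the linear score is positive, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


rng = random.Random(0)
# Stand-ins for frozen backbone outputs: two well-separated Gaussian blobs.
features = [[rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)] for _ in range(50)]
features += [[rng.gauss(2, 0.5), rng.gauss(2, 0.5)] for _ in range(50)]
labels = [0] * 50 + [1] * 50

w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
print(accuracy)
```

Because the backbone is never updated, the probe's accuracy reflects how linearly separable the pretrained features already are for the task.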


