Review — Scaling and Benchmarking Self-Supervised Visual Representation Learning

Jigsaw & Colorization Benchmarking

Sik-Ho Tsang
5 min read · Jul 24, 2022

Scaling and Benchmarking Self-Supervised Visual Representation Learning, Goyal ICCV’19, by Facebook AI Research
2019 ICCV, Over 200 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification

  • Two popular self-supervised approaches, Jigsaw and Colorization, are scaled along several axes, including data size (up to 100 million images), model capacity, and problem ‘hardness’.


  1. Jigsaw & Colorization Preliminaries
  2. Axis 1: Scaling the Pretraining Data Size
  3. Axis 2: Scaling the Model Capacity
  4. Axis 3: Scaling the Problem Complexity
  5. Putting It Together
  6. Pre-training and Transfer Domain Relation
  7. Benchmarking

1. Jigsaw & Colorization Preliminaries

1.1. Jigsaw

Learning image representations by solving Jigsaw puzzles
  • Jigsaw divides an image I into N=9 non-overlapping square patches.
  • A ‘puzzle’ is then created by shuffling these patches randomly.
  • Each patch is fed to an N-way Siamese ConvNet with shared parameters to obtain patch representations, which are then used to predict the permutation that created the puzzle.
  • As the total number of permutations N! can be large, a fixed subset P of the total N! permutations is used. The prediction problem is reduced to classification into one of |P| classes.
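The puzzle construction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the image is a small 2D list of values, and the subset P is sampled uniformly at random, which is not necessarily how the paper selects its permutations.

```python
import math
import random

def make_permutation_set(n_patches=9, set_size=100, seed=0):
    """Sample a fixed subset P of the n_patches! permutations.

    Assumption: uniform random sampling; the paper may pick P differently.
    """
    if set_size > math.factorial(n_patches):
        raise ValueError("set_size exceeds the number of permutations")
    rng = random.Random(seed)
    perms = set()
    while len(perms) < set_size:
        perm = list(range(n_patches))
        rng.shuffle(perm)
        perms.add(tuple(perm))
    return sorted(perms)

def split_into_patches(image, grid=3):
    """Split a square 2D image (list of rows) into grid*grid patches, row-major."""
    side = len(image) // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * side:(gy + 1) * side]
            patches.append([row[gx * side:(gx + 1) * side] for row in rows])
    return patches

def make_puzzle(image, permutation_set, rng):
    """Return the shuffled patches and the class label (index into P)."""
    label = rng.randrange(len(permutation_set))
    perm = permutation_set[label]
    patches = split_into_patches(image)
    # Position j of the puzzle holds original patch perm[j].
    shuffled = [patches[i] for i in perm]
    return shuffled, label

# Toy usage: a 6x6 "image" and |P| = 100 classes.
toy_image = [[r * 6 + c for c in range(6)] for r in range(6)]
P = make_permutation_set(n_patches=9, set_size=100)
puzzle, target = make_puzzle(toy_image, P, random.Random(1))
```

The network then only has to output one of |P| class indices, which is why enlarging P makes the pretext task harder.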

1.2. Colorization

Learning image representations by Colorization
  • Colorization learns an image representation by predicting the color values of an input ‘grayscale’ image. In the Lab color space, a model is trained to predict the ab color channels from the L (lightness) channel.
  • The output ab space is quantized into |Q| = 313 discrete bins, which reduces the problem to a |Q|-way classification problem. The target ab image Y is soft-encoded into the |Q| bins by looking at the K nearest neighbor bins (default K=10).
  • The ConvNet is trained to predict the soft-encoded target Z^K from the input lightness image X.
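To make the soft-encoding step concrete, here is a minimal sketch for a single (a, b) pixel value. Several things are assumptions and not taken from the paper: the Gaussian weighting with sigma=5, the function name soft_encode, and the regular grid of bin centers, which gives more than 313 bins because it does not restrict itself to in-gamut colors.

```python
import math

def soft_encode(ab, bin_centers, k=10, sigma=5.0):
    """Soft-encode one (a, b) value over its k nearest bins.

    Returns a list with one weight per bin; only k entries are non-zero
    and the weights sum to 1. Gaussian weighting is an assumption.
    """
    dists = [math.dist(ab, center) for center in bin_centers]
    nearest = sorted(range(len(bin_centers)), key=dists.__getitem__)[:k]
    weights = [math.exp(-dists[i] ** 2 / (2 * sigma ** 2)) for i in nearest]
    total = sum(weights)
    encoded = [0.0] * len(bin_centers)
    for i, w in zip(nearest, weights):
        encoded[i] = w / total
    return encoded

# Hypothetical bin centers: a regular grid with step 10 over the ab plane.
centers = [(a, b) for a in range(-110, 111, 10) for b in range(-110, 111, 10)]
z = soft_encode((3.0, -4.0), centers, k=10)
```

A larger K spreads the target over more bins, which is the knob the paper uses to change the Colorization problem complexity.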

2. Axis 1: Scaling the Pretraining Data Size

A list of self-supervised pre-training datasets
Scaling the Pre-training Data Size
  • Models are trained on various subsets of the YFCC-100M dataset: YFCC-[1, 10, 50, 100] million images. The problem complexity (|P|=2000, K=10) is kept fixed. The learned features are then transferred to the VOC07 classification task.

Increasing the size of pre-training data improves the transfer learning performance for both the Jigsaw and Colorization methods on ResNet-50 and AlexNet.

  • The performance of the Jigsaw model saturates (log-linearly) as the data scale is increased from 1M to 100M.
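One way to read "log-linear" is that transfer performance grows roughly linearly in the logarithm of the dataset size, so each 10x increase in data buys a similar fixed gain. The sketch below fits such a trend by least squares. The function name is hypothetical, and the numbers are synthetic points generated from a known line, not results from the paper.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size)."""
    xs = [math.log10(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

# Synthetic illustration only: points lying exactly on 40 + 2 * log10(N).
sizes = [1e6, 1e7, 5e7, 1e8]
scores = [40 + 2 * math.log10(n) for n in sizes]
a, b = fit_log_linear(sizes, scores)
# b is the gain per 10x increase in pre-training data.
```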

3. Axis 2: Scaling the Model Capacity

  • An important observation is that the performance gap between AlexNet and ResNet-50 keeps increasing as the pre-training dataset grows.

This suggests that higher capacity models are needed to take full advantage of the larger pre-training datasets.

4. Axis 3: Scaling the Problem Complexity

Scaling Problem Complexity
  • For Jigsaw, the number of permutations |P| is varied: [100, 701, 2k, 5k, 10k].
  • For Colorization, the number of nearest neighbors K used for the soft-encoding is varied.

For the Jigsaw approach, we see an improvement in transfer learning performance as the size of the permutation set increases.

  • ResNet-50 shows a 5 point mAP improvement while AlexNet shows a smaller 1.9 point improvement.

The Colorization approach appears to be less sensitive to changes in problem complexity.

  • We see a 2 point mAP variation across different values of K.

5. Putting It Together

Scaling Data and Problem Complexity

Transfer learning performance increases along all three axes: increasing problem complexity still gives a performance boost on ResNet-50 even at the 100M data size.

  • The performance gains for increasing problem complexity are almost negligible for AlexNet but significantly higher for ResNet-50.

6. Pre-training and Transfer Domain Relation

Relationship between pre-training and transfer domain
  • On the VOC07 classification task, pre-training on ImageNet-22k (14M images) transfers as well as pre-training on YFCC-100M (100M images).
  • However, on the Places205 classification task, pre-training on YFCC-1M (1M images) transfers as well as pre-training on ImageNet-22k (14M images).

The domain (image distribution) of ImageNet is closer to VOC07 (both are object-centric) whereas YFCC is closer to Places205 (both are scene-centric).

7. Benchmarking

7.1. Image Classification

ResNet-50 top-1 center-crop accuracy for linear classification on Places205 dataset
ResNet-50 Linear SVMs mAP on VOC07 classification

There is a significant accuracy gap between self-supervised and supervised methods despite the proposed scaling efforts.
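Since the VOC07 results are reported as mAP, a short sketch of the metric may help. This version computes non-interpolated average precision per class and averages over classes. It is an assumption that this matches the exact evaluation protocol; VOC07 evaluations often use an interpolated variant, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: classifier scores per image; labels: 1 for positive, 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class_scores, per_class_labels):
    """Average the per-class AP values over all classes."""
    aps = [average_precision(s, l)
           for s, l in zip(per_class_scores, per_class_labels)]
    return sum(aps) / len(aps)

# Toy usage with two classes and made-up scores.
toy_map = mean_average_precision(
    [[0.9, 0.8, 0.7, 0.6], [0.9, 0.8, 0.1]],
    [[1, 0, 1, 0], [1, 1, 0]],
)
```

In the benchmark setting, the scores would come from linear SVMs trained on frozen features, one classifier per VOC07 class.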

7.2. Few-Shot Learning

Few-Shot Image Classification on the VOC07 and Places205

The self-supervised features are competitive with their supervised counterparts in the few-shot setting on both datasets.

7.3. Object Detection

Detection mAP for frozen conv body on VOC07 and VOC07+12 using Fast R-CNN with ResNet-50-C4

Self-supervised initialization is competitive with the ImageNet pre-trained initialization on VOC07 dataset even when fewer parameters are fine-tuned on the detection task.

Detection mAP for full fine-tuning on VOC07 and VOC07+12 using Fast R-CNN with ResNet-50-C4

With full fine-tuning, self-supervised model initialization matches the performance of the supervised initialization on both VOC07 and VOC07+12.

  • The authors note that future work should focus on designing tasks that are complex enough to exploit large-scale data and increased model capacity. The experiments suggest that scaling self-supervision is crucial, but there is still a long way to go.

(Other pretext tasks have been proposed in recent years, e.g., iGPT and BEiT.)


[2019 ICCV] [Goyal ICCV’19]
Scaling and Benchmarking Self-Supervised Visual Representation Learning
