Review — Scaling and Benchmarking Self-Supervised Visual Representation Learning

Jigsaw & Colorization Benchmarking

Sik-Ho Tsang
5 min read · Jul 24, 2022

Scaling and Benchmarking Self-Supervised Visual Representation Learning, Goyal ICCV’19, by Facebook AI Research
2019 ICCV, Over 200 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification

  • Two popular self-supervised approaches, Jigsaw and Colorization, are scaled along several axes (including data size and problem ‘hardness’), e.g. up to 100 million images.

Outline

  1. Jigsaw & Colorization Preliminaries
  2. Axis 1: Scaling the Pretraining Data Size
  3. Axis 2: Scaling the Model Capacity
  4. Axis 3: Scaling the Problem Complexity
  5. Putting It Together
  6. Pre-training and Transfer Domain Relation
  7. Benchmarking

1. Jigsaw & Colorization Preliminaries

1.1. Jigsaw

Learning image representations by solving Jigsaw puzzles
  • Jigsaw divides an image I into N=9 non-overlapping square patches.
  • A ‘puzzle’ is then created by shuffling these patches randomly.
  • Each patch is fed to an N-way Siamese ConvNet with shared parameters to obtain patch representations, which are then used to predict the permutation that created the puzzle.
  • As the total number of permutations N! can be large, a fixed subset P of the N! permutations is used, which reduces the prediction problem to classification into one of |P| classes (see the sketch after this list).
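
Below is a minimal PyTorch sketch of the Jigsaw pretext task. The tiny trunk, the randomly sampled permutation subset, and the patch/puzzle sizes are illustrative assumptions only; the paper uses AlexNet / ResNet-50 trunks and selects P by maximal Hamming distance.

```python
import random
import torch
import torch.nn as nn

# Illustrative settings only: the paper uses N = 9 patches and scales |P| up to 10k.
N_PATCHES = 9           # 3x3 grid of patches
NUM_PERMUTATIONS = 100  # |P|, the size of the fixed permutation subset
PATCH_SIZE = 64

# A fixed subset P of the 9! possible permutations. The paper selects them by
# maximal Hamming distance; random sampling is used here purely for brevity.
PERMUTATIONS = [tuple(random.sample(range(N_PATCHES), N_PATCHES))
                for _ in range(NUM_PERMUTATIONS)]

class JigsawNet(nn.Module):
    """Siamese ConvNet: one shared trunk encodes every patch, then a head
    classifies which of the |P| permutations produced the puzzle."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(  # stand-in for the AlexNet / ResNet-50 trunk
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(N_PATCHES * feat_dim, NUM_PERMUTATIONS)

    def forward(self, patches):  # patches: (B, 9, 3, H, W)
        feats = [self.trunk(patches[:, i]) for i in range(N_PATCHES)]
        return self.head(torch.cat(feats, dim=1))  # (B, |P|) permutation logits

def make_puzzle(image):
    """Cut a (3, 192, 192) image into 9 non-overlapping patches and shuffle them
    with a permutation drawn from P; the permutation index is the class label."""
    p = image.unfold(1, PATCH_SIZE, PATCH_SIZE).unfold(2, PATCH_SIZE, PATCH_SIZE)
    p = p.permute(1, 2, 0, 3, 4).reshape(N_PATCHES, 3, PATCH_SIZE, PATCH_SIZE)
    label = random.randrange(NUM_PERMUTATIONS)
    return p[list(PERMUTATIONS[label])], label

# Toy forward/backward pass on one puzzle.
model = JigsawNet()
patches, label = make_puzzle(torch.rand(3, 3 * PATCH_SIZE, 3 * PATCH_SIZE))
logits = model(patches.unsqueeze(0))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([label]))
loss.backward()
```

Scaling the problem complexity in Axis 3 then simply means enlarging NUM_PERMUTATIONS, i.e. |P|.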

1.2. Colorization

Learning image representations by Colorization
  • Colorization learns an image representation by predicting the color values of an input ‘grayscale’ image. Working in the Lab color space, a model is trained to predict the ab color channels given the L (lightness) channel.
  • The output ab space is quantized into a set Q of 313 discrete bins, which reduces the problem to a |Q|-way classification problem. The target ab image Y is soft-encoded into the |Q| bins by looking at the K nearest-neighbor bins (default value K=10).
  • The ConvNet is trained to predict this soft-encoded target Z^K from the input lightness image X, as sketched below.
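
A minimal PyTorch sketch of the soft-encoding step follows. The random bin centers, the σ value, and the toy loss are placeholders for illustration (the real |Q| = 313 centers form a fixed grid of in-gamut ab values), but it shows how the K-nearest-neighbor encoding turns colorization into a per-pixel |Q|-way classification.

```python
import torch

K = 10        # number of nearest-neighbor bins used for soft-encoding
SIGMA = 5.0   # Gaussian kernel width; an assumed value, following common colorization practice

# Stand-in for the |Q| = 313 ab bin centers. The real centers form a fixed grid
# of in-gamut ab values; random points are used here purely for illustration.
bin_centers = torch.rand(313, 2) * 220 - 110           # (|Q|, 2) ab coordinates

def soft_encode(ab, centers=bin_centers, k=K, sigma=SIGMA):
    """Soft-encode ground-truth ab values (N, 2) into targets over the |Q| bins:
    each pixel spreads its probability mass over its k nearest bin centers with
    Gaussian weights, normalized to sum to 1; all other bins stay zero."""
    d2 = torch.cdist(ab, centers) ** 2                  # (N, |Q|) squared distances
    knn_d2, knn_idx = d2.topk(k, dim=1, largest=False)  # k nearest bins per pixel
    w = torch.exp(-knn_d2 / (2 * sigma ** 2))
    w = w / w.sum(dim=1, keepdim=True)
    target = torch.zeros(ab.shape[0], centers.shape[0])
    target.scatter_(1, knn_idx, w)                      # (N, |Q|) soft targets
    return target

# Toy usage: soft targets for 5 pixels, scored against per-pixel |Q|-way
# predictions from a hypothetical colorization ConvNet.
ab_pixels = torch.rand(5, 2) * 220 - 110
Z = soft_encode(ab_pixels)                              # (5, 313)
logits = torch.randn(5, 313, requires_grad=True)        # stand-in network output
loss = -(Z * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
loss.backward()
```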

2. Axis 1: Scaling the Pretraining Data Size

A list of self-supervised pre-training datasets
Scaling the Pre-training Data Size
  • Various subsets of the YFCC-100M dataset (YFCC-[1, 10, 50, 100] million images) are used for pre-training, with the problem complexity (|P|=2000, K=10, etc.) kept fixed. The learned representations are then transferred to the VOC07 classification task.

Increasing the size of pre-training data improves the transfer learning performance for both the Jigsaw and Colorization methods on ResNet-50 and AlexNet.

  • The performance of the Jigsaw model saturates (log-linearly) as the data scale is increased from 1M to 100M.

3. Axis 2: Scaling the Model Capacity

  • An important observation is made that the performance gap between AlexNet and ResNet-50 (as a function of the pre-training dataset size) keeps increasing.

This suggests that higher capacity models are needed to take full advantage of the larger pre-training datasets.

4. Axis 3: Scaling the Problem Complexity

Scaling Problem Complexity
  • For Jigsaw, the number of permutations |P| is varied: [100, 701, 2k, 5k, 10k].
  • For Colorization, the number of nearest neighbors K used for the soft-encoding is varied.

For the Jigsaw approach, we see an improvement in transfer learning performance as the size of the permutation set increases.

  • ResNet-50 shows a 5 point mAP improvement while AlexNet shows a smaller 1.9 point improvement.

The Colorization approach appears to be less sensitive to changes in problem complexity.

  • We see only about a 2 point mAP variation across different values of K.

5. Putting It Together

Scaling Data and Problem Complexity

Transfer learning performance improves along all three axes, i.e., increasing the problem complexity still gives a performance boost on ResNet-50 even at the 100M data size.

  • The performance gains for increasing problem complexity are almost negligible for AlexNet but significantly higher for ResNet-50.

6. Pre-training and Transfer Domain Relation

Relationship between pre-training and transfer domain
  • On the VOC07 classification task, pre-training on ImageNet-22k (14M images) transfers as well as pre-training on YFCC-100M (100M images).
  • However, on the Places205 classification task, pre-training on YFCC-1M (1M images) transfers as well as pre-training on ImageNet-22k (14M images).

The domain (image distribution) of ImageNet is closer to VOC07 (both are object-centric) whereas YFCC is closer to Places205 (both are scene-centric).

7. Benchmarking

7.1. Image Classification

ResNet-50 top-1 center-crop accuracy for linear classification on Places205 dataset
ResNet-50 Linear SVMs mAP on VOC07 classification

There is a significant accuracy gap between self-supervised and supervised methods despite the proposed scaling efforts.
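
For context, the VOC07 numbers above come from training linear SVMs on frozen ConvNet features (Places205 uses linear classifiers). A minimal sketch of that frozen-feature SVM protocol, with placeholder features and an assumed cost parameter, could look like this:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder "frozen" features: in practice these come from a fixed layer of the
# self-supervised ResNet-50 (no fine-tuning), one row per image.
rng = np.random.default_rng(0)
train_feats = rng.standard_normal((1000, 2048)).astype(np.float32)
train_labels = rng.integers(0, 2, size=1000)   # one-vs-rest labels for a single class
test_feats = rng.standard_normal((200, 2048)).astype(np.float32)

# The VOC07 benchmark trains one such binary linear SVM per class (20 in total);
# the cost parameter C here is an assumed value, normally chosen by cross-validation.
svm = LinearSVC(C=0.01)
svm.fit(train_feats, train_labels)
scores = svm.decision_function(test_feats)     # per-image scores used to compute AP
```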

7.2. Few-Shot Learning

Few-Shot Image Classification on the VOC07 and Places205 datasets

The self-supervised features are competitive with their supervised counterpart in the few-shot setting on both datasets.

7.3. Object Detection

Detection mAP for frozen conv body on VOC07 and VOC07+12 using Fast R-CNN with ResNet-50-C4

Self-supervised initialization is competitive with the ImageNet pre-trained initialization on VOC07 dataset even when fewer parameters are fine-tuned on the detection task.

Detection mAP for full fine-tuning on VOC07 and VOC07+12 using Fast R-CNN with ResNet-50-C4

With full fine-tuning, self-supervised model initialization matches the performance of the supervised initialization on both VOC07 and VOC07+12.

  • The authors mention that future work should focus on designing tasks that are complex enough to exploit large-scale data and increased model capacity. The experiments suggest that scaling self-supervision is crucial, but there is still a long way to go.

(There are other pretext tasks proposed in recent years, e.g. iGPT and BEiT.)

Reference

[2019 ICCV] [Goyal ICCV’19]
Scaling and Benchmarking Self-Supervised Visual Representation Learning

Self-Supervised Learning

1993 … 2019 [Goyal ICCV’19] 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] [BYOL+GN+WS] 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe]

My Other Previous Paper Readings

