Review — Scaling and Benchmarking Self-Supervised Visual Representation Learning
Scaling and Benchmarking Self-Supervised Visual Representation Learning, Goyal ICCV’19, by Facebook AI Research
2019 ICCV, Over 200 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification
1. Jigsaw & Colorization Preliminaries
- Jigsaw divides an image I into N=9 non-overlapping square patches.
- A ‘puzzle’ is then created by shuffling these patches randomly.
- Each patch is fed to an N-way Siamese ConvNet with shared parameters to obtain patch representations, which are then used to predict the permutation that created the puzzle.
- As the total number of permutations N! can be large, a fixed subset P of the total N! permutations is used. The prediction problem is reduced to classification into one of |P| classes.
- Colorization learns an image representation by predicting the color values of an input ‘grayscale’ image. Working in the Lab color space, a model is trained to predict the ab color channels from the L (lightness) channel.
- The output ab space is quantized into a set of discrete bins Q, with |Q| = 313, which reduces the problem to a |Q|-way classification problem. The target ab image Y is soft-encoded into |Q| bins by looking at the K nearest neighbor bins (default value K=10).
- The ConvNet is trained to predict Z^K, the soft-encoded target, from the input lightness image X.
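The Jigsaw steps above (split into N=9 patches, shuffle with a permutation drawn from a fixed subset P, use the index of that permutation as the class label) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the image is a plain nested list, and the subset P is drawn uniformly at random, which is an assumption made only to keep the sketch short.

```python
import random


def sample_permutation_set(n_patches, set_size, seed=0):
    """Draw `set_size` distinct permutations of range(n_patches).

    Assumption: uniform random sampling, chosen here for simplicity.
    """
    rng = random.Random(seed)
    perms = set()
    while len(perms) < set_size:
        perms.add(tuple(rng.sample(range(n_patches), n_patches)))
    return sorted(perms)


def split_into_patches(image, grid=3):
    """Split a square 2D list into grid*grid non-overlapping square patches."""
    side = len(image) // grid
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            rows = image[gr * side:(gr + 1) * side]
            patches.append([row[gc * side:(gc + 1) * side] for row in rows])
    return patches  # row-major order


def make_puzzle(image, perm_set, rng):
    """Return (shuffled_patches, label); label indexes into perm_set."""
    label = rng.randrange(len(perm_set))
    patches = split_into_patches(image)
    shuffled = [patches[i] for i in perm_set[label]]
    return shuffled, label


# Toy usage: a 6x6 "image" and a permutation subset with |P| = 100.
perm_set = sample_permutation_set(n_patches=9, set_size=100)
image = [[r * 6 + c for c in range(6)] for r in range(6)]
shuffled, label = make_puzzle(image, perm_set, random.Random(1))
```

The network would then receive the nine shuffled patches and be trained as a |P|-way classifier with `label` as the target.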
2. Axis 1: Scaling the Pretraining Data Size
- Models are trained on various subsets of the YFCC-100M dataset: YFCC-[1, 10, 50, 100] million images. The problem complexity (|P|=2000 for Jigsaw, K=10 for Colorization) is kept fixed. The pretrained models are then transferred to the VOC07 classification task.
- The performance of the Jigsaw model saturates (log-linearly) as the data scale is increased from 1M to 100M.
3. Axis 2: Scaling the Model Capacity
- An important observation is that the performance gap between AlexNet and ResNet-50 (as a function of the pre-training dataset size) keeps increasing.
This suggests that higher capacity models are needed to take full advantage of the larger pre-training datasets.
4. Axis 3: Scaling the Problem Complexity
- For Jigsaw, the number of permutations |P| is varied: [100, 701, 2k, 5k, 10k].
- For Colorization, the number of nearest neighbors K used for the soft-encoding is varied.
For the Jigsaw approach, we see an improvement in transfer learning performance as the size of the permutation set increases.
The Colorization approach appears to be less sensitive to changes in problem complexity.
- We see only about a 2-point mAP variation across different values of K.
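To make the role of K concrete, here is a small sketch of soft-encoding one ab value into a distribution over quantized bins, where K controls how many bins receive non-zero mass. Everything specific in it is an assumption for illustration: the 5x5 toy grid of bin centers (not the real 313 bins), the Gaussian distance weighting, the value of `sigma`, and the function names.

```python
import math


def make_toy_bins(step=10, lo=-20, hi=20):
    """Hypothetical toy grid of ab bin centers (25 bins, not the real 313)."""
    values = range(lo, hi + 1, step)
    return [(a, b) for a in values for b in values]


def soft_encode(ab, bin_centers, k=10, sigma=5.0):
    """Spread one ab value over its k nearest bins.

    Assumption: Gaussian weights on Euclidean distance, normalized to sum to 1.
    """
    dists = [math.dist(ab, c) for c in bin_centers]
    nearest = sorted(range(len(bin_centers)), key=dists.__getitem__)[:k]
    d_min_sq = dists[nearest[0]] ** 2
    # Subtracting the smallest squared distance keeps the weights numerically
    # stable; it cancels out after normalization.
    weights = [math.exp(-(dists[i] ** 2 - d_min_sq) / (2 * sigma ** 2))
               for i in nearest]
    total = sum(weights)
    target = [0.0] * len(bin_centers)
    for i, w in zip(nearest, weights):
        target[i] = w / total
    return target


# Larger K gives a target with more non-zero entries.
bins = make_toy_bins()
targets = {k: soft_encode((3.0, -4.0), bins, k=k) for k in (1, 5, 10)}
```

With K=1 the target collapses to a one-hot vector on the nearest bin; increasing K spreads the target over more bins, which is the knob varied along this axis.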
5. Putting It Together
Transfer learning performance increases along all three axes, i.e., increasing problem complexity still gives a performance boost on ResNet-50 even at the 100M data size.
6. Pre-training and Transfer Domain Relation
- On the VOC07 classification task, pre-training on ImageNet-22k (14M images) transfers as well as pretraining on YFCC-100M (100M images).
- However, on the Places205 classification task, pretraining on YFCC-1M (1M images) transfers as well as pre-training on ImageNet-22k (14M images).
The domain (image distribution) of ImageNet is closer to VOC07 (both are object-centric) whereas YFCC is closer to Places205 (both are scene-centric).
7.1. Image Classification
There is a significant accuracy gap between self-supervised and supervised methods despite the proposed scaling efforts.
7.2. Few-Shot Learning
The self-supervised features are competitive with their supervised counterparts in the few-shot setting on both datasets.
7.3. Object Detection
Self-supervised initialization is competitive with the ImageNet pre-trained initialization on the VOC07 dataset, even when fewer parameters are fine-tuned on the detection task.
With full fine-tuning, self-supervised model initialization matches the performance of the supervised initialization on both VOC07 and VOC07+12.
- The authors note that future work should focus on designing tasks that are complex enough to exploit large-scale data and increased model capacity. The experiments suggest that scaling self-supervision is crucial, but there is still a long way to go.
[2019 ICCV] [Goyal ICCV’19]
Scaling and Benchmarking Self-Supervised Visual Representation