Review — Revisiting Self-Supervised Visual Representation Learning
A Few Pretext Tasks Are Evaluated Under Different Network Settings
Revisiting Self-Supervised Visual Representation Learning
Kolesnikov CVPR’19, by Google Brain
2019 CVPR, Over 500 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Residual Network, ResNet
- A large number of pretext tasks for self-supervised learning have been studied in prior work, but many important aspects, such as the choice of CNN architecture, have not received equal attention.
- In this paper, a thorough, large-scale study is conducted to uncover multiple crucial insights, rather than to propose a novel self-supervised learning approach.
Outline
- Evaluation on ImageNet and Places205
- Comparison to Prior Work
- Other Findings
- Various pretext tasks are evaluated, such as Rotation (RotNet), Exemplar, RelPatchLoc, and Jigsaw; a minimal sketch of the Rotation task is given after this list.
- Various backbones are studied, such as ResNet v1 (ResNet), ResNet v2 (Pre-activation ResNet), RevNet (Reversible ResNet), VGG19-BN (VGG-19 with Batch Norm).
- (Please feel free to read them if interested.)
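For concreteness, here is a minimal sketch of the Rotation (RotNet) pretext task in PyTorch: each image is rotated by 0°, 90°, 180°, and 270°, and the network is trained to predict which rotation was applied. The backbone below is a hypothetical toy CNN standing in for the paper's models.

```python
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """Given a batch (B, C, H, W), return all four rotations
    (0, 90, 180, 270 degrees) plus the rotation labels 0..3."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

# Hypothetical backbone: any CNN mapping images to a feature vector.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 4)  # 4-way rotation classifier

images = torch.randn(8, 3, 32, 32)
x, y = make_rotation_batch(images)
loss = nn.CrossEntropyLoss()(head(backbone(x)), y)
loss.backward()
```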
1. Evaluation on ImageNet and Places205
- First, similar models often result in visual representations that have significantly different performance. Importantly, the ranking of architectures is not consistent across different methods, nor is the ranking of methods consistent across architectures.
- The second observation is that increasing the number of channels in CNN models (the widening factor) improves the performance of self-supervised models; a toy sketch of channel widening appears after this list.
- It is also observed that ranking of models evaluated on Places205 is consistent with that of models evaluated on ImageNet, indicating that the findings generalize to new datasets.
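As an assumed illustration of the widening factor, the toy residual block below multiplies every layer's channel count by k; the paper applies the same idea to full ResNet v1/v2 and RevNet stacks, not to this simplified block.

```python
import torch
import torch.nn as nn

class WideBasicBlock(nn.Module):
    """Toy residual block whose channel count scales with a widening factor k."""
    def __init__(self, in_ch, base_ch, k=1):
        super().__init__()
        ch = base_ch * k  # the widening factor multiplies every layer's channels
        self.conv1 = nn.Conv2d(in_ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.proj = nn.Conv2d(in_ch, ch, 1, bias=False) if in_ch != ch else nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.proj(x))

block = WideBasicBlock(in_ch=64, base_ch=64, k=4)  # 4x wider than the base block
out = block(torch.randn(1, 64, 56, 56))            # -> (1, 256, 56, 56)
```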
2. Comparison to Prior Work
- Surprisingly, by selecting the right architecture for each self-supervised task and increasing the widening factor, the resulting models significantly outperform previously reported results.
- The strongest model, trained with the Rotation (RotNet) pretext task, attains an unprecedentedly high accuracy of 55.4%. Similar observations hold when evaluating on Places205.
- These design choices almost halve the gap between previously published self-supervised results and fully supervised results on two standard benchmarks.
3. Other Findings
3.1. A Linear Model is Adequate for Evaluation
- The MLP (non-linear) provides only marginal improvement over the linear evaluation. It is concluded that the linear model is adequate for evaluation purposes.
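A minimal sketch of this linear evaluation protocol, assuming PyTorch and a placeholder `pretrained_backbone`: the pretrained network is frozen and only a linear classifier is trained on its features.

```python
import torch
import torch.nn as nn

# Hypothetical frozen backbone pretrained on a pretext task.
pretrained_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
for p in pretrained_backbone.parameters():
    p.requires_grad = False
pretrained_backbone.eval()

probe = nn.Linear(128, 1000)  # linear classifier on frozen features
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)

images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 1000, (8,))
with torch.no_grad():
    feats = pretrained_backbone(images)  # representations stay fixed
loss = nn.CrossEntropyLoss()(probe(feats), labels)
loss.backward()
optimizer.step()
```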
3.2. Better Performance on the Pretext Task Does NOT Always Translate to Better Representations
- The performance on the pretext task is plotted against the evaluation on ImageNet.
- For instance, according to pretext accuracy, the widest VGG model is the best one for Rotation, but it performs poorly on the downstream task.
3.3. Skip Connections Prevent Degradation of Representation Quality Towards the End of CNNs
- The VGG19-BN backbone, which has no skip connections, degrades when the representation is extracted from its deeper layers, unlike the residual backbones.
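To reproduce this kind of depth-wise comparison, features can be tapped at several layers and each probed with a linear classifier. Below is a sketch using torchvision's feature extractor on VGG19-BN; the node names are read off torchvision's module indices and are assumptions, not the paper's exact tap points.

```python
import torch
from torchvision.models import vgg19_bn
from torchvision.models.feature_extraction import create_feature_extractor

model = vgg19_bn(weights=None)  # in the paper, weights come from pretext training
# Tap the output of a mid and a deep stage (assumed node names).
extractor = create_feature_extractor(
    model, return_nodes={"features.25": "mid", "features.52": "deep"})

feats = extractor(torch.randn(1, 3, 224, 224))
for name, f in feats.items():
    # Each tensor would then be flattened and probed with a linear classifier.
    print(name, f.flatten(1).shape)
```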
3.4. Model Width and Representation Size Strongly Influence the Representation Quality
- The best Rotation model (RevNet50) is used to study this problem, training each model from scratch on ImageNet with the Rotation pretext task.
- In essence, it is possible to increase performance by increasing either model capacity or representation size, but increasing both jointly helps most. Notably, one can significantly boost the performance of a very thin model from 31% to 43% by increasing the representation size (see the sketch after this list).
- Increasing the widening factor consistently boosts performance in both the full- and low-data regimes.
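A toy sketch of this decoupling, assuming a pre-logits linear layer controls the size of the representation that the downstream probe sees, independently of how thin the feature extractor is (all sizes illustrative):

```python
import torch
import torch.nn as nn

class ThinModelWithWideRepresentation(nn.Module):
    """Toy illustration: a thin feature extractor followed by a linear layer
    that sets the representation size independently of the model's width."""
    def __init__(self, feat_dim=16, repr_dim=4096, num_rot_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.pre_logits = nn.Linear(feat_dim, repr_dim)  # representation-size knob
        self.head = nn.Linear(repr_dim, num_rot_classes)

    def forward(self, x):
        rep = self.pre_logits(self.features(x))  # this is what the linear probe sees
        return self.head(rep)

model = ThinModelWithWideRepresentation()
logits = model(torch.randn(2, 3, 32, 32))  # -> (2, 4)
```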
3.5. SGD for Training the Linear Model Takes a Long Time to Converge
- Surprisingly, it is observed that very long training (~500 epochs) results in higher accuracy.
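A sketch of such a long linear-probe schedule in PyTorch, with illustrative decay milestones (the paper's exact schedule may differ):

```python
import torch
import torch.nn as nn

probe = nn.Linear(2048, 1000)  # linear classifier on frozen features
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300, 400, 450], gamma=0.1)  # late decays, illustrative

feats = torch.randn(64, 2048)            # stand-in for precomputed features
labels = torch.randint(0, 1000, (64,))
for epoch in range(500):                 # ~500 epochs, per the observation above
    optimizer.zero_grad()
    loss = nn.CrossEntropyLoss()(probe(feats), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
```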
For self-supervised learning using pretext tasks, besides the choice of pretext task, there are still many factors affecting the accuracy, e.g.: network architecture and network width.
Reference
[2019 CVPR] [Kolesnikov CVPR’19]
Revisiting Self-Supervised Visual Representation Learning
Self-Supervised Learning
1993 … 2019 [Ye CVPR’19] [S⁴L] [Kolesnikov CVPR’19] 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] 2021 [MoCo v3] [SimSiam] [DINO]