Review — Revisiting Self-Supervised Visual Representation Learning

A Few Pretext Tasks Are Evaluated Across Different Network Settings

Sik-Ho Tsang
4 min read · Jul 9, 2022
Quality of visual representations learned by various self-supervised learning techniques significantly depends on the CNN architecture

Revisiting Self-Supervised Visual Representation Learning
Kolesnikov CVPR’19, by Google Brain
2019 CVPR, Over 500 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Residual Network, ResNet

  • A large number of pretext tasks for self-supervised learning have been studied in prior arts, but many important aspects, such as the choice of CNN architecture, have not received equal attention.
  • In this paper, a thorough large-scale study is conducted to uncover multiple crucial insights, rather than to propose a novel self-supervised learning approach.


  1. Evaluation on ImageNet and Places205
  2. Comparison to Prior Work
  3. Other findings

1. Evaluation on ImageNet and Places205

Different network architectures perform significantly differently across self-supervision tasks
Evaluation of representations from self-supervised techniques based on various CNN architectures and different widening factors
  • First, architecturally similar models often produce visual representations with significantly different performance. Importantly, neither the ranking of architectures is consistent across different methods, nor is the ranking of methods consistent across architectures.
  • The second observation is that increasing the number of channels in CNN models improves performance of self-supervised models.
  • It is also observed that ranking of models evaluated on Places205 is consistent with that of models evaluated on ImageNet, indicating that the findings generalize to new datasets.
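The capacity effect of the widening factor is easy to see in a quick parameter count: multiplying every channel count by k scales each convolution's weights by roughly k² (a toy calculation with illustrative channel counts, not the paper's exact architectures):

```python
def conv_params(c_in, c_out, k_h=3, k_w=3):
    """Weight count of a single conv layer (biases ignored)."""
    return k_h * k_w * c_in * c_out

# Toy channel counts; the widening factor k multiplies both
# the in-channels and the out-channels of every layer.
for k in (1, 2, 4):
    print(f"k={k}: {conv_params(64 * k, 128 * k):,} weights")
# Doubling k roughly quadruples per-layer parameters.
```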

2. Comparison to Prior Work

Comparison of the published self-supervised models to the proposed best models
  • Surprisingly, as a result of selecting the right architecture for each self-supervision and increasing the widening factor, the proposed models significantly outperform previously reported results.
  • The strongest model, using RotNet/Image Rotations, attains an unprecedentedly high accuracy of 55.4%. Similar observations hold when evaluating on Places205.
  • These design choices almost halve the gap between previously published self-supervised results and fully supervised results on two standard benchmarks.
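The Rotation pretext task behind the strongest model is simple to sketch: each image is rotated by 0°, 90°, 180°, and 270°, and the network is trained to predict which rotation was applied. A minimal version on a 2-D grid (standing in for an image tensor):

```python
def rot90(img):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rotation_batch(img):
    """The four rotated copies of one image with their pretext labels 0-3."""
    batch, cur = [], img
    for label in range(4):
        batch.append((cur, label))
        cur = rot90(cur)
    return batch

img = [[1, 2],
       [3, 4]]
for rotated, label in rotation_batch(img):
    print(label, rotated)
```

The classifier trained on these (image, label) pairs never sees human annotations; the rotation labels are free.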

3. Other findings

3.1. A Linear Model is Adequate for Evaluation

Comparing linear evaluation (dotted line) of the representations to non-linear (solid line) evaluation, i.e. training a multi-layer perceptron instead of a linear model.
  • The MLP (non-linear) provides only marginal improvement over the linear evaluation. It is concluded that the linear model is adequate for evaluation purposes.
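Linear evaluation freezes the learned representation and trains only a linear classifier on top. A toy sketch with synthetic 2-D "frozen features" (illustrative only; the paper trains logistic regression on real CNN representations):

```python
import math
import random

random.seed(0)

# Stand-ins for frozen CNN representations: 2-D features whose labels
# are linearly separable, so a linear probe suffices on this toy data.
feats = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x + y > 0 else 0 for x, y in feats]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Linear evaluation: logistic regression via SGD; the "backbone" stays frozen.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(50):
    for (x0, x1), y in zip(feats, labels):
        p = sigmoid(w[0] * x0 + w[1] * x1 + b)
        g = p - y                       # gradient of the logistic loss
        w[0] -= lr * g * x0
        w[1] -= lr * g * x1
        b -= lr * g

acc = sum((sigmoid(w[0] * x0 + w[1] * x1 + b) > 0.5) == (y == 1)
          for (x0, x1), y in zip(feats, labels)) / len(feats)
print(f"linear-probe accuracy: {acc:.2f}")
```

Swapping the linear head for an MLP is the non-linear variant; per the finding above, the extra capacity buys little.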

3.2. Better Performance on the Pretext Task Does NOT Always Translate to Better Representations

A look at how predictive pretext performance is to eventual downstream performance (Colors correspond to the architectures and circle size to the widening factor k.)
  • The performance on the pretext task is plotted against the evaluation on ImageNet.

For instance, according to pretext accuracy, the widest VGG model is the best one for Rotation, but it performs poorly on the downstream task.
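The mismatch is easy to express in code: the model that wins on pretext accuracy need not win downstream. The numbers below are hypothetical, chosen only to mirror the VGG example (not the paper's measurements):

```python
# Hypothetical accuracies for four models on the Rotation pretext task
# and on downstream ImageNet linear evaluation (illustrative numbers only).
models     = ["ResNet50", "RevNet50", "VGG19-BN (wide)", "ResNet50 (thin)"]
pretext    = [0.80, 0.85, 0.92, 0.74]   # the widest VGG wins the pretext...
downstream = [0.47, 0.53, 0.39, 0.44]   # ...but ranks last downstream

best_pretext = models[pretext.index(max(pretext))]
best_downstream = models[downstream.index(max(downstream))]
print(best_pretext, "vs", best_downstream)  # different winners
```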

3.3. Skip Connections Prevent Degradation of Representation Quality Towards the End of CNNs

Evaluating the representation from various depths within the network
  • The VGG19-BN backbone, which lacks skip connections, degrades when the representation is extracted from deeper layers.

3.4. Model Width and Representation Size Strongly Influence the Representation Quality

Disentangling the performance contribution of network widening factor versus representation size
  • The best Rotation model (RevNet50) is used to study this problem, training each model from scratch on ImageNet with the Rotation pretext task.
  • In essence, it is possible to increase performance by increasing either model capacity, or representation size, but increasing both jointly helps most. Notably, one can significantly boost performance of a very thin model from 31% to 43% by increasing representation size.
Low-data regimes
  • Increasing the widening factor consistently boosts performance in both the full- and low-data regimes.
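A toy dimension bookkeeping sketch of the two knobs (not the paper's exact network): the widening factor scales the last block's channel count, while the spatial pooling grid changes representation size independently of capacity.

```python
def representation_dim(width_k, channels_base=512, pool_grid=1):
    """Pre-logits representation size for a toy ResNet-style backbone:
    channels scale with the widening factor k; the spatial pooling grid
    changes representation size without touching model capacity."""
    return channels_base * width_k * pool_grid * pool_grid

print(representation_dim(1, pool_grid=1))  # thin model, small representation
print(representation_dim(1, pool_grid=2))  # same width, 4x larger representation
print(representation_dim(2, pool_grid=1))  # 2x width, 2x representation
```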

3.5. SGD for Training the Linear Model Takes a Long Time to Converge

Downstream task accuracy curve of the linear evaluation model trained with SGD on representations from the Rotation task
  • Surprisingly, very long training (~500 epochs) results in higher accuracy.
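Since accuracy keeps improving late into such a run, the learning-rate decays have to sit late as well. A hypothetical step-decay schedule for a ~500-epoch linear probe (milestones and rates are illustrative, not the paper's exact settings):

```python
def step_lr(epoch, base_lr=0.1, milestones=(300, 400), gamma=0.1):
    """Step-decay learning rate: drop by `gamma` at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Decays placed late in the run, since accuracy still improves after epoch 300.
for e in (0, 300, 400, 499):
    print(e, step_lr(e))
```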

For self-supervised learning using pretext tasks, besides the choice of pretext task itself, many other factors affect the accuracy, e.g. network architecture and network width.


[2019 CVPR] [Kolesnikov CVPR’19]
Revisiting Self-Supervised Visual Representation Learning

Self-Supervised Learning

1993 … 2019 [Ye CVPR’19] [S⁴L] [Kolesnikov CVPR’19] 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] 2021 [MoCo v3] [SimSiam] [DINO]



