Review — SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

SimCLR Outperforms PIRL, MoCo, CMC, CPCv2, CPC, etc.

Sik-Ho Tsang
5 min read · Mar 1, 2022
SimCLR: A simple framework for contrastive learning of visual representations

A Simple Framework for Contrastive Learning of Visual Representations
SimCLR, by Google Research, Brain Team
2020 ICML, Over 3600 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Unsupervised Learning, Contrastive Learning, Representation Learning, Image Classification, Object Detection

  • SimCLR, a Simple framework for Contrastive Learning of visual Representations, is proposed.
  • Recently proposed contrastive self-supervised learning algorithms are simplified, without requiring specialized architectures or a memory bank.
  • A few major components are systematically studied:
  1. Composition of data augmentations plays a crucial role.
  2. A learnable nonlinear transformation between the representation and the contrastive loss substantially improves the representation quality.
  3. Contrastive learning benefits from larger batch sizes and more training steps.
  • This is a paper from Prof. Hinton’s Group.

Outline

  1. SimCLR Framework
  2. SOTA Comparison

1. SimCLR Framework

SimCLR: A simple framework for contrastive learning of visual representations
  • SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space, as shown above.

1.1. Data Augmentation

  • A stochastic data augmentation module transforms any given data example randomly, resulting in two correlated views of the same example, denoted x̃i and x̃j, which are treated as a positive pair.
  • Three simple augmentations are applied sequentially: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur (see the code sketch at the end of this subsection).
Random Crop
  • Random cropping samples contrastive prediction tasks that include global-to-local view (B→A) or adjacent view (D→C) prediction.
Illustrations of the studied data augmentation operators
  • The above data augmentation operators are studied.
Linear evaluation (ImageNet top-1 accuracy) under individual or composition of data augmentations, applied only to one branch
  • No single transformation suffices to learn good representations.

The combination of random crop and color distortion is crucial to achieve a good performance.
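The augmentation module can be sketched in a few lines. Below is a minimal illustration using torchvision transforms; the exact jitter strengths, probabilities, and blur kernel size are illustrative assumptions rather than the paper's tuned settings.

```python
import torchvision.transforms as T

def simclr_augment(size=224, strength=1.0):
    # Random color distortion: color jitter (applied with prob. 0.8) + grayscale.
    color_jitter = T.ColorJitter(0.8 * strength, 0.8 * strength,
                                 0.8 * strength, 0.2 * strength)
    return T.Compose([
        T.RandomResizedCrop(size),                         # random crop, resize back
        T.RandomHorizontalFlip(),
        T.RandomApply([color_jitter], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # random Gaussian blur
        T.ToTensor(),
    ])

# Two independent draws from the same stochastic transform form the positive
# pair (x̃i, x̃j) for one image:
# transform = simclr_augment()
# x_i, x_j = transform(img), transform(img)
```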

1.2. Base Encoder

  • A neural network base encoder f(·) extracts representation vectors from augmented data examples.

ResNet is used to obtain hi = f(x̃i) = ResNet(x̃i), where hi ∈ ℝ^d is the output after the average pooling layer.

Linear evaluation of models with varied depth and width

Unsupervised learning benefits more from bigger models than its supervised counterpart.
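As a concrete illustration of the base encoder, the sketch below wraps a torchvision ResNet-50 and drops its classification head, so the forward pass returns the 2048-d vector after average pooling; this is an assumed PyTorch setup, not the authors' code.

```python
import torch.nn as nn
import torchvision.models as models

class BaseEncoder(nn.Module):
    """f(·): ResNet backbone whose output h is taken after average pooling."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50()                 # random init, no pretraining
        self.feature_dim = resnet.fc.in_features  # 2048 for ResNet-50
        resnet.fc = nn.Identity()                 # drop the supervised classifier
        self.backbone = resnet

    def forward(self, x):
        return self.backbone(x)                   # h = f(x̃), shape (N, 2048)
```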

1.3. Projection Head

  • A small neural network projection head g(·) maps representations to the space where the contrastive loss is applied.
  • An MLP with one hidden layer is used to obtain zi:
  zi = g(hi) = W(2) σ(W(1) hi)
  • where σ is a ReLU nonlinearity.
Linear evaluation of representations with different projection heads g() and various dimensions of z = g(h). h is 2048-dimensional

It is found that it is beneficial to define the contrastive loss on zi’s rather than hi’s.
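A minimal sketch of such a projection head in PyTorch is shown below; the 128-d output dimension is an assumed (commonly used) choice, not prescribed here.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """g(·): one-hidden-layer MLP, z = g(h) = W(2) σ(W(1) h)."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # W(1)
            nn.ReLU(),                       # σ
            nn.Linear(hidden_dim, out_dim),  # W(2)
        )

    def forward(self, h):
        return self.net(h)
```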

1.4. Contrastive Loss

  • A minibatch of N examples is randomly sampled.
Linear evaluation models (ResNet-50) trained with different batch size and epochs

A large batch size is beneficial, and so is longer training.

  • The contrastive prediction task is defined on pairs of augmented examples derived from the minibatch, resulting in 2N data points.
  • Given a positive pair, the other 2(N−1) augmented examples within the minibatch are treated as negative examples.
  • The loss function for a positive pair of examples (i, j) is defined as:
  ℓ(i, j) = −log( exp(sim(zi, zj)/τ) / Σ_{k=1..2N, k≠i} exp(sim(zi, zk)/τ) )
  • where sim(·,·) is the cosine similarity and τ is the temperature parameter.
  • The final loss is computed across all positive pairs, both (i, j) and (j, i), in a mini-batch.
  • (For NCE, please feel free to read NCE, Negative Sampling, CPC.)
  • (For temperature parameter, please feel free to read Distillation.)

It is named NT-Xent (the normalized temperature-scaled cross-entropy loss).

Negative loss functions and their gradients.
Linear evaluation (top-1) for models trained with different loss functions. “sh” means using semi-hard negative mining

Different loss functions are tried; NT-Xent is the best one.
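For clarity, here is a minimal PyTorch sketch of NT-Xent following the formula above, assuming z1 and z2 hold the projections of the two augmented views of the same N images; the implementation details are my own, not the authors' code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent over a minibatch of N positive pairs (2N augmented examples)."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # 2N unit-norm vectors
    sim = z @ z.t() / temperature                       # cosine similarities / τ
    sim.fill_diagonal_(float('-inf'))                   # exclude the k = i term
    # The positive of example i is i + N (and vice versa).
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    # Softmax cross entropy over the remaining 2N − 1 candidates, averaged over
    # both orderings (i, j) and (j, i), gives the NT-Xent loss.
    return F.cross_entropy(sim, targets)
```

With a large batch (the paper trains with batch sizes up to 8192), the 2(N−1) in-batch negatives make a memory bank unnecessary.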

2. SOTA Comparison

2.1. Linear Evaluation on ImageNet

ImageNet accuracies Against Number of Parameters
ImageNet accuracies of linear classifiers trained on representations

The best result obtained with ResNet-50 (4×) using SimCLR can match the supervised pretrained ResNet-50.

2.2. Few Labels Evaluation on ImageNet

ImageNet accuracy of models trained with few labels

Again, SimCLR significantly improves over state-of-the-art with both 1% and 10% of the labels.

2.3. Transfer Learning

Comparison of transfer learning performance
  • The ResNet-50 (4×) model is used.

When fine-tuned, SimCLR significantly outperforms the supervised baseline on 5 datasets, whereas the supervised baseline is superior on only 2 (i.e. Pets and Flowers).

  • On the remaining 5 datasets, the models are statistically tied.

There are appendices providing many details and other results, please feel free to read the paper directly if interested.

Later on, MoCo is extended to MoCo v2 based on SimCLR’s findings.

