Review — SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners

Semi-Supervised Learning Using Self-Supervised SimCLRv2, Outperforming Previous State-of-the-Art Semi-Supervised Methods

Sik-Ho Tsang
6 min read · Mar 22, 2022


Top-1 accuracy of previous SOTA methods and SimCLRv2 on ImageNet using only 1% or 10% of the labels

Big Self-Supervised Models are Strong Semi-Supervised Learners
SimCLRv2, by Google Research, Brain Team
2020 NeurIPS, Over 600 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Self-Supervised Learning, Unsupervised Learning, Contrastive Learning, Representation Learning, Image Classification

  • The proposed semi-supervised learning algorithm can be summarized in three steps:
  1. Unsupervised pretraining of a big model using SimCLRv2,
  2. Supervised fine-tuning on a few labeled examples, and
  3. Distillation with unlabeled examples for refining and transferring the task-specific knowledge.

Outline

  1. SimCLRv2 Semi-Supervised Learning Framework
  2. Step 1: Self-Supervised Pretraining with SimCLRv2
  3. Step 2: Supervised Fine-Tuning
  4. Step 3: Self-Training/Knowledge Distillation via Unlabeled Examples
  5. Experimental Results

1. SimCLRv2 Semi-Supervised Learning Framework

The proposed semi-supervised learning framework leverages unlabeled data in two ways: (1) task-agnostic use in unsupervised pretraining, and (2) task-specific use in self-training / distillation
  • The proposed semi-supervised learning framework leverages unlabeled data in both task-agnostic and task-specific ways.
  1. Unsupervised Pretraining: The first time the unlabeled data is used, it is in a task-agnostic way.
  2. Supervised Fine-Tuning: The general representations are then adapted for a specific task via supervised fine-tuning.
  3. Distillation: The second time the unlabeled data is used, it is in a task-specific way.

2. Step 1: Self-Supervised Pretraining with SimCLRv2

  • SimCLRv2 improves upon SimCLR in three major ways.
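The pretraining objective itself is still SimCLR's contrastive NT-Xent loss over two augmented views of each image. As a refresher, below is a minimal PyTorch sketch of that loss; the authors' implementation is in TensorFlow, and the batch size, embedding dimension, and temperature here are illustrative only.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Minimal NT-Xent loss over projected embeddings of two augmented views.

    z1, z2: [N, D] projections of two views of the same N images.
    """
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, D], unit-norm
    sim = torch.matmul(z, z.t()) / temperature           # [2N, 2N] scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # a view is never its own negative
    # The positive for row i is its other view: i+N (first half) or i-N (second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage with random "projections" standing in for real encoder outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2).item())
```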

2.1. Larger ResNet Models

  • Models used are deeper but less wide. The largest model used is a 152-layer ResNet with 3× wider channels and selective kernels (SK), a channel-wise attention mechanism that improves the parameter efficiency of the network (a rough sketch of an SK-style block follows this list).
  • By scaling up the model from ResNet-50 to ResNet-152 (3×+SK), a 29% relative improvement in top-1 accuracy is obtained when fine-tuned on 1% of labeled examples.
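The post only names SK in passing; as a rough orientation, a selective-kernel style block runs two convolutional branches with different receptive fields and mixes them with channel-wise soft attention. The PyTorch sketch below is a simplified illustration (branch choices, reduction ratio, and normalization are assumptions), not the exact SK-ResNet block used in the paper.

```python
import torch
import torch.nn as nn

class SelectiveKernelBlock(nn.Module):
    """Rough sketch of a selective-kernel (SK) style unit: two conv branches with
    different receptive fields, mixed by channel-wise soft attention. This is a
    simplified illustration, not the exact SK-ResNet block used in SimCLRv2."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(  # 5x5 receptive field via a dilated 3x3 conv
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        hidden = max(channels // reduction, 8)
        self.reduce = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        self.attend = nn.Linear(hidden, 2 * channels)  # one logit per branch per channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u3, u5 = self.branch3(x), self.branch5(x)
        fused = (u3 + u5).mean(dim=(2, 3))            # global average pool -> [N, C]
        logits = self.attend(self.reduce(fused))      # [N, 2C]
        attn = logits.view(-1, 2, u3.shape[1], 1, 1).softmax(dim=1)
        return attn[:, 0] * u3 + attn[:, 1] * u5      # channel-wise mixture of branches

x = torch.randn(2, 64, 32, 32)
print(SelectiveKernelBlock(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```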

2.2. Deeper Projection Head

  • The capacity of the non-linear network g(·) (a.k.a. projection head) is increased by making it deeper.
  • Furthermore, instead of throwing away g(·) entirely after pretraining as in SimCLR, SimCLRv2 fine-tunes from a middle layer of the projection head.
  • Compared to SimCLR's 2-layer projection head, SimCLRv2 uses a 3-layer projection head and fine-tunes from the 1st layer of the projection head, which results in as much as a 14% relative improvement in top-1 accuracy when fine-tuned on 1% of labeled examples (a minimal sketch of the two heads follows this list).
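To make the projection-head change concrete, here is a minimal PyTorch sketch contrasting a 2-layer (SimCLR-style) and a 3-layer (SimCLRv2-style) head; the 2048/128 dimensions and the batch-norm placement are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def projection_head(in_dim: int, hidden_dim: int, out_dim: int, num_layers: int) -> nn.Sequential:
    """Build an MLP projection head g(.) with `num_layers` linear layers.
    SimCLR used a 2-layer head that is discarded after pretraining; SimCLRv2
    uses a 3-layer head and later fine-tunes from the output of its 1st layer."""
    layers, dim = [], in_dim
    for i in range(num_layers):
        last = (i == num_layers - 1)
        width = out_dim if last else hidden_dim
        layers.append(nn.Linear(dim, width, bias=False))
        layers.append(nn.BatchNorm1d(width))   # batch-norm placement is an assumption here
        if not last:
            layers.append(nn.ReLU(inplace=True))
        dim = hidden_dim
    return nn.Sequential(*layers)

head_v1 = projection_head(2048, 2048, 128, num_layers=2)  # SimCLR-style
head_v2 = projection_head(2048, 2048, 128, num_layers=3)  # SimCLRv2-style

h = torch.randn(4, 2048)           # toy batch of encoder outputs
print(head_v2(h).shape)            # torch.Size([4, 128])
```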

2.3. Memory Mechanism from MoCo

  • A memory network (with a moving average of weights for stabilization), whose outputs are buffered as negative examples, is used (a minimal sketch of the EMA update and memory buffer follows this list).
  • The memory buffer is set to 64K. The exponential moving average (EMA) decay is set to 0.999.
  • This change yields an improvement of ~1% for linear evaluation as well as when fine-tuning on 1% of labeled examples.
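A minimal sketch of the two ingredients, assuming a MoCo-style setup with a momentum (EMA) copy of the encoder and a FIFO buffer of negatives; the linear encoder and feature sizes are toy stand-ins for the real ResNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(512, 128)                        # toy stand-in for the ResNet + head
ema_encoder = nn.Linear(512, 128)                    # momentum ("memory") copy of the encoder
ema_encoder.load_state_dict(encoder.state_dict())
for p in ema_encoder.parameters():
    p.requires_grad_(False)

queue = F.normalize(torch.randn(65536, 128), dim=1)  # 64K buffered negative examples

@torch.no_grad()
def ema_update(decay: float = 0.999) -> None:
    """Exponential moving average of the online encoder's weights."""
    for p_ema, p in zip(ema_encoder.parameters(), encoder.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

@torch.no_grad()
def enqueue(batch: torch.Tensor) -> None:
    """Buffer the EMA encoder's outputs as future negatives (FIFO queue)."""
    global queue
    keys = F.normalize(ema_encoder(batch), dim=1)
    queue = torch.cat([keys, queue], dim=0)[: queue.shape[0]]

x = torch.randn(32, 512)    # toy batch of inputs to the (linear) encoder
ema_update()
enqueue(x)
print(queue.shape)          # torch.Size([65536, 128])
```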

3. Step 2: Supervised Fine-Tuning

  • As mentioned above, in SimCLR, the MLP projection head g(·) is discarded entirely after pretraining.
  • In SimCLRv2, the model is instead fine-tuned from a middle layer of the projection head, rather than from the input layer of the projection head as in SimCLR (a minimal sketch of this setup follows this list).
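A minimal PyTorch sketch of this fine-tuning setup, with toy stand-ins for the pretrained encoder f(·) and the 3-layer head g(·); layer sizes and the flattened-image input are illustrative only.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the pretrained encoder f(.) and the 3-layer projection head g(.).
# Real SimCLRv2 uses a (wide) ResNet encoder; the sizes below are illustrative only.
encoder = nn.Linear(3 * 32 * 32, 2048)
g = nn.Sequential(
    nn.Linear(2048, 2048), nn.ReLU(inplace=True),   # 1st layer of g(.): kept when fine-tuning
    nn.Linear(2048, 2048), nn.ReLU(inplace=True),   # 2nd layer: discarded at fine-tuning time
    nn.Linear(2048, 128),                           # 3rd layer: discarded at fine-tuning time
)

# SimCLR fine-tunes from the input of g(.); SimCLRv2 instead keeps the 1st layer
# of g(.) and attaches the task-specific classifier on top of it.
num_classes = 1000
fine_tune_model = nn.Sequential(encoder, g[0], g[1], nn.Linear(2048, num_classes))

x = torch.randn(4, 3 * 32 * 32)    # toy flattened images
print(fine_tune_model(x).shape)    # torch.Size([4, 1000])
```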

4. Step 3: Self-Training/Knowledge Distillation via Unlabeled Examples

  • The fine-tuned network is used as a teacher to impute labels for training a student network.
  • Specifically, the following distillation loss is minimized, where no real labels are used:

$$\mathcal{L}^{\text{distill}} = - \sum_{x_i \in \mathcal{D}} \Big[ \sum_{y} P^T(y \mid x_i; \tau) \, \log P^S(y \mid x_i; \tau) \Big]$$

  • where the class probabilities are temperature-scaled softmax outputs of the task head:

$$P(y \mid x_i; \tau) = \frac{\exp\!\big(f^{\text{task}}(x_i)[y] / \tau\big)}{\sum_{y'} \exp\!\big(f^{\text{task}}(x_i)[y'] / \tau\big)}$$

  • and τ is a scalar temperature parameter.
  • The teacher network, which produces $P^T(y \mid x_i)$, is fixed during distillation. Only the student network, which produces $P^S(y \mid x_i)$, is trained.
  • In this paper, authors focus on using unlabeled examples only.
  • One can also combine the distillation loss with ground-truth labeled examples using a weighted combination (a code sketch of both variants follows this list):

$$\mathcal{L} = -(1 - \alpha) \sum_{(x_i, y_i) \in \mathcal{D}^{L}} \log P^S(y_i \mid x_i) \; - \; \alpha \sum_{x_i \in \mathcal{D}} \Big[ \sum_{y} P^T(y \mid x_i; \tau) \, \log P^S(y \mid x_i; \tau) \Big]$$
  • This procedure can be performed using students either with the same model architecture (self-distillation), which further improves the task-specific performance, or with a smaller model architecture, which leads to a compact model.
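A minimal PyTorch sketch of the distillation objective, with a frozen teacher, a toy linear student operating on pre-extracted features, and the optional weighted combination with labeled cross-entropy (the weight α and all sizes are illustrative, not the paper's settings).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(teacher_logits: torch.Tensor,
                      student_logits: torch.Tensor,
                      tau: float = 1.0) -> torch.Tensor:
    """Cross-entropy between temperature-scaled teacher and student distributions."""
    p_teacher = F.softmax(teacher_logits / tau, dim=1)           # P^T(y|x; tau)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)   # log P^S(y|x; tau)
    return -(p_teacher * log_p_student).sum(dim=1).mean()

# Toy frozen teacher and trainable student on pre-extracted 512-d features.
teacher, student = nn.Linear(512, 1000), nn.Linear(512, 1000)
for p in teacher.parameters():
    p.requires_grad_(False)

# Distillation on unlabeled examples only (no real labels used).
x_unlabeled = torch.randn(32, 512)
with torch.no_grad():
    t_logits = teacher(x_unlabeled)
loss = distillation_loss(t_logits, student(x_unlabeled))

# Optional weighted combination with ground-truth labeled examples.
x_labeled, y = torch.randn(8, 512), torch.randint(0, 1000, (8,))
alpha = 0.5  # illustrative weighting, not the paper's setting
loss_combined = ((1 - alpha) * F.cross_entropy(student(x_labeled), y)
                 + alpha * distillation_loss(teacher(x_labeled), student(x_labeled)))
print(loss.item(), loss_combined.item())
```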

5. Experimental Results

5.1. Bigger Models are More Label-Efficient

Top-1 accuracy of fine-tuning SimCLRv2 models (on varied label fractions) or training a linear classifier on the representations. (The supervised baselines are trained from scratch using all labels in 90 epochs.)
  • ResNet models are trained by varying width and depth, as well as whether or not to use selective kernels (SK). The smallest model is the standard ResNet-50, and the biggest model is ResNet-152 (3×+SK).

Increasing width and depth, as well as using SK, all improve the performance.

  • But ResNet-152 (3×+SK) is only marginally better than ResNet-152 (2×+SK), though the parameter size is almost doubled, suggesting that the benefits of width may have plateaued.
Top-1 accuracy for supervised vs semi-supervised (SimCLRv2 fine-tuned) models of varied sizes on different label fractions
  • These results show that bigger models are more label-efficient for both supervised and semi-supervised learning, but gains appear to be larger for semi-supervised learning.

5.2. Bigger/Deeper Projection Heads Improve Representation Learning

Top-1 accuracy via fine-tuning under different projection head settings & label fractions (using ResNet-50)
  • Wider ResNets also have wider projection heads.
  • Using a deeper projection head during pretraining is better when fine-tuning from the optimal layer of projection head (Figure a), and this optimal layer is typically the first layer of projection head rather than the input (0th layer), especially when fine-tuning on fewer labeled examples (Figure b).

5.3. Using Unlabeled Data Improves Semi-Supervised Learning

Top-1 accuracy of a ResNet-50 trained on different types of targets
  • The above table demonstrates the importance of using unlabeled examples when training with the distillation loss.
  • Furthermore, using the distillation loss alone works almost as well as balancing distillation and label losses.
Top-1 accuracy of distilled SimCLRv2 models compared to the fine-tuned models as well as supervised learning with all labels
  • When the student model has a smaller architecture than the teacher model, distillation improves model efficiency by transferring the task-specific knowledge into a compact student model.
  • Even when the student model has the same architecture as the teacher model (excluding the projection head after encoder), self-distillation can still meaningfully improve the semi-supervised learning performance.

5.4. SOTA Comparison

Accuracy of models trained under semi-supervised settings (1% or 10% of the labels) on ImageNet

The proposed approach greatly improves upon previous state-of-the-art semi-supervised results, for both small and big model variants.

Impact

The authors also discuss the broader impact of semi-supervised learning at the end of the paper:

  • In medical applications where acquiring high-quality labels requires careful annotation by clinicians, better semi-supervised learning approaches can potentially help save lives.
  • Applications of computer vision to agriculture can increase crop yields, which may help to improve the availability of food. … etc.

Reference

[2020 NeurIPS] [SimCLRv2] Big Self-Supervised Models are Strong Semi-Supervised Learners

