Review — S⁴L: Self-Supervised Semi-Supervised Learning

S⁴L: Combines Self-Supervised Approach and Semi-Supervised Approach

Sik-Ho Tsang
4 min read · May 19, 2022

S⁴L: Self-Supervised Semi-Supervised Learning, by Google Research, Brain Team
2019 ICCV, Over 400 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Semi-Supervised Learning, Image Classification

  • By unifying self-supervised learning and semi-supervised learning, the framework of self-supervised semi-supervised learning (S⁴L) is proposed.
  • Mix Of All Models (MOAM) is further proposed by using several techniques together.


  1. Proposed S⁴L
  2. Experimental Results

1. Proposed S⁴L

A schematic illustration of one of the proposed self-supervised semi-supervised techniques: S⁴L-Rotation

1.1. Overall

  • The learning algorithm has access to a labeled training set Dl, which is sampled i.i.d. from p(X, Y) and an unlabeled training set Du, which is sampled i.i.d. from the marginal distribution p(X).
  • The minibatches drawn from Dl and Du are chosen to be of equal size.

The semi-supervised methods have a learning objective of the form:

min_θ Ll(Dl, θ) + w·Lu(Du, θ)

where Ll is a standard cross-entropy classification loss over all labeled images in the dataset, and Lu is a loss defined on the unlabeled images.

  • w is a non-negative scalar weight (w=1 is used here), and θ denotes the model parameters.

With a self-supervised loss, S⁴L can choose whether to include the labeled minibatch xl in that loss, i.e. whether to apply Lself to the union of xu and xl.

  • So, there are Ll, Lu, and Lself losses.
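As a rough illustration of the objective above, the sketch below computes Ll as the mean cross-entropy over a labeled minibatch and adds a weighted unsupervised term. The function names (`softmax`, `cross_entropy`, `s4l_objective`) are my own, and the unsupervised loss is passed in as a plain number because its exact form depends on the chosen self-supervised task.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

def s4l_objective(labeled_logits, labels, unsup_loss, w=1.0):
    """Ll + w * Lu, with Ll averaged over the labeled minibatch.

    `unsup_loss` is an already-computed value of Lu (assumption: any
    self-supervised or semi-supervised loss can be plugged in here).
    """
    ll = sum(cross_entropy(z, y) for z, y in zip(labeled_logits, labels)) / len(labels)
    return ll + w * unsup_loss
```

With w=1, both terms contribute equally to the total; a larger w would put more emphasis on the unlabeled data.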

1.1. S⁴L-Rotation for Self-Supervised Learning

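  • The key idea of rotation self-supervision (RotNet) is to rotate an input image and then predict the rotation degree:

Lrot = (1/|R|) Σ_{r∈R} Σ_{x∈Du} L(fθ(x^r), r)

  • where R is the set of the 4 rotation degrees {0°, 90°, 180°, 270°}, x^r is the image x rotated by r, fθ is the model, and L is the cross-entropy loss. This results in a 4-class classification problem.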
  • The self-supervised loss is also applied to the labeled images in each minibatch.
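To make the rotation pretext task concrete, here is a minimal sketch of how the four rotated copies and their 4-class labels could be generated. Images are represented as plain 2D lists, and the helper names (`rotate90`, `make_rotation_batch`) as well as the clockwise direction are my own assumptions, not details from the paper.

```python
ROTATIONS = (0, 90, 180, 270)  # class index i corresponds to ROTATIONS[i] degrees

def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def make_rotation_batch(images):
    """Return (rotated_images, rotation_labels) with four copies per input image.

    Labels are the indices 0..3 into ROTATIONS, so a classifier trained on
    this batch solves a 4-class problem.
    """
    rotated, labels = [], []
    for image in images:
        current = [list(row) for row in image]
        for label in range(len(ROTATIONS)):
            rotated.append(current)
            labels.append(label)
            current = rotate90(current)
    return rotated, labels
```

Since the labels come for free from the transformation itself, the same function can be applied to both unlabeled and labeled images.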

1.2. S⁴L-Exemplar for Self-Supervised Learning

  • The idea of Exemplar self-supervision is to learn a visual representation that is invariant to a wide range of image transformations (“Inception” cropping, random horizontal mirroring, and HSV-space color randomization). These transformations are used to produce 8 different instances of each image in a minibatch.
  • Lu is implemented as the batch hard triplet loss with a soft margin. This encourages transformations of the same image to have similar representations. Lu is applied to all eight instances of each image.
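The batch hard triplet loss with a soft margin can be sketched as follows. For each anchor, the farthest embedding of the same source image (hardest positive) and the closest embedding of a different image (hardest negative) are picked, and the hinge is replaced by a softplus. This is a minimal pure-Python sketch under my own assumptions (Euclidean distance, mean over anchors, hypothetical function names), not the exact implementation used in the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def softplus(x):
    """Numerically stable log(1 + exp(x)), the 'soft margin'."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def batch_hard_soft_margin_triplet(embeddings, image_ids):
    """Mean over anchors of softplus(hardest_positive - hardest_negative).

    `image_ids[i]` identifies which source image embedding i was augmented from.
    Anchors without any positive or any negative in the batch are skipped.
    """
    losses = []
    for i, (anchor, anchor_id) in enumerate(zip(embeddings, image_ids)):
        positives = [euclidean(anchor, e)
                     for j, (e, k) in enumerate(zip(embeddings, image_ids))
                     if k == anchor_id and j != i]
        negatives = [euclidean(anchor, e)
                     for e, k in zip(embeddings, image_ids) if k != anchor_id]
        if positives and negatives:
            losses.append(softplus(max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is small when augmented views of the same image cluster together and far from views of other images, and it grows when the grouping is mixed up.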

1.3. Semi-Supervised Baselines

  • Virtual Adversarial Training (VAT), Conditional Entropy Minimization (EntMin), and Pseudo-Label (PL) are considered.
  • (Please feel free to read them if interested.)
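Two of these baselines are simple enough to sketch in a few lines: the conditional entropy term used by EntMin, and the selection of confident predictions used by Pseudo-Label. The function names and the 0.9 confidence threshold are illustrative assumptions of mine. VAT is not shown, since it needs gradients with respect to the input.

```python
import math

def conditional_entropy(probabilities):
    """Mean entropy of the predicted class distributions on unlabeled images.

    Minimizing this pushes the model towards confident predictions.
    """
    total = 0.0
    for probs in probabilities:
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(probabilities)

def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (example_index, predicted_class) for confident predictions only.

    `threshold` is a hypothetical value chosen for illustration.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((index, best))
    return selected
```

A uniform prediction gives the maximum entropy, while a one-hot prediction gives zero, so the entropy term acts as a regularizer on the unlabeled minibatch.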

2. Experimental Results

2.1. ImageNet

Top-5 accuracy (%) obtained by individual methods when training them on ILSVRC-2012 with a subset of labels.
  • The proposed way of doing self-supervised semi-supervised learning is indeed effective for the two self-supervision methods that are used. It is hypothesized that such approaches can be designed for other self-supervision objectives.
Comparing our MOAM to previous methods in the literature on ILSVRC-2012 with 10% of the labels

Mix Of All Models (MOAM): First, S⁴L-Rotation and VAT+EntMin are combined to learn a 4× wider model. Then this model is used to generate Pseudo-Labels (PL) for a second training step, followed by a final fine-tuning step.

  • Step 1) Rotation+VAT+EntMin: In the first step, the model jointly optimizes the S⁴L-Rotation loss and the VAT and EntMin losses.
  • Step 2) Retraining on Pseudo-Labels (PL): Using the above model, pseudo labels are assigned to the full dataset, and the model is retrained on them.
  • Step 3) Fine-tuning: The model is fine-tuned.
  • The final model “MOAM (full)” achieves 91.23% top-5 accuracy, which sets the new state-of-the-art, outperforming previous methods such as UDA and CPCv2.
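The three MOAM steps above can be outlined as a small pipeline skeleton. Everything here is hypothetical scaffolding: the callables `train_step1`, `predict`, `retrain`, and `fine_tune` stand in for real training code, and the choices to continue from the step-1 model when retraining and to fine-tune on the labeled subset are my assumptions, not details stated in this review.

```python
def moam_pipeline(train_step1, predict, retrain, fine_tune, labeled, unlabeled):
    """Run the three MOAM stages using caller-supplied (hypothetical) training functions.

    labeled:   list of (input, label) pairs
    unlabeled: list of inputs
    """
    # Step 1: jointly optimize the S4L-Rotation, VAT and EntMin losses.
    model = train_step1(labeled, unlabeled)

    # Step 2: assign pseudo labels to the full dataset, then retrain on them.
    all_inputs = [x for x, _ in labeled] + list(unlabeled)
    pseudo_labeled = [(x, predict(model, x)) for x in all_inputs]
    model = retrain(model, pseudo_labeled)

    # Step 3: fine-tune (assumption: on the originally labeled examples).
    return fine_tune(model, labeled)
```

Passing the stages in as functions keeps the outline independent of any particular framework.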

Interestingly, MOAM achieves promising results even in the high-data regime with 100% labels, outperforming the fully supervised baseline: +0.87% for top-5 accuracy and +1.6% for top-1 accuracy.

2.2. Places205

Places205 learning curves of logistic regression on top of the features learned by pre-training
  • Self-supervision methods are typically evaluated in terms of how generally useful their learned representation is. This is done by treating the learned model as a fixed feature extractor, and training a linear logistic regression model on top of the features it extracts on a different dataset: Places205.
  • As can be seen, with S⁴L features the logistic regression is able to find a good separating hyperplane in very few epochs and then plateaus, whereas with purely self-supervised features it struggles for a large number of epochs.

This indicates that the addition of labeled data leads to much more separable representations, even across datasets.
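The linear evaluation protocol described above can be sketched as follows: the features are treated as fixed inputs, and only a logistic regression on top of them is trained. This toy version is binary and uses full-batch gradient descent, while the Places205 evaluation is multi-class; the function names, learning rate, and epoch count are my own illustrative choices.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(features, labels, epochs=500, lr=0.5):
    """Fit binary logistic regression on frozen features with gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def accuracy(weights, bias, features, labels):
    """Fraction of examples on the correct side of the learned hyperplane."""
    correct = 0
    for x, y in zip(features, labels):
        score = sum(w * v for w, v in zip(weights, x)) + bias
        correct += int((1 if score >= 0 else 0) == y)
    return correct / len(labels)
```

The more linearly separable the frozen features are, the fewer epochs such a probe needs before its accuracy plateaus, which is the behavior the learning curves compare.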


[2019 ICCV] [S⁴L]
S⁴L: Self-Supervised Semi-Supervised Learning

Pretraining or Weakly/Semi-Supervised Learning

2004 … 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] 2020 [BiT] [Noisy Student] [SimCLRv2]

Unsupervised/Self-Supervised Learning

1993 … 2019 [Ye CVPR’19] [S⁴L] 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] 2021 [MoCo v3] [SimSiam]
