Review — S⁴L: Self-Supervised Semi-Supervised Learning
S⁴L: Combining the Self-Supervised and Semi-Supervised Approaches
S⁴L: Self-Supervised Semi-Supervised Learning
S⁴L, by Google Research, Brain Team
2019 ICCV, Over 400 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Semi-Supervised Learning, Image Classification
- By unifying self-supervised learning and semi-supervised learning, the framework of self-supervised semi-supervised learning (S⁴L) is proposed.
- Mix Of All Models (MOAM) is further proposed, which combines several of these techniques.
Outline
- Proposed S⁴L
- Experimental Results
1. Proposed S⁴L
1.1. Overall
- The learning algorithm has access to a labeled training set Dl, which is sampled i.i.d. from p(X, Y) and an unlabeled training set Du, which is sampled i.i.d. from the marginal distribution p(X).
- The minibatch sizes for Dl and Du are chosen to be equal.
The semi-supervised methods have a learning objective of the following form, minimized over the model parameters θ:
Ll(Dl, θ) + w·Lu(Du, θ),
where Ll is the standard cross-entropy classification loss over all labeled images in the dataset, and Lu is a loss defined on the unlabeled images.
- w is a non-negative scalar weight (here w = 1).
- For self-supervised learning, S⁴L can choose whether to also include the labeled minibatch xl in the self-supervised loss, i.e. apply Lself to the union of xu and xl.
- So, there are the Ll, Lu, and Lself losses (see the sketch below).
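Below is a minimal PyTorch-style sketch of this combined objective, assuming a labeled minibatch (x_l, y_l), an unlabeled minibatch x_u of the same size, and a generic callable self_sup_loss for the self-supervised term; all names are illustrative and not the paper's code.

```python
import torch
import torch.nn.functional as F

def s4l_step(model, x_l, y_l, x_u, self_sup_loss, w=1.0):
    # One S4L training step (sketch): total loss = Ll + w * Lu.
    # x_l, y_l: labeled minibatch from Dl; x_u: unlabeled minibatch from Du.
    logits_l = model(x_l)
    loss_l = F.cross_entropy(logits_l, y_l)               # supervised loss Ll

    # Self-supervised loss Lu, here applied to the union of x_u and x_l,
    # i.e. the labeled images are also included in the self-supervised term.
    loss_u = self_sup_loss(model, torch.cat([x_u, x_l], dim=0))

    return loss_l + w * loss_u                             # non-negative weight w (w = 1 above)
```

Since both terms share the same parameters θ, the returned scalar can be backpropagated with a single optimizer step.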
1.2. S⁴L-Rotation for Self-Supervised Learning
- The key idea of Rotation self-supervision (RotNet) is to rotate an input image and then predict the rotation angle: the self-supervised loss is the cross-entropy of predicting the applied rotation r from the rotated image, averaged over the rotations r in R,
- where R is the set of the 4 rotation degrees {0, 90, 180, 270}, which results in a 4-class classification problem.
- The self-supervised loss is also applied to the labeled images in each minibatch.
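A minimal sketch of such a rotation loss is given below, assuming the model exposes a feature extractor and a separate 4-way rotation head; model.features and model.rot_head are hypothetical names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def rotation_loss(model, x):
    # S4L-Rotation sketch: rotate each image by 0/90/180/270 degrees and
    # predict which rotation was applied (4-class cross-entropy).
    losses = []
    for k in range(4):                                    # k quarter-turns
        x_rot = torch.rot90(x, k, dims=(2, 3))            # rotate in the (H, W) plane
        logits = model.rot_head(model.features(x_rot))    # hypothetical 4-way head
        target = torch.full((x.size(0),), k, dtype=torch.long, device=x.device)
        losses.append(F.cross_entropy(logits, target))
    return sum(losses) / 4.0                              # average over the 4 rotations
```

Because the loss is also applied to labeled images, the same function can simply be called on the concatenated labeled and unlabeled batch.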
1.3. S⁴L-Exemplar for Self-Supervised Learning
- The idea of Exemplar self-supervision is to learn a visual representation that is invariant to a wide range of image transformations. Here, “Inception” cropping, random horizontal mirroring, and HSV-space color randomization are used to produce 8 different instances of each image in a minibatch.
- Lu is implemented as the batch hard triplet loss with a soft margin, which encourages transformed versions of the same image to have similar representations. Lu is applied to all eight instances of each image.
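The sketch below shows one way such a batch-hard triplet loss with a soft margin could look, assuming the input batch already holds n_views consecutive augmentations of each source image and that model.features returns an embedding; both assumptions are illustrative.

```python
import torch
import torch.nn.functional as F

def exemplar_triplet_loss(model, x, n_views=8):
    # S4L-Exemplar sketch: embeddings of the n_views augmentations of the same
    # image should be closer to each other than to any other image's embeddings.
    emb = F.normalize(model.features(x), dim=1)           # (B * n_views, D) embeddings
    ids = torch.arange(x.size(0) // n_views, device=x.device).repeat_interleave(n_views)

    dist = torch.cdist(emb, emb)                          # pairwise Euclidean distances
    same = ids.unsqueeze(0) == ids.unsqueeze(1)           # views of the same source image
    eye = torch.eye(len(ids), dtype=torch.bool, device=x.device)

    # Batch-hard mining: farthest positive and closest negative per anchor.
    d_pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values

    # Soft margin: log(1 + exp(d_pos - d_neg)) instead of a hard hinge.
    return F.softplus(d_pos - d_neg).mean()
```

The soft margin replaces the usual hinge with the smooth softplus function, so no margin hyperparameter is needed.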
1.4. Semi-Supervised Baselines
- Virtual Adversarial Training (VAT), Conditional Entropy Minimization (EntMin), and Pseudo-Label (PL) are considered.
- (Please feel free to click to read them if interested.)
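As a quick illustration of the simpler of these objectives, here is a hedged sketch of EntMin and of generating hard pseudo-labels; VAT requires an additional adversarial-perturbation step and is omitted here, and the confidence threshold is an illustrative choice, not a value from the paper.

```python
import torch
import torch.nn.functional as F

def entmin_loss(model, x_u):
    # EntMin sketch: penalize the entropy of predictions on unlabeled images,
    # pushing the model towards confident (low-entropy) outputs.
    probs = F.softmax(model(x_u), dim=1)
    return -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()

def pseudo_labels(model, x_u, threshold=0.9):
    # Pseudo-Label (PL) sketch: keep the arg-max class of confident predictions
    # as a hard label for further training.
    with torch.no_grad():
        probs = F.softmax(model(x_u), dim=1)
        conf, labels = probs.max(dim=1)
    keep = conf > threshold
    return x_u[keep], labels[keep]
```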
2. Experimental Results
2.1. ImageNet
- The proposed self-supervised semi-supervised learning approach is indeed effective for the two self-supervision methods used here. It is hypothesized that similar approaches can be designed for other self-supervision objectives.
Mix Of All Models (MOAM): First, S⁴L-Rotation and VAT+EntMin are combined to learn a 4× wider model. This model is then used to generate Pseudo-Labels (PL) for a second training step, followed by a final fine-tuning step.
- Step 1) Rotation+VAT+EntMin: In the first step, the model jointly optimizes the S⁴L-Rotation loss and the VAT and EntMin losses.
- Step 2) Retraining on Pseudo-Labels (PL): Using the above model, pseudo-labels are assigned to the full dataset and the model is retrained on them (see the sketch at the end of this subsection).
- Step 3) Fine-tuning: The retrained model is then fine-tuned in a final step.
- The final model “MOAM (full)” achieves 91.23% top-5 accuracy, setting a new state of the art and outperforming methods such as UDA and CPCv2.
- Interestingly, MOAM achieves promising results even in the high-data regime with 100% labels, outperforming the fully supervised baseline by +0.87% top-5 accuracy and +1.6% top-1 accuracy.
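For concreteness, the pseudo-labeling step (Step 2) referenced above could look like the sketch below: the Step-1 model labels the full dataset, and the resulting pairs are used for retraining with the standard cross-entropy loss before the final fine-tuning step. The loader and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset

def build_pseudo_label_set(model, unlabeled_loader, device="cpu"):
    # MOAM Step 2 sketch: assign pseudo-labels to the full dataset using the
    # model obtained in Step 1 (Rotation + VAT + EntMin).
    model.eval()
    images, labels = [], []
    with torch.no_grad():
        for x in unlabeled_loader:                        # loader yields image tensors
            probs = F.softmax(model(x.to(device)), dim=1)
            images.append(x.cpu())
            labels.append(probs.argmax(dim=1).cpu())
    # Retraining (and afterwards the final fine-tuning step) uses this dataset
    # with the usual cross-entropy classification loss.
    return TensorDataset(torch.cat(images), torch.cat(labels))
```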
2.2. Places205
- Self-supervision methods are typically evaluated in terms of how generally useful their learned representations are. This is done by treating the learned model as a fixed feature extractor and training a linear logistic regression model on top of the features it extracts on a different dataset: Places205.
- As can be seen, the logistic regression trained on the S⁴L features is able to find a good separating hyperplane within very few epochs and then plateaus, whereas in the purely self-supervised case it struggles for many more epochs.
- This indicates that the addition of labeled data leads to much more separable representations, even across datasets.
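A minimal sketch of this linear-evaluation protocol is given below: the pretrained network is frozen and only a logistic-regression (linear) head, e.g. a single 205-way linear layer for Places205, is trained on its features; frozen_model.features is an illustrative attribute name.

```python
import torch
import torch.nn.functional as F

def linear_eval_step(frozen_model, linear_head, optimizer, x, y):
    # Linear-evaluation sketch: features are extracted without gradients, so only
    # the logistic-regression head is updated.
    with torch.no_grad():
        feats = frozen_model.features(x)      # fixed feature extractor
    logits = linear_head(feats)               # e.g. a 205-way linear layer
    loss = F.cross_entropy(logits, y)         # multinomial logistic regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```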
Reference
[2019 ICCV] [S⁴L]
S⁴L: Self-Supervised Semi-Supervised Learning
Pretraining or Weakly/Semi-Supervised Learning
2004 … 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] 2020 [BiT] [Noisy Student] [SimCLRv2]
Unsupervised/Self-Supervised Learning
1993 … 2019 [Ye CVPR’19] [S⁴L] 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] 2021 [MoCo v3] [SimSiam]