# Review — SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification

## SimPLE, Outperforms FixMatch, ReMixMatch, MixMatch


**SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification**. SimPLE, by University of Southern California. 2021 CVPR, Over 20 Citations (Sik-Ho Tsang @ Medium)

Semi-Supervised Learning, Image Classification, Pseudo-Label

- The relationship between high confidence unlabeled data that are similar to each other is less studied.
- A new loss, **Pair Loss**, is proposed to **minimize the statistical distance between high confidence pseudo labels with similarity above a certain threshold**.

# Outline

1. **SimPLE**
2. **Experimental Results**
3. **Ablation Study**

# 1. SimPLE

## 1.1. Problem Description

- In an ***L*-class classification** setting, let *X*=((*xb*,*yb*); *b*∈(1, …, *B*)) be **a batch of labeled data**, and *U*=(*ub*; *b*∈(1, …, *B*)) be **a batch of unlabeled data.**
- Let *pmodel*(˜*y*|*x*;*θ*) denote the **model’s predicted softmax class probability** of input *x* parameterized by weight *θ*.

## 1.2. Augmentation Strategy

- **Augmentation Anchoring** is used, as in ReMixMatch and FixMatch.
- Pseudo labels that come from weakly augmented samples act as the “anchor”, and the predictions on strongly augmented samples are aligned to the “anchor”.
- The **weak augmentation** is a **random cropping followed by a random horizontal flip.**
- The **strong augmentation** is **RandAugment**, or a **fixed augmentation strategy**, which contains **difficult transformations** such as **random affine and color jitter**. This method can adapt to high-intensity augmentation very quickly. Thus, the magnitude is simply fixed to the highest value possible.
- (More details when describing the loss function.)
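The weak augmentation described above can be sketched in a few lines. This is only an illustration on a tiny grayscale image stored as nested lists; the zero-padding before the crop and the `pad` parameter are assumptions made here, not details given in this review.

```python
import random


def weak_augment(image, pad=2, rng=random):
    """Random crop (after zero-padding) followed by a random horizontal flip.

    `image` is a list of rows (H x W). `pad` is a hypothetical padding size
    chosen for this sketch.
    """
    h, w = len(image), len(image[0])
    # Zero-pad every side so that the crop can shift the content around.
    padded = [[0] * (w + 2 * pad) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * (w + 2 * pad) for _ in range(pad)]
    # Pick a random top-left corner and crop back to the original size.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    # Flip horizontally with probability 0.5.
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop
```

Calling `weak_augment` several times on the same image yields the *K* weakly augmented views used for pseudo-labeling; the strong augmentation (RandAugment or the fixed policy) is not sketched here.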

## 1.3. Pseudo-Labeling

- **Label guessing technique,** as in MixMatch, is used. **The average of the model’s predictions of several (*K*) weakly augmented versions of the same unlabeled sample** is used as its pseudo label. The guessed pseudo label should be more stable.
- **Sharpening operation** defined in MixMatch is used to lower the temperature *T* of the label’s distribution, i.e. Sharpen(*p*, *T*)*i* = *pi*^(1/*T*) / Σ*j* *pj*^(1/*T*).
- As the peak of the **pseudo label’s distribution is “sharpened”**, the network will **push this sample further away from the decision boundary**.
- In addition, the **exponential moving average (EMA)** of the model at each time step is used to **guess the labels**.
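Label guessing and sharpening can be sketched as follows. This is a minimal illustration using plain Python lists as probability vectors; the function names and the default temperature of 0.5 are assumptions for this example, and the model predictions are passed in directly rather than computed by an EMA model.

```python
def average_predictions(predictions):
    """Average K probability vectors predicted for K weak augmentations."""
    k = len(predictions)
    num_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / k for c in range(num_classes)]


def sharpen(probs, temperature=0.5):
    """MixMatch-style sharpening: p_i^(1/T) / sum_j p_j^(1/T).

    A temperature below 1 makes the distribution more peaked.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def guess_label(predictions, temperature=0.5):
    """Pseudo label = sharpened average of the K predictions."""
    return sharpen(average_predictions(predictions), temperature)
```

For example, averaging `[0.6, 0.3, 0.1]` and `[0.8, 0.1, 0.1]` gives roughly `[0.7, 0.2, 0.1]`, and sharpening with *T* = 0.5 moves most of the remaining mass onto the first class.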

## 1.4. Loss Functions

- The loss consists of three terms: the **supervised loss** *LX*, the **unsupervised loss** *LU*, and the **Pair Loss** *LP*:

*L* = *LX* + *λU* *LU* + *λP* *LP*

- where *λU* and *λP* take **different values for different datasets**.

## 1.4.1. Supervised Loss *LX*

*LX* calculates the **cross-entropy of weakly augmented labeled samples**.
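As a small sketch of this term, the mean cross-entropy over a labeled batch can be written as below. The one-hot label format, the `eps` guard against `log(0)`, and averaging over the batch are assumptions of this example.

```python
import math


def cross_entropy(one_hot_label, probs, eps=1e-12):
    """H(y, p) = -sum_c y_c * log(p_c); eps avoids log(0)."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot_label, probs))


def supervised_loss(labels, predictions):
    """Mean cross-entropy between labels and the model's softmax outputs
    on weakly augmented labeled samples."""
    losses = [cross_entropy(y, p) for y, p in zip(labels, predictions)]
    return sum(losses) / len(losses)
```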

## 1.4.2. Unsupervised Loss *LU*

*LU* represents the ***L*2 distance between strongly augmented samples and their pseudo labels**, filtered by the confidence threshold *τc*: only pseudo labels whose highest entry exceeds *τc* contribute.

But *LU* only enforces **consistency among different perturbations of the same samples**, **NOT consistency among different samples**.
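A minimal sketch of the confidence-filtered *L*2 term is shown below. The default `tau_c=0.95` and the normalization (dividing by the number of classes and the number of strongly augmented samples) are assumptions of this example and should be checked against the paper.

```python
def unsupervised_loss(pseudo_labels, strong_predictions, tau_c=0.95):
    """Confidence-masked squared L2 distance to the pseudo labels.

    pseudo_labels[b] is the guessed label q_b of unlabeled sample b.
    strong_predictions[b] is a list of prediction vectors, one per strong
    augmentation of the same sample b.
    """
    total, count = 0.0, 0
    for q, preds in zip(pseudo_labels, strong_predictions):
        confident = max(q) > tau_c  # low-confidence pseudo labels are masked out
        for p in preds:
            count += 1
            if confident:
                total += sum((qi - pi) ** 2 for qi, pi in zip(q, p))
    num_classes = len(pseudo_labels[0])
    return total / (num_classes * count) if count else 0.0
```

Note that each pseudo label is compared only with predictions on augmentations of its own sample, which is exactly the limitation that the Pair Loss addresses.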

## 1.4.3. Pair Loss *LP*

Pair Loss *LP* aims to exploit the relationship among unlabeled samples, which **allows information to propagate implicitly between different unlabeled samples**.

- In Pair Loss, **a high confidence pseudo label** of an unlabeled point, *p*, is used as an **“anchor”**.
- **All unlabeled samples** whose pseudo labels are **similar enough to** *p* need to **align their predictions under severe perturbation to the “anchor”.**
- The way to do that is as follows:

- Given **a pseudo label** *ql* (red), which is a probability vector representing the guessed class distribution, **if the highest entry in** *ql* **surpasses the confidence threshold** *τc*, *ql* will become an **“anchor”**.
- Then, **for any pseudo label and image tuple** *qr* (light blue) and *vr* (dark blue), if the **overlapping proportion (i.e. similarity) between** *ql* and *qr* **is greater than the similarity threshold** *τs*, this tuple (*qr*, *vr*) will **contribute toward the Pair Loss** by pushing the model’s prediction of a strongly augmented version of *vr* to the “anchor” *ql* (green arrow).
- During this process, if either threshold cannot be satisfied, *ql*, *qr*, *vr* will be rejected.
- The **Pair Loss** accumulates this contribution over all accepted pairs, where *φt*(*x*) is a hard threshold function controlled by *t*, and *fsim*(*p*, *q*) measures the **similarity between two probability vectors** *p*, *q* by the Bhattacharyya coefficient: *fsim*(*p*, *q*) = Σ*i* √(*pi* · *qi*).
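The pair selection described above can be sketched as follows. Several details are assumptions made for this example rather than facts stated in this review: the hard threshold is taken as φ_t(x) = x if x > t else 0, the distance is taken as 1 minus the Bhattacharyya coefficient, the loss is averaged over all ordered pairs, and the defaults `tau_c=0.95` and `tau_s=0.9` are illustrative.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya coefficient: sum_i sqrt(p_i * q_i), in [0, 1]."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def hard_threshold(x, t):
    """Assumed form of phi_t(x): keep x when it exceeds t, else 0."""
    return x if x > t else 0.0


def pair_loss(pseudo_labels, strong_predictions, tau_c=0.95, tau_s=0.9):
    """pseudo_labels[i] is q_i; strong_predictions[i] is the model's
    prediction on a strongly augmented version of sample i."""
    n = len(pseudo_labels)
    total, pairs = 0.0, 0
    for i in range(n):        # candidate anchor q_l
        for j in range(n):    # candidate partner (q_r, v_r)
            if i == j:
                continue
            pairs += 1
            q_l, q_r = pseudo_labels[i], pseudo_labels[j]
            conf = hard_threshold(max(q_l), tau_c)
            sim = hard_threshold(bhattacharyya(q_l, q_r), tau_s)
            if conf == 0.0 or sim == 0.0:
                continue      # the pair is rejected by one of the thresholds
            # Push the prediction on v_r toward the anchor q_l.
            dist = 1.0 - bhattacharyya(q_l, strong_predictions[j])
            total += conf * sim * dist
    return total / pairs if pairs else 0.0
```

A pair contributes only when the anchor is confident and the two pseudo labels overlap strongly, so the gradient flows from one unlabeled sample to the prediction on a different one.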

## 1.5. Overall SimPLE Algorithm

- During **testing**, SimPLE uses the **exponential moving average (EMA) of the weights of the model to make predictions**, as done in MixMatch.
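The EMA of the weights can be sketched as a single update applied after every training step. The flat list of floats standing in for the model parameters and the default `decay=0.999` are assumptions of this example.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return the new EMA weights: decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

At test time (and for label guessing), predictions would then be made with the EMA copy of the weights instead of the raw trained weights.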

# 2. Experimental Results

## 2.1. CIFAR-100

- WRN 28–2 and WRN 28–8 are used. *λU*=150, *λP*=150.
- SimPLE has **significant improvement** on CIFAR-100. With a larger backbone, SimPLE still provides improvements over baseline methods.

SimPLE is better than FixMatch by 0.7% and takes only 4.7 hours of training for convergence, while FixMatch takes about 8 hours to converge.

## 2.2. CIFAR-10 and SVHN

- WRN 28–2 is used. For CIFAR-10, *λU*=75 and *λP*=75. For SVHN, *λU*=*λP*=250.
- SimPLE is on par with ReMixMatch and FixMatch.
- ReMixMatch, FixMatch, and SimPLE are **very close to the fully supervised baseline** with less than 1% difference in test accuracy.

SimPLE is less effective on these domains because the leftover samples are difficult ones whose pseudo labels are not similar to any of the high confidence pseudo labels.

Pair Loss does not bring as much performance gain here as it does on the more complicated datasets.

## 2.3. Mini-ImageNet

- WRN 28–2 and ResNet-18 are used.
- In general, **SimPLE outperforms all other methods by a large margin on Mini-ImageNet regardless of backbones.** SimPLE scales with the more challenging dataset.

## 2.4. DomainNet-Real to Mini-ImageNet

- WRN 28–2 is used.
- The supervised baseline only uses labeled training data and parameter EMA for evaluation. All transfer experiments use fixed augmentations.

The pre-trained models are on par with training from scratch but converge 5 ∼ 100 times faster.

Under the transfer setting, SimPLE is 7.57% better than MixMatch and 9.9% better than the supervised baseline.

## 2.5. ImageNet-1K to DomainNet-Real

- ResNet-50 is used.
- On DomainNet-Real, MixMatch is about 7% lower than the supervised baseline, while **SimPLE has 8% higher accuracy than the baseline**.
- MixMatch Enhanced, despite having Augmentation Anchoring, does not outperform MixMatch.

It is clear that SimPLE performs well in the pre-trained setting and surpasses MixMatch and supervised baselines by a large margin.

- It is noted that **the pre-trained models might have domain bias that is not easy to overcome**!

# 3. Ablation Study

- WRN 28–2 is used.
- **Pair Loss** significantly improves the performance.
- **With a more diverse augmentation policy** or increasing the number of augmentations, **the advantage of the Pair Loss is enhanced**.
- Also, **SimPLE is robust to threshold change.** One possible explanation for the robustness is that since a pair must pass both thresholds to contribute to the loss, changing one of them may not significantly affect the overall number of pairs that pass both thresholds.

## Reference

[2021 CVPR] [SimPLE]

SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification

## Semi-Supervised Learning

**2004** [Entropy Minimization, EntMin] **2013** [Pseudo-Label (PL)] **2015** [Ladder Network, Γ-Model] **2016** [Sajjadi NIPS’16] **2017** [Mean Teacher] [PATE & PATE-G] [Π-Model, Temporal Ensembling] **2018** [WSL] [Oliver NeurIPS’18] **2019** [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] **2020** [BiT] [Noisy Student] [SimCLRv2] [UDA] [ReMixMatch] [FixMatch] **2021** [Curriculum Labeling (CL)] [Su CVPR’21] [SimPLE]