Review — SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification
SimPLE, Outperforms FixMatch, ReMixMatch, MixMatch
SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification
SimPLE, by University of Southern California
2021 CVPR, Over 20 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification, Pseudo-Label
- The relationship between high-confidence unlabeled samples that are similar to each other is less studied.
- A new loss, Pair Loss, is proposed to minimize the statistical distance between high confidence pseudo labels with similarity above a certain threshold.
Outline
- SimPLE
- Experimental Results
- Ablation Study
1. SimPLE
1.1. Problem Description
- In an L-class classification setting, let X=((xb, yb); b∈(1, …, B)) be a batch of labeled data, and U=(ub; b∈(1, …, B)) be a batch of unlabeled data.
- Let pmodel(˜y|x;θ) denote the model’s predicted softmax class probability of input x parameterized by weight θ.
1.2. Augmentation Strategy
- Augmentation Anchoring is used, as also used in ReMixMatch and FixMatch.
- Pseudo labels that come from weakly augmented samples act as the “anchor”, and the predictions for the strongly augmented samples are aligned to that “anchor”.
- The weak augmentation is a random cropping followed by a random horizontal flip.
- The strong augmentation is either RandAugment or a fixed augmentation strategy that contains difficult transformations such as random affine and color jitter. The method adapts to high-intensity augmentation very quickly, so the augmentation magnitude is simply fixed to the highest possible value.
- (More details when describing the loss function.)
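The weak augmentation described above (a random crop followed by a random horizontal flip) can be sketched in plain Python. This is only an illustration on a single-channel image stored as nested lists. The function name `weak_augment`, the zero-padding scheme, and the `pad` size are assumptions made for this sketch and are not taken from the paper.

```python
import random


def weak_augment(image, pad=2, rng=random):
    """Random crop (after zero-padding) followed by a random horizontal flip.

    image: list of rows, each row a list of pixel values (single channel).
    pad:   hypothetical padding size, not a value taken from the paper.
    rng:   the random module or a random.Random instance.
    """
    h, w = len(image), len(image[0])
    # Zero-pad so that a crop of the original size can be shifted around.
    padded = [[0] * (w + 2 * pad) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * (w + 2 * pad) for _ in range(pad)]
    # Pick a random top-left corner and crop back to (h, w).
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    # Flip horizontally with probability 0.5.
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop
```

Calling `weak_augment` K times on the same unlabeled image would give the K weakly augmented views used for label guessing in the next subsection.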
1.3. Pseudo-Labeling
- The label guessing technique from MixMatch is used.
- The average of the model’s predictions of several (K) weakly augmented versions of the same unlabeled sample is used as its pseudo label. The guessed pseudo label should be more stable.
- The sharpening operation defined in MixMatch is then applied to lower the temperature of the label’s distribution: Sharpen(p, T)i = pi^(1/T) / Σj pj^(1/T), where T is the temperature hyperparameter.
- As the peak of the pseudo label’s distribution is “sharpened”, the network will push this sample further away from the decision boundary.
- In addition, the exponential moving average (EMA) of the model at each time step is used to guess the labels.
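A minimal sketch of the label guessing and sharpening steps, assuming the model’s softmax outputs for the K weak augmentations are already available as plain lists. The function names and the temperature value of 0.5 are illustrative assumptions. In SimPLE the predictions would come from the EMA model.

```python
def average_predictions(preds):
    """Average K probability vectors (one per weakly augmented view)."""
    k = len(preds)
    num_classes = len(preds[0])
    return [sum(p[i] for p in preds) / k for i in range(num_classes)]


def sharpen(p, temperature=0.5):
    """MixMatch-style sharpening: raise each entry to 1/T, then renormalize."""
    powered = [x ** (1.0 / temperature) for x in p]
    total = sum(powered)
    return [x / total for x in powered]


# Two weak views of one unlabeled sample, three classes (made-up numbers).
preds = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
q = sharpen(average_predictions(preds), temperature=0.5)
```

With a temperature below 1, the largest entry of the averaged prediction grows and the smaller ones shrink, which is the “sharpened” peak mentioned above.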
1.4. Loss Functions
- The loss consists of three terms, the supervised loss LX, the unsupervised loss LU, and the Pair Loss LP: L = LX + λU·LU + λP·LP
- where the weights λU and λP take different values for different datasets.
1.4.1. Supervised Loss LX
- LX calculates the cross-entropy of weakly augmented labeled samples.
1.4.2. Unsupervised Loss LU
- LU represents the L2 distance between the predictions for strongly augmented samples and their pseudo labels, filtered by a confidence threshold τc: only samples whose pseudo label has its highest entry above τc contribute.
LU only enforces consistency among different perturbations of the same samples but NOT consistency among different samples.
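The confidence filtering in LU can be illustrated with a small sketch. It assumes pseudo labels and strong-augmentation predictions are given as lists of probability vectors. The normalization (dividing by the number of samples times the number of classes) and the default τc value are assumptions for this example and may differ from the paper’s exact formula.

```python
def unsupervised_loss(pseudo_labels, strong_preds, tau_c=0.95):
    """Confidence-masked squared L2 distance between pseudo labels and
    predictions on strongly augmented samples.

    The averaging over samples and classes is an assumed normalization.
    """
    num_classes = len(pseudo_labels[0])
    total = 0.0
    for q, p in zip(pseudo_labels, strong_preds):
        # Only pseudo labels whose highest entry exceeds tau_c contribute.
        if max(q) > tau_c:
            total += sum((qi - pi) ** 2 for qi, pi in zip(q, p))
    return total / (num_classes * len(pseudo_labels))


# The first sample is confident enough, the second is filtered out.
loss = unsupervised_loss(
    pseudo_labels=[[0.96, 0.04], [0.6, 0.4]],
    strong_preds=[[0.9, 0.1], [0.1, 0.9]],
)
```

Each term compares a sample only with its own pseudo label, which is why LU by itself does not relate different samples to each other.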
1.4.3. Pair Loss LP
Pair Loss aims to exploit the relationship among unlabeled samples, which allows information to propagate implicitly between different unlabeled samples.
- In Pair Loss, a high confidence pseudo label of an unlabeled point, p, is used as an “anchor”.
- All unlabeled samples whose pseudo labels are similar enough to p need to align their predictions under severe perturbation to the “anchor”.
- The way to do that is as follows:
- Given a pseudo label ql (red) which is a probability vector representing the guessed class distribution, if the highest entry in ql surpasses the confidence threshold τc, ql will become an “anchor”.
- Then, for any pseudo label and image tuple qr (light blue) and vr (dark blue), if the overlapping proportion (i.e. similarity) between ql and qr is greater than the similarity threshold τs, this tuple (qr, vr) will contribute toward the Pair Loss by pushing the model’s prediction of a strongly augmented version of vr to the “anchor” ql (green arrow).
- During this process, if either threshold cannot be satisfied, ql, qr, vr will be rejected.
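- The Pair Loss sums, over pairs of unlabeled samples, the statistical distance between the anchor ql and the model’s prediction on the strongly augmented vr, gated by φτc(max(ql)) and φτs(fsim(ql, qr)),
- where φt(x) is a hard threshold function controlled by t.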
- fsim(p, q) measures the similarity between two probability vectors p and q by the Bhattacharyya coefficient: fsim(p, q) = Σi √(pi·qi).
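The Pair Loss mechanism can be sketched as follows. Several details are assumptions made for illustration rather than the paper’s definitions: the hard threshold is taken to pass a value through when it exceeds t and return 0 otherwise, the statistical distance is taken to be one minus the Bhattacharyya coefficient, and the sum is averaged over all ordered pairs. The function names are hypothetical.

```python
import math


def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two probability vectors."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def hard_threshold(x, t):
    """Assumed form of the hard threshold: keep x if it exceeds t, else 0."""
    return x if x > t else 0.0


def pair_loss(pseudo_labels, strong_preds, tau_c=0.95, tau_s=0.9):
    """Align the strong-augmentation prediction of sample r with anchor l
    whenever the anchor is confident and the two pseudo labels are similar."""
    n = len(pseudo_labels)
    total, num_pairs = 0.0, 0
    for l in range(n):
        for r in range(n):
            if l == r:
                continue
            num_pairs += 1
            q_l, q_r = pseudo_labels[l], pseudo_labels[r]
            conf = hard_threshold(max(q_l), tau_c)               # anchor test
            sim = hard_threshold(bhattacharyya(q_l, q_r), tau_s)  # similarity test
            # Assumed distance: 1 - Bhattacharyya coefficient.
            dist = 1.0 - bhattacharyya(q_l, strong_preds[r])
            total += conf * sim * dist
    return total / num_pairs if num_pairs else 0.0
```

A pair contributes only when both gates are non-zero, which matches the rule that a tuple is rejected if either threshold is not satisfied.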
1.5. Overall SimPLE Algorithm
- During testing, SimPLE uses the exponential moving average (EMA) of the model weights to make predictions, as done in MixMatch.
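The EMA of the weights can be sketched as a simple running update applied after every training step. The flat list of weights, the function name `ema_update`, and the decay value are assumptions for illustration. A real model would apply the same rule to each parameter tensor.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step: move the averaged weights slightly toward the current ones."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Hypothetical two-parameter model after one training step.
ema = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

At test time, predictions would be made with the averaged weights rather than with the most recent ones.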
2. Experimental Results
2.1. CIFAR-100
- WRN 28–2 and WRN 28–8 are used. λU=150, λP=150.
- SimPLE improves significantly on CIFAR-100. With a larger backbone, SimPLE still provides improvements over baseline methods.
SimPLE is better than FixMatch by 0.7% and takes only 4.7 hours of training for convergence, while FixMatch takes about 8 hours to converge.
2.2. CIFAR-10 and SVHN
- WRN 28–2 is used. For CIFAR-10, λU=75 and λP=75. For SVHN, λU=λP=250.
- SimPLE is on par with ReMixMatch and FixMatch.
- ReMixMatch, FixMatch, and SimPLE are very close to the fully supervised baseline with less than 1% difference in test accuracy.
SimPLE is less effective on these domains because the leftover samples are difficult ones whose pseudo labels are not similar to any of the high confidence pseudo labels.
Pair Loss does not bring much performance gain as it does in the more complicated datasets.
2.3. Mini-ImageNet
- WRN 28–2 and ResNet-18 are used.
- In general, SimPLE outperforms all other methods by a large margin on Mini-ImageNet regardless of backbone. SimPLE scales to the more challenging dataset.
2.4. DomainNet-Real to Mini-ImageNet
- WRN 28–2 is used.
- The supervised baseline only uses labeled training data and parameter EMA for evaluation. All transfer experiments use fixed augmentations.
The pre-trained models are on par with training from scratch but converge 5 ∼ 100 times faster.
Under transfer setting, SimPLE is 7.57% better than MixMatch and 9.9% better than the supervised baseline.
2.5. ImageNet-1K to DomainNet-Real
- ResNet-50 is used.
- On DomainNet-Real, MixMatch is about 7% lower than the supervised baseline, while SimPLE has 8% higher accuracy than the baseline.
- MixMatch Enhanced, despite having Augmentation Anchoring, does not outperform MixMatch.
It is clear that SimPLE performs well in the pre-trained setting and surpasses MixMatch and the supervised baselines by a large margin.
- It is noted that the pre-trained models might have domain bias that is not easy to overcome!
3. Ablation Study
- WRN 28–2 is used.
- Pair Loss significantly improves the performance.
- With a more diverse augmentation policy or increasing the number of augmentations, the advantage of the Pair Loss is enhanced.
- Also, SimPLE is robust to threshold change. One possible explanation for the robustness is that since a pair must pass both thresholds to contribute to the loss, changing one of them may not significantly affect the overall number of pairs that pass both thresholds.
Reference
[2021 CVPR] [SimPLE]
SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification
Semi-Supervised Learning
2004 [Entropy Minimization, EntMin] 2013 [Pseudo-Label (PL)] 2015 [Ladder Network, Γ-Model] 2016 [Sajjadi NIPS’16] 2017 [Mean Teacher] [PATE & PATE-G] [Π-Model, Temporal Ensembling] 2018 [WSL] [Oliver NeurIPS’18] 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] 2020 [BiT] [Noisy Student] [SimCLRv2] [UDA] [ReMixMatch] [FixMatch] 2021 [Curriculum Labeling (CL)] [Su CVPR’21] [SimPLE]