Review — Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning
Curriculum Labeling (CL), By Restarting Model Parameters Before Each Self-Training Cycle, Outperforms Pseudo Labeling (PL)
Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning, Curriculum Labeling, by University of Virginia
2021 AAAI, Over 40 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification, Pseudo Label
- Pseudo Labeling (PL) works by applying pseudo-labels to samples in the unlabeled set for model training in a self-training cycle.
- In Curriculum Labeling (CL), the curriculum learning principle is applied, and concept drift is avoided by restarting the model parameters before each self-training cycle.
Outline
- Pseudo Labeling (PL) Brief Review
- Proposed Curriculum Labeling (CL)
- Experimental Results
- Ablation Study
1. Pseudo Labeling (PL) Brief Review
- Pseudo-Labels are target classes for unlabeled data, used as if they were true labels. For each unlabeled sample, the class with the maximum predicted probability under the network is picked (see the sketch after this list).
- Pseudo-Label is used in a fine-tuning phase with Dropout. The pre-trained network is trained in a supervised fashion with labeled and unlabeled data simultaneously.
- (Please feel free to read Pseudo Label story if interested.)
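As an illustration of this argmax rule, here is a minimal PyTorch-style sketch; model, unlabeled_loader, and device are assumed names for this illustration, not the authors' code.

```python
import torch

@torch.no_grad()
def assign_pseudo_labels(model, unlabeled_loader, device="cuda"):
    # For each unlabeled batch, pick the class with the maximum predicted
    # probability as the pseudo-label, and keep that probability as confidence.
    model.eval()
    labels, confidences = [], []
    for x in unlabeled_loader:
        probs = torch.softmax(model(x.to(device)), dim=1)
        conf, label = probs.max(dim=1)
        labels.append(label.cpu())
        confidences.append(conf.cpu())
    return torch.cat(labels), torch.cat(confidences)
```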
2. Proposed Curriculum Labeling (CL)
2.1. Framework
- The model is trained on the labeled samples.
- Then this model is used to predict and assign pseudo-labels for the unlabeled samples.
- The distribution of the prediction scores is used to select a subset of pseudo-labeled samples.
- A new model is re-trained with the labeled and pseudo-labeled samples.
- This process is repeated by re-labeling unlabeled samples using this new model. The process stops when all samples in the dataset have been used during training.
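A minimal sketch of this loop, assuming build_model, train, predict, and select_top are user-supplied helpers (illustrative names, not the paper's API):

```python
def curriculum_labeling(labeled_data, unlabeled_data,
                        build_model, train, predict, select_top):
    pseudo_set = []                                      # pseudo-labeled subset, empty at the start
    for r in (0, 20, 40, 60, 80, 100):                   # fraction of unlabeled data included so far
        model = build_model()                            # restart model parameters before each cycle
        train(model, labeled_data + pseudo_set)          # supervised training on labeled + pseudo-labeled data
        if r == 100:                                     # stop once all unlabeled samples have been used
            break
        scored = [(x, *predict(model, x)) for x in unlabeled_data]  # (sample, pseudo_label, confidence)
        pseudo_set = select_top(scored, percent=r + 20)  # keep the most confident (r + 20)% for the next cycle
    return model
```

Restarting the model via build_model() in every cycle, rather than finetuning the previous one, is the design choice evaluated in Section 4.2 below.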
2.2. Details
- To be specific, percentile scores are used to decide which samples to add. The paper's algorithm gives the full pipeline of the approach, where Percentile(X, Tr) returns the value of the r-th percentile of X. Values of r from 0% to 100% are used in increments of 20%.
- The repeating process terminates when the pseudo-labeled set comprises the entire set of training samples (r = 100%).
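A sketch of this percentile-based selection, which would play the role of the select_top helper in the previous sketch. Here confidences is assumed to hold each unlabeled sample's maximum softmax score, and the cut is taken so that roughly the top r% most confident samples are kept (the exact percentile convention is an assumption):

```python
import numpy as np

def select_by_percentile(samples, pseudo_labels, confidences, r):
    # Threshold at the (100 - r)-th percentile of the confidence scores,
    # so that roughly the top r% most confident pseudo-labels are kept.
    confidences = np.asarray(confidences)
    threshold = np.percentile(confidences, 100 - r)
    keep = confidences >= threshold
    return [(x, y) for x, y, k in zip(samples, pseudo_labels, keep) if k]
```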
2.3. Loss
- The data consists of N labeled examples (Xi, Yi) and M unlabeled examples Xj. Let H be a set of hypotheses hθ, each denoting a function mapping X to Y.
- Let Lθ(Xi) denote the loss for a given example Xi. To choose the best predictor with the lowest possible error, the formulation can be explained with a regularized Empirical Risk Minimization (ERM) framework.
- Below, L(θ) is defined as the pseudo-labeling regularized empirical loss, where CEE indicates the cross-entropy loss.
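A hedged reconstruction of this loss from the definitions above; the balancing weight λ and the pseudo-labels Ŷj = argmax of hθ(Xj) are assumptions consistent with the standard pseudo-labeling formulation:

```latex
L(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \mathrm{CEE}\!\big(Y_i,\, h_\theta(X_i)\big)
\;+\; \lambda\,\frac{1}{M}\sum_{j=1}^{M} \mathrm{CEE}\!\big(\hat{Y}_j,\, h_\theta(X_j)\big),
\qquad \hat{Y}_j = \arg\max_k \, h_\theta(X_j)_k
```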
3. Experimental Results
3.1. SOTA Comparison
- CNN-13 (as in All-CNN) and WideResNet-28 (as in WRN, depth 28, width 2) are used for CIFAR-10 and SVHN.
- Data augmentation applies transformations in an entirely random fashion, which is referred to as Random Augmentation (RA).
CL surprisingly surpasses previous pseudo-labeling based methods and consistency regularization methods on CIFAR-10.
- On SVHN, CL obtains competitive test error compared with all previous methods that rely on moderate, moderate-to-high, or heavy data augmentation.
- A common practice to test SSL algorithms is to vary the size of the labeled set, using 50, 100, and 200 samples per class.
CL does not drastically degrade when dealing with smaller labeled sets.
- ResNet-50 is used on ImageNet.
- 10%/90% of the training split is used as labeled/unlabeled data.
On ImageNet, CL achieves results competitive with the state of the art, with scores very close to the current top-performing method, UDA, on both top-1 and top-5 accuracy.
3.2. Realistic Evaluation with Out-of-Distribution Unlabeled Samples
- In a more realistic SSL setting of Oliver NeurIPS’18, the unlabeled data may not share the same class set as the labeled data.
- The experiment is reproduced by synthetically varying the class overlap on CIFAR-10, choosing only the animal classes to perform the classification (bird, cat, deer, dog, frog, horse).
CL is robust to out-of-distribution classes, while the performance of previous methods drops significantly. It is conjectured that the proposed self-pacing curriculum is key to this scenario, where the adaptive thresholding scheme could help filter the out-of-distribution unlabeled samples during training.
4. Ablation Study
4.1. Effectiveness of Curriculum Labeling
- Different data augmentation settings, i.e. mixup and SWA, are used when applying vanilla pseudo-labeling with no curriculum and without a specific threshold (i.e. a threshold of 0.0).
Only when heavy data augmentation is used for Pseudo Labeling is the approach able to match the proposed curriculum design without any data augmentation.
- Fixed thresholds for including pseudo-labeled unlabeled data, as used in Pseudo Labeling (PL), are also tried.
The proposed curriculum design yields a significant gain over the traditional pseudo-labeling approach that uses a fixed threshold, even when heavy data augmentation is applied.
- Only the most confident samples are re-labeled in CL, with the confidence thresholds set to 0.9 and 0.9995.
As seen, using handpicked thresholds is sub-optimal.
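For contrast with the percentile scheme of Section 2.2, a fixed handpicked threshold would replace the adaptive cut with a constant; a minimal sketch, with 0.9 shown as one of the values tried and the helper name being illustrative:

```python
def select_by_fixed_threshold(samples, pseudo_labels, confidences, tau=0.9):
    # Fixed, handpicked confidence threshold (0.9 or 0.9995 in the ablation),
    # instead of the adaptive percentile cut used by CL.
    return [(x, y) for x, y, c in zip(samples, pseudo_labels, confidences) if c >= tau]
```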
4.2. Effectiveness of Reinitializing vs Finetuning
- Reinitializing the model yields at least 1% improvement and does not add a significant overhead to the proposed self-paced approach.
As shown above, reinitializing the model, as opposed to finetuning, indeed improves the accuracy significantly, demonstrating an alternative and perhaps simpler solution to alleviate the issue of confirmation bias.
Reference
[2021 AAAI] [Curriculum Labeling (CL)]
Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning
Pretraining or Weakly/Semi-Supervised Learning
2004 … 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] 2020 [BiT] [Noisy Student] [SimCLRv2] [UDA] [ReMixMatch] [FixMatch] 2021 [Curriculum Labeling (CL)]