Review — Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

Curriculum Labeling (CL), By Restarting Model Parameters Before Each Self-Training Cycle, Outperforms Pseudo Labeling (PL)

Sik-Ho Tsang
6 min read · Jun 16, 2022
Comparison of Regular Pseudo Labeling (PL) and Pseudo Labeling with Curriculum Labeling (CL) on the “Two Moons” Synthetic Dataset (in Paper Appendix)

Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning, Curriculum Labeling, by University of Virginia
2021 AAAI, Over 40 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification, Pseudo Label

  • Pseudo Labeling (PL) works by applying pseudo-labels to samples in the unlabeled set for model training in a self-training cycle.
  • In Curriculum Labeling (CL), the curriculum learning principle is applied, and concept drift is avoided by restarting model parameters before each self-training cycle.


  1. Pseudo Labeling (PL) Brief Review
  2. Proposed Curriculum Labeling (CL)
  3. Experimental Results
  4. Ablation Study

1. Pseudo Labeling (PL) Brief Review

Pseudo-Label for Unlabeled Data (Figure from Here)
  • Pseudo-labels are target classes assigned to unlabeled data as if they were true labels. For each unlabeled sample, the class with the maximum probability predicted by the network is picked.
  • Pseudo-Label is used in a fine-tuning phase with Dropout. The pre-trained network is trained in a supervised fashion with labeled and unlabeled data simultaneously.
  • (Please feel free to read Pseudo Label story if interested.)
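As a small illustration of the rule above, the pseudo-label is simply the argmax of the predicted class probabilities. This is only a sketch using NumPy; the function name assign_pseudo_labels and the example probabilities are hypothetical, not taken from the paper.

```python
import numpy as np

def assign_pseudo_labels(probs):
    """Return (labels, confidences) for per-class probabilities.

    probs has shape (num_samples, num_classes); each row sums to 1.
    """
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)      # class with the maximum predicted probability
    confidences = probs.max(axis=1)    # the maximum probability itself
    return labels, confidences

# Hypothetical softmax outputs for three unlabeled samples and three classes.
example_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
])
labels, confidences = assign_pseudo_labels(example_probs)
print(labels)       # [0 2 1]
print(confidences)  # [0.7 0.6 0.5]
```

The confidence value returned alongside each label is what a threshold, fixed or curriculum-based, would later be applied to.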

2. Proposed Curriculum Labeling (CL)

2.1. Framework

Curriculum Labeling (CL) Framework
  1. The model is trained on the labeled samples.
  2. Then this model is used to predict and assign pseudo-labels for the unlabeled samples.
  3. The distribution of the prediction scores is used to select a subset of pseudo-labeled samples.
  4. A new model is re-trained with the labeled and pseudo-labeled samples.
  5. This process is repeated by re-labeling unlabeled samples using this new model. The process stops when all samples in the dataset have been used during training.
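The five steps above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: a nearest-centroid classifier (the hypothetical CentroidClassifier) stands in for the neural network, and the schedule of admitted fractions (20%, 40%, ..., 100%) is an assumption made for this sketch.

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in for a neural network: one centroid per class."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Softmax over negative squared distances to each centroid.
        d2 = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        logits = -d2
        logits -= logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)


def curriculum_labeling(X_lab, y_lab, X_unlab, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    # Step 1: train on the labeled samples only.
    model = CentroidClassifier().fit(X_lab, y_lab)
    for frac in fractions:
        # Step 2: pseudo-label all unlabeled samples with the current model.
        probs = model.predict_proba(X_unlab)
        pseudo = model.classes_[probs.argmax(axis=1)]
        conf = probs.max(axis=1)
        # Step 3: keep the most confident fraction of pseudo-labeled samples.
        k = int(np.ceil(frac * len(X_unlab)))
        keep = np.argsort(-conf)[:k]
        # Step 4: train a NEW model (fresh parameters) on labeled + selected samples.
        X_train = np.concatenate([X_lab, X_unlab[keep]])
        y_train = np.concatenate([y_lab, pseudo[keep]])
        model = CentroidClassifier().fit(X_train, y_train)
        # Step 5: repeat; the loop ends once frac reaches 1.0 (all samples used).
    return model


# Usage on two well-separated synthetic blobs with only two labels per class.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
X1 = rng.normal(loc=+2.0, scale=0.5, size=(50, 2))
X_lab = np.concatenate([X0[:2], X1[:2]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.concatenate([X0[2:], X1[2:]])
final_model = curriculum_labeling(X_lab, y_lab, X_unlab)
pred = final_model.classes_[final_model.predict_proba(X_unlab).argmax(axis=1)]
```

Note that a new CentroidClassifier is created in every cycle, which mirrors the "new model is re-trained" wording in step 4 rather than continuing from the previous parameters.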

2.2. Details

Curriculum Labeling (CL) Algorithm
  • To be specific, percentile scores are used to decide which samples to add. The above algorithm shows the full pipeline of the model, where Percentile(X, Tr) returns the value of the r-th percentile. Values of r from 0% to 100% are used in increments of 20.
  • The repeating process terminates when the pseudo-labeled set comprises all the training data samples (r=100%).
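To make the percentile-based selection concrete, here is a minimal sketch using np.percentile. It assumes that r is the share of unlabeled samples to admit, so the cut-off is the (100 - r)-th percentile of the confidence scores; this indexing convention, the function name select_confident, and the example scores are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np

def select_confident(confidences, r):
    """Select the top r percent of samples by confidence.

    Assumption: r is the share (in percent) of unlabeled samples to admit,
    so the cut-off is the (100 - r)-th percentile of the confidence scores.
    """
    threshold = np.percentile(confidences, 100 - r)
    mask = confidences >= threshold
    return threshold, mask

# Hypothetical maximum-probability scores for ten unlabeled samples.
conf = np.array([0.51, 0.55, 0.62, 0.70, 0.78, 0.85, 0.90, 0.94, 0.97, 0.99])
for r in range(20, 101, 20):
    threshold, mask = select_confident(conf, r)
    print(f"r={r:3d}%  threshold={threshold:.3f}  selected={int(mask.sum())}")
```

On these ten scores the loop admits 2, 4, 6, 8 and finally all 10 samples, while the threshold itself falls from cycle to cycle. The threshold is therefore derived from the score distribution instead of being picked by hand.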

2.3. Loss

  • The data consists of N labeled examples (Xi, Yi) and M unlabeled examples Xj. Let H be a set of hypotheses, where each hθ ∈ H denotes a function mapping X to Y.
  • Let Lθ(Xi) denote the loss for a given example Xi. To choose the best predictor with the lowest possible error, the formulation can be explained with a regularized Empirical Risk Minimization (ERM) framework.
  • L(θ) is defined as the pseudo-labeling regularized empirical loss, where CEE indicates cross entropy.
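For orientation, a pseudo-labeling objective of this kind is commonly written as a supervised cross-entropy term over the N labeled examples plus a cross-entropy term over the M pseudo-labeled examples. The form below is an assumed sketch built from the symbols defined above; the pseudo-label symbol \hat{Y}_j and the 1/N and 1/M normalizations are assumptions, and the exact expression in the paper may differ.

```latex
\mathcal{L}(\theta) =
  \frac{1}{N}\sum_{i=1}^{N} \mathrm{CEE}\big(h_\theta(X_i),\, Y_i\big)
  + \frac{1}{M}\sum_{j=1}^{M} \mathrm{CEE}\big(h_\theta(X_j),\, \hat{Y}_j\big)
```

Here the second sum acts as the regularizer: it only contributes through unlabeled samples that have been assigned a pseudo-label \hat{Y}_j.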

3. Experimental Results

3.1. SOTA Comparison

Test error rate on CIFAR-10 and SVHN using WideResNet-28 (WRN)
Test error rate on CIFAR-10 and SVHN using CNN-13 (All-CNN)
  • CNN-13 (from All-CNN) and WideResNet-28 (WRN; depth 28, width 2) are used for CIFAR-10 and SVHN.
  • Data augmentation applies transformations in an entirely random fashion, which is referred to as Random Augmentation (RA).

CL surprisingly surpasses previous pseudo-labeling based methods and consistency regularization methods on CIFAR-10.

  • On SVHN, CL obtains competitive test error when compared with all previous methods that rely on moderate augmentation, moderate-to-high data augmentation, and heavy data augmentation.
Comparison of test error rate using WideResNet (WRN) varying the size of the labeled samples on CIFAR-10
  • A common practice for testing SSL algorithms is to vary the size of the labeled data, using 50, 100 and 200 samples per class.

CL does not drastically degrade when dealing with smaller labeled sets.
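The protocol of varying the labeled-set size can be sketched as a per-class random split. This is only an illustration of the idea; the helper split_labeled_unlabeled, the seed, and the synthetic label vector are hypothetical and do not reproduce the paper's actual data splits.

```python
import numpy as np

def split_labeled_unlabeled(labels, per_class, seed=0):
    """Pick `per_class` labeled indices for every class; the rest are unlabeled."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    labeled = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        labeled.append(rng.choice(idx, size=per_class, replace=False))
    labeled = np.sort(np.concatenate(labeled))
    unlabeled = np.setdiff1d(np.arange(len(labels)), labeled)
    return labeled, unlabeled

# Hypothetical label vector: 10 classes with 500 samples each.
y = np.repeat(np.arange(10), 500)
for per_class in (50, 100, 200):
    lab, unlab = split_labeled_unlabeled(y, per_class)
    print(per_class, len(lab), len(unlab))
```

Sampling the same number of labels per class keeps the labeled set balanced, so a change in error can be attributed to the amount of supervision and not to class imbalance.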

Top-1 and top-5 accuracies on ImageNet with 10% of the labeled set
  • ResNet-50 is used on ImageNet.
  • 10% of the training split is used as labeled data and the remaining 90% as unlabeled data.

On ImageNet, CL achieves results competitive with the state of the art, with scores very close to the current top-performing method, UDA, on both top-1 and top-5 accuracies.

3.2. Realistic Evaluation with Out-of-Distribution Unlabeled Samples

Comparison of test error on CIFAR-10 (six animal classes) with varying overlap between classes
  • In a more realistic SSL setting of Oliver NeurIPS’18, the unlabeled data may not share the same class set as the labeled data.
  • The experiment is reproduced by synthetically varying the class overlap on CIFAR-10, choosing only the animal classes to perform the classification (bird, cat, deer, dog, frog, horse).

CL is robust to out-of-distribution classes, while the performance of previous methods drops significantly. It is conjectured that the proposed self-pacing curriculum is key to this scenario, where the adaptive thresholding scheme could help filter the out-of-distribution unlabeled samples during training.

4. Ablation Study

4.1. Effectiveness of Curriculum Labeling

Test errors when using pseudo-labeling without a curriculum (the threshold is set to 0.0)
  • Different data augmentations, i.e. mixup and SWA, are used when applying vanilla pseudo-labeling with no curriculum, and without a specific threshold (i.e. 0.0).

Only when heavy data augmentation is used can Pseudo Labeling match the proposed curriculum design without any data augmentation.

Test errors when using pseudo-labeling with several fixed thresholds and different data augmentation techniques
  • Several fixed thresholds for including pseudo-labeled unlabeled data in Pseudo Labeling (PL) are tried.

The proposed curriculum design is able to yield a significant gain over the traditional pseudo-labeling approach that uses a fixed threshold even when heavy data augmentation is applied.

Test errors when using two static thresholds (0.9 and 0.9995) and the proposed self-pacing training
  • Only the most confident samples are re-labeled in CL. For comparison, the confidence thresholds are fixed to 0.9 and 0.9995.

As seen, using handpicked thresholds is sub-optimal.
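One intuition for why a handpicked threshold can be sub-optimal is that the number of admitted samples depends entirely on how confident the model happens to be, whereas a percentile cut-off always admits a fixed share. The sketch below illustrates this with two synthetic confidence distributions; the distributions and helper names are hypothetical and are not results from the paper.

```python
import numpy as np

def count_selected_fixed(confidences, threshold):
    """Number of samples whose confidence reaches a fixed threshold."""
    return int((confidences >= threshold).sum())

def count_selected_percentile(confidences, top_percent):
    """Number of samples at or above the (100 - top_percent)-th percentile."""
    cutoff = np.percentile(confidences, 100 - top_percent)
    return int((confidences >= cutoff).sum())

rng = np.random.default_rng(0)
# Two hypothetical models: an under-confident one and an over-confident one.
under_confident = rng.uniform(0.30, 0.80, size=1000)
over_confident = rng.uniform(0.9996, 1.0, size=1000)

for name, conf in [("under-confident", under_confident),
                   ("over-confident", over_confident)]:
    print(name,
          count_selected_fixed(conf, 0.9),
          count_selected_fixed(conf, 0.9995),
          count_selected_percentile(conf, 20))
```

With the under-confident scores both fixed thresholds admit nothing, and with the over-confident scores they admit everything at once, while the percentile rule admits the top 20% in both cases.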

4.2. Effectiveness of Reinitializing vs Finetuning

Comparison of model reinitialization and finetuning, in each iteration of training
  • Reinitializing the model yields at least 1% improvement and does not add a significant overhead to the proposed self-paced approach.

As shown above, reinitializing the model, as opposed to finetuning, indeed improves the accuracy significantly, demonstrating an alternative and perhaps simpler solution to alleviate the issue of confirmation bias.
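The mechanical difference between the two options can be sketched with a tiny logistic-regression model trained by gradient descent. This is a toy illustration under stated assumptions: the helpers init_params, train and next_cycle, the learning rate, and the synthetic data are all hypothetical, and the sketch says nothing about the accuracy gap reported in the paper.

```python
import numpy as np

def init_params(num_features, seed=0):
    """Fresh parameters: small random weights and zero bias."""
    rng = np.random.default_rng(seed)
    return {"w": rng.normal(scale=0.01, size=num_features), "b": 0.0}

def train(params, X, y, lr=0.1, steps=200):
    """Plain gradient descent on the logistic loss; returns updated params."""
    w, b = params["w"].copy(), params["b"]
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return {"w": w, "b": b}

def next_cycle(params, X_train, y_train, reinitialize=True):
    """One self-training cycle, either from scratch or from previous weights."""
    start = init_params(X_train.shape[1]) if reinitialize else params
    return train(start, X_train, y_train)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(float)
# Weights left over from a hypothetical earlier cycle.
previous = {"w": np.array([5.0, -5.0]), "b": 1.0}
fresh = next_cycle(previous, X, y, reinitialize=True)
tuned = next_cycle(previous, X, y, reinitialize=False)
print(fresh["w"], tuned["w"])
```

With reinitialize=True the result does not depend on the leftover weights at all, so whatever errors the earlier cycle baked into its parameters cannot carry over; with reinitialize=False training continues from them.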


[2021 AAAI] [Curriculum Labeling (CL)]
Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

Pretraining or Weakly/Semi-Supervised Learning

2004 … 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] 2020 [BiT] [Noisy Student] [SimCLRv2] [UDA] [ReMixMatch] [FixMatch] 2021 [Curriculum Labeling (CL)]
