Review — Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Sik-Ho Tsang
Apr 28, 2022


Realistic Evaluation of Deep Semi-Supervised Learning Algorithms
Oliver NeurIPS’18, by Google Brain
2018 NeurIPS, Over 600 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification

  • Semi-supervised learning (SSL) has been widely studied, but without unified experimental settings or evaluation protocols.
  • The paper provides a unified reimplementation and evaluation of various semi-supervised learning approaches.
  • This paper comes from the research group of Ian Goodfellow.


  1. Improved Evaluation
  2. Experimental Results

1. Improved Evaluation

  • SSL algorithms are usually evaluated as follows. First, a common (typically image classification) dataset used for supervised learning is taken and the labels are discarded for most of it. Then, the portion whose labels were retained is treated as a small labeled dataset D, and the remainder as an auxiliary unlabeled dataset D_UL.
  • Some SSL improvements and issues are described as follows according to the paper:
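
Before going through those points, the labeled/unlabeled split described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name make_ssl_split, its parameters, and the uniform (not class-balanced) sampling are assumptions made here, not details taken from the paper.

```python
import random


def make_ssl_split(examples, labels, n_labeled, n_unlabeled=None, seed=0):
    """Split a fully labeled dataset into a small labeled set D and an
    unlabeled set D_UL by discarding most labels.

    Hypothetical helper for illustration; sampling is uniform at random.
    """
    if len(examples) != len(labels):
        raise ValueError("examples and labels must have the same length")
    if n_labeled > len(examples):
        raise ValueError("n_labeled exceeds dataset size")

    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)

    labeled_idx = order[:n_labeled]
    unlabeled_idx = order[n_labeled:]
    if n_unlabeled is not None:
        # Optionally keep only part of the remainder, to vary the size of D_UL.
        unlabeled_idx = unlabeled_idx[:n_unlabeled]

    D = [(examples[i], labels[i]) for i in labeled_idx]
    D_UL = [examples[i] for i in unlabeled_idx]  # labels are thrown away
    return D, D_UL


# Toy usage: 10 examples, keep 4 labels, treat the other 6 as unlabeled.
toy_x = [f"img_{i}" for i in range(10)]
toy_y = [i % 2 for i in range(10)]
D, D_UL = make_ssl_split(toy_x, toy_y, n_labeled=4)
```

The optional n_unlabeled argument is there so that the same helper can also vary the size of D_UL, which is the topic of Section 1.5.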

1.1. A Shared Implementation

  • Prior work uses different reimplementations of a simple 13-layer convolutional network [32, 39, 50], which leads to variability in implementation details (parameter initialization, data preprocessing, data augmentation, regularization, etc.). Further, the training procedure (optimizer, number of training steps, learning rate decay schedule, etc.) is not standardized.
  • After unification, the only hyperparameters which varied across different SSL algorithms were the learning rate, consistency coefficient, and any hyperparameters unique to a given algorithm.

A shared implementation of the underlying architectures is introduced to compare all of the SSL methods.

1.2. High-Quality Supervised Baseline

  • The goal of SSL is to obtain better performance using the combination of D and D_UL than what would be obtained with D alone. A natural baseline to compare against is the same underlying model (with modified hyperparameters) trained in a fully-supervised manner using only D.

The same budget of 1000 hyperparameter optimization trials is used to tune the baseline and each of the SSL methods.
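
The idea of giving the baseline the same tuning budget as the SSL methods can be sketched as follows. Note the assumptions: plain random search is used here as a stand-in, since this review does not say which optimizer was used, and toy_validation_error, sample_lr, and the method names are hypothetical placeholders for a real training run.

```python
import math
import random


def random_search(evaluate, sample_config, n_trials, seed=0):
    """Return the (config, validation error) pair with the lowest error
    found in n_trials random trials. Random search is an assumption."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        error = evaluate(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error


def sample_lr(rng):
    # Hypothetical search space: learning rate sampled log-uniformly.
    return {"learning_rate": 10 ** rng.uniform(-5, -1)}


def toy_validation_error(config):
    # Toy objective standing in for "train the model, return validation error".
    return (math.log10(config["learning_rate"]) + 3.0) ** 2


N_TRIALS = 1000  # the same budget for every method, baseline included
results = {
    name: random_search(toy_validation_error, sample_lr, N_TRIALS, seed=i)
    for i, name in enumerate(["supervised_baseline", "ssl_method"])
}
```

The point of the sketch is only the shared N_TRIALS: an under-tuned baseline makes any SSL method look better than it is.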

1.3. Comparison to Transfer Learning

  • A common way to deal with limited data is to “transfer” a model trained on a separate, but similar, large labeled dataset.

The trained model is fine-tuned using the small dataset. This provides a powerful, widely-used, and rarely reported baseline to compare against.

1.4. Considering Class Distribution Mismatch

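The effect of differing class distributions between labeled and unlabeled data, i.e. when the unlabeled data contains classes that do not appear in the labeled data, is studied. This is related to domain adaptation, where the data distribution for test samples differs from the training distribution.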
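
One way to set up such an experiment is to choose the classes of the unlabeled data so that only a given fraction of them also appears among the labeled classes. The sketch below is an illustration under assumptions: the helper names, the overlap parameter, and the choice of four unlabeled classes are made up here; only the split of CIFAR-10 into six animal and four non-animal classes is a property of the dataset.

```python
import random

# CIFAR-10 class names; the six animal classes form the labeled task.
ANIMALS = ["bird", "cat", "deer", "dog", "frog", "horse"]
NON_ANIMALS = ["airplane", "automobile", "ship", "truck"]


def pick_unlabeled_classes(labeled_classes, other_classes,
                           n_unlabeled_classes, overlap, seed=0):
    """Choose classes for D_UL so that a fraction `overlap` of them is
    shared with the labeled classes (hypothetical helper)."""
    rng = random.Random(seed)
    n_shared = round(overlap * n_unlabeled_classes)
    shared = rng.sample(sorted(labeled_classes), n_shared)
    mismatched = rng.sample(sorted(other_classes),
                            n_unlabeled_classes - n_shared)
    return shared + mismatched


def filter_by_class(examples, labels, classes):
    """Keep only examples of the given classes and drop their labels."""
    keep = set(classes)
    return [x for x, y in zip(examples, labels) if y in keep]


# Example: 4 unlabeled classes, half of them shared with the labeled task.
unlabeled_classes = pick_unlabeled_classes(ANIMALS, NON_ANIMALS, 4, overlap=0.5)
```

Sweeping overlap from 0 to 1 then gives a range of conditions from fully mismatched to fully matched unlabeled data.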

1.5. Varying the Amount of Labeled and Unlabeled Data

  • A somewhat common practice is to vary the size of D by throwing away different amounts of the underlying labeled dataset [48, 43, 45, 50].

Less common is to vary the size of D_UL in a systematic way.

1.6. Realistically Small Validation Sets

  • An unusual artefact of the way artificial SSL datasets are created is that often the validation set (data used for tuning hyperparameters and not model parameters) is significantly larger than the training set.

In real-world applications, this large validation set would instead be used as the training set. The relationship between the validation set size and the variance in estimates of a model’s accuracy is analyzed.
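
To get a feeling for this relationship, one standard tool is Hoeffding's inequality, which for n independent validation examples gives P(|estimate − true accuracy| ≥ ε) ≤ 2·exp(−2nε²). The sketch below simply inverts that bound; the function names and the example numbers (a 400-example validation set, i.e. 10% of 4,000 labels) are assumptions chosen for illustration.

```python
import math


def hoeffding_min_samples(epsilon, delta):
    """Smallest n such that the validation accuracy is within +/- epsilon
    of the true accuracy with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_epsilon(n, delta):
    """Half-width of the (1 - delta) confidence interval for n examples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


# Validation examples needed for a +/- 1% estimate at 95% confidence.
n_needed = hoeffding_min_samples(epsilon=0.01, delta=0.05)

# Precision obtainable from a 400-example validation set at 95% confidence.
eps_small = hoeffding_epsilon(n=400, delta=0.05)
```

Under this bound, a ±1% estimate at 95% confidence needs on the order of 18,000 validation examples, while 400 examples only guarantee roughly ±7 percentage points. The bound is conservative, but it shows why small validation sets give noisy estimates.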

2. Experimental Results

2.1. Reproduction

Test error rates obtained by various SSL approaches on the standard benchmarks of CIFAR-10 with all but 4,000 labels removed and SVHN with all but 1,000 labels removed, using unified reimplementation
  • The models at the point of lowest validation error for the hyperparameter settings are used for measuring test error.
  • Different papers used different settings. After unifying them, VAT+EntMin obtains the best results, outperforming Π-Model, Mean Teacher, VAT, and Pseudo-Label (PL).
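
The reporting rule in the first bullet (measure test error at the point of lowest validation error) can be sketched as below. The checkpoint records and their numbers are invented purely for illustration and are not results from the paper.

```python
def select_checkpoint(checkpoints):
    """Return the checkpoint with the lowest validation error.

    Each checkpoint is a dict with 'step', 'val_error' and 'test_error'.
    The test error is never used for selection, only for reporting.
    """
    return min(checkpoints, key=lambda c: c["val_error"])


# Hypothetical training log (made-up numbers).
log = [
    {"step": 10000, "val_error": 0.21, "test_error": 0.22},
    {"step": 20000, "val_error": 0.17, "test_error": 0.19},
    {"step": 30000, "val_error": 0.18, "test_error": 0.18},
]

best = select_checkpoint(log)
reported_test_error = best["test_error"]
```

In this toy log the reported test error is 0.19, not the lowest test error in the log (0.18): picking the checkpoint by test error would leak test information into model selection.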

2.1. Fully-Supervised Baseline

Reported change in error rate from fully-supervised (no unlabeled data) to SSL
  • The gap between the fully-supervised baseline and the SSL results is found to be smaller in this study than what is generally reported in the literature.

2.2. Transfer Learning

Comparison of error rate using SSL and transfer learning
  • After training on downsized ImageNet, the model is fine-tuned using 4,000 labeled data points from CIFAR-10.

This yields a lower error rate than any SSL technique achieved using this network, indicating that transfer learning may be a preferable alternative when a labeled dataset suitable for transfer is available.

2.3. Class Distribution Mismatch

Left: Test error for each SSL technique on CIFAR-10 (six animal classes) with varying overlap between classes in the labeled and unlabeled data. Right: Test error for each SSL technique on SVHN with 1,000 labels and varying amounts of unlabeled images from SVHN-extra
  • Left: Adding unlabeled data from a mismatched set of classes can actually hurt performance compared to not using any unlabeled data at all.

This implies that it may be preferable to pay a larger cost to obtain labeled data than to obtain unlabeled data if the unlabeled data is sufficiently unrelated to the core learning task.

2.4. Varying Data Amount

  • Figure above, right: As expected, increasing the amount of unlabeled data tends to improve the performance of SSL techniques.

However, it is found that performance levelled off consistently across algorithms once 80,000 unlabeled examples were available. The SSL techniques also differ in how sensitive they are to the amount of data.

  • Below Figures: Varying the amount of labeled data tests how performance degrades in the very-limited-label regime.
Test error for each SSL technique on SVHN and CIFAR-10 as the amount of labeled data varies. Shaded regions indicate standard deviation over five trials

In general, the performance of all of the SSL techniques tends to converge as the number of labels grows.

2.5. Small Validation Sets

Left: Average validation error over 10 randomly-sampled nonoverlapping validation sets of varying size. Right: Average and standard deviation of relative error over 10 randomly-sampled nonoverlapping validation sets of varying size.
  • For a realistically-sized validation set (10% of the training set size), differentiating between the performance of the models is not feasible.

SSL methods which rely on heavy hyperparameter tuning on a large validation set may have limited real-world applicability. Cross-validation can help with this problem, but the reduction of variance may still be insufficient and its use would incur an N-fold computational increase.

  • It is argued that with realistically small validation sets, model selection may not be feasible.
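
A small simulation can make this argument concrete. The sketch below estimates how often a validation set of a given size fails to prefer the truly better of two models. Everything in it is an assumption for illustration: the true error rates (15% vs. 16%), the validation sizes, and the simplification that each model's mistakes are independent coin flips.

```python
import random


def observed_error(true_error, n_val, rng):
    """Observed error on n_val examples, each wrong with prob. true_error."""
    mistakes = sum(rng.random() < true_error for _ in range(n_val))
    return mistakes / n_val


def wrong_pick_rate(err_a, err_b, n_val, n_trials=200, seed=0):
    """Fraction of simulated validation sets on which model A (truly
    better, err_a < err_b) does NOT come out strictly ahead of model B."""
    if not err_a < err_b:
        raise ValueError("model A must have the lower true error")
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_trials):
        obs_a = observed_error(err_a, n_val, rng)
        obs_b = observed_error(err_b, n_val, rng)
        if obs_a >= obs_b:  # ties count as failures to pick A
            wrong += 1
    return wrong / n_trials


# Two hypothetical models whose true error rates differ by one point.
rate_small_val = wrong_pick_rate(0.15, 0.16, n_val=100)
rate_large_val = wrong_pick_rate(0.15, 0.16, n_val=5000)
```

With a small validation set the wrong model is picked in a large share of the simulated trials, and the share shrinks as the validation set grows, which is the intuition behind the claim that model selection may not be feasible with realistically small validation sets.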


