Review — Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks

Pseudo Labels for Unlabeled Data

Sik-Ho Tsang
4 min read · Jan 25, 2022


Pseudo-Label for Unlabeled Data (figure)

Pseudo-Label (PL), by Nangman Computing
2013 ICLRW, Over 1500 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Pseudo Label, Image Classification

  • Unlabeled data is labelled by a network trained with supervised learning; this is called pseudo-labeling.
  • The network is then trained using both the labeled data and the pseudo-labeled data.
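The two steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; `probs` stands in for the softmax outputs of the supervised-learnt network on unlabeled samples:

```python
import numpy as np

def pseudo_label(probs):
    """Convert predicted class probabilities for unlabeled samples into
    hard one-hot pseudo-labels by picking the argmax class per sample."""
    labels = np.zeros_like(probs)
    labels[np.arange(len(probs)), probs.argmax(axis=1)] = 1.0
    return labels

# Toy predictions from the supervised-learnt network on 2 unlabeled samples
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])
print(pseudo_label(probs))
# [[0. 1. 0.]
#  [1. 0. 0.]]
```

These one-hot pseudo-labels are then treated exactly like true labels when the combined training set is formed.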


  1. Pseudo-Label (PL)
  2. Experimental Results

1. Pseudo-Label (PL)

  • Pseudo-Labels are target classes for unlabeled data, treated as if they were true labels. For each unlabeled sample, the class with the maximum predicted probability from the network is picked:
  • Pseudo-Label is used in a fine-tuning phase with Dropout. The pre-trained network is trained in a supervised fashion with labeled and unlabeled data simultaneously:
  • where n is the number of labeled samples in a mini-batch for SGD and n′ is the number of unlabeled samples; C is the number of classes;
  • fmi is the output for the m-th labeled sample and ymi is the corresponding label;
  • f′mi is the output for the m-th unlabeled sample and y′mi is the corresponding pseudo-label;
  • α(t) is a coefficient balancing the two terms at epoch t. If α(t) is too high, it disturbs training even for the labeled data, whereas if α(t) is too small, we cannot benefit from the unlabeled data.
  • α(t) is slowly increased, to help the optimization process avoid poor local minima:
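The schedule can be sketched as below. This is a minimal sketch, not the paper's code: `T1 = 100`, `T2 = 600`, and `alpha_f = 3` follow the values the paper reports for its MNIST setting, and `total_loss` is an illustrative helper combining the supervised loss with the α(t)-weighted pseudo-label loss:

```python
def alpha(t, T1=100, T2=600, alpha_f=3.0):
    """Deterministic-annealing schedule for the balancing coefficient:
    zero early on, ramped linearly between T1 and T2, then constant."""
    if t < T1:
        return 0.0
    if t < T2:
        return alpha_f * (t - T1) / (T2 - T1)
    return alpha_f

def total_loss(labeled_loss, unlabeled_loss, t):
    """Overall objective: supervised loss on labeled data plus the
    alpha(t)-weighted loss on pseudo-labeled (unlabeled) data."""
    return labeled_loss + alpha(t) * unlabeled_loss

print(alpha(50), alpha(350), alpha(1000))  # 0.0 1.5 3.0
```

Keeping α(t) at zero for the first T1 epochs lets the network first fit the labeled data, so that the pseudo-labels are reasonable before they start contributing to the loss.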

2. Experimental Results

2.1. t-SNE Visualization

t-SNE 2-D embedding of the network output of MNIST test data.
  • MNIST dataset is used.
  • The neural network was trained with 600 labeled data and with or without 60000 unlabeled data and Pseudo-Labels.
  • The neural network has 1 hidden layer with 5000 units. ReLU is used for the hidden units, and Sigmoid units are used for the output.

Though the training error is zero in both cases, the network outputs on test data are more condensed near the 1-of-K code when training with unlabeled data and Pseudo-Labels.

2.2. Entropy

The conditional entropy of the network output on labeled (train) data, unlabeled data, and test data on MNIST.
  • DropNN: Trained without unlabeled data. (Drop refers to Dropout.)
  • +PL: Trained with unlabeled data and Pseudo-Labels.

Though the entropy of the labeled data is near zero in both cases, the entropy of the unlabeled data gets lower with Pseudo-Label training, and the entropy of the test data gets lower along with it.

2.3. Error Rate

Classification error on the MNIST test set with 600, 1000 and 3000 labeled training samples.
  • The size of the labeled training set is reduced to 100, 600, 1000 and 3000. For the validation set, 1000 labeled examples are picked separately.
  • 10 experiments on random splits were done using the identical network and parameters. In the case of 100 labeled samples, the results depended heavily on the data split, so 30 experiments were done.

The proposed method outperforms the conventional methods for small labeled data in spite of its simplicity. The training scheme is less complex than the Manifold Tangent Classifier and doesn't use a computationally expensive similarity matrix between samples.


