Review: Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning

Semi-Supervised Learning Using Different Augmented Versions of the Same Sample and Different Perturbations of the Network

Sik-Ho Tsang
Apr 7, 2022


Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning, Sajjadi NIPS’16, by University of Utah
2016 NIPS, Over 500 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification

  • An unsupervised loss function is proposed to minimize the difference between the predictions of multiple passes of a training sample.
  • Multiple passes of a training sample consist of different augmented versions of the same sample and also different perturbations of the network.


  1. Proposed Unsupervised Loss Function
  2. Mutual-Exclusivity Loss Function
  3. Total Loss for Semi-Supervised Learning
  4. Experimental Results

1. Proposed Unsupervised Loss Function

  • Given any training sample, a model’s prediction should be the same under any random transformation of the data and perturbations to the model.
Transformation Using Data Augmentation (Figure from Author Slide, link at the end)

The transformations can be any linear or non-linear data augmentation used to extend the training data.

Disturbance Using Dropout (Figure from Author Slide, link at the end)

The disturbances include Dropout techniques and randomized pooling schemes.

  • In each pass, each sample can be randomly transformed or the hidden nodes can be randomly activated.
  • As a result, the network’s prediction can be different for multiple passes of the same training sample.
  • The network’s prediction is expected to be the same despite transformations and disturbances.
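To make the idea of multiple passes concrete, here is a minimal sketch in plain Python, not the authors' code. The tiny linear classifier, the `random_transform` flip, and dropout applied to the inputs (rather than to hidden nodes) are all assumptions chosen for illustration. It passes one sample through the model n=4 times and prints the prediction of each pass.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability vector."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_transform(x, rng):
    """Hypothetical augmentation: reverse the 1-D sample half of the time (a stand-in for flipping)."""
    return list(reversed(x)) if rng.random() < 0.5 else list(x)

def stochastic_pass(x, weights, rng, drop_prob=0.5):
    """One pass: transform the input, randomly drop units, then classify."""
    x = random_transform(x, rng)
    # Inverted dropout on the inputs (assumption): zero a unit with
    # probability drop_prob and rescale the units that are kept.
    kept = [v / (1.0 - drop_prob) if rng.random() >= drop_prob else 0.0 for v in x]
    scores = [sum(w * v for w, v in zip(row, kept)) for row in weights]
    return softmax(scores)

rng = random.Random(0)
# Made-up weights for a classifier with 4 inputs and 3 classes.
weights = [[0.2, -0.1, 0.4, 0.3], [-0.3, 0.5, 0.1, -0.2], [0.1, 0.1, -0.4, 0.6]]
sample = [1.0, 2.0, 3.0, 4.0]

predictions = [stochastic_pass(sample, weights, rng) for _ in range(4)]  # n = 4 passes
for j, p in enumerate(predictions, start=1):
    print(j, [round(v, 3) for v in p])
```

Each pass draws its own transformation and dropout mask, so the printed prediction vectors generally differ even though the sample and the weights are fixed.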

An unsupervised loss function is introduced that minimizes the mean squared differences between different passes of an individual training sample through the network.

  • The proposed loss function is completely unsupervised and can be used along with supervised training as a semi-supervised learning method.
  • A dataset with N training samples and C classes is used.
  • fj(xi) is the classifier’s prediction vector on the i-th training sample during the j-th pass through the network.
  • Each training sample is passed n times through the network.
  • Tj(xi) is a random linear or non-linear transformation on the training sample xi before the j-th pass through the network.
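  • The proposed loss function, the so-called TS loss (TS stands for transformation/stability), sums the squared l2 distance between the predictions of every pair of passes. For each data sample xi it is:
    $$l_U^{TS}(x_i) = \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} \left\| f^j(T^j(x_i)) - f^k(T^k(x_i)) \right\|_2^2$$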
  • (The concept of transforming the same sample with different data augmentations is similar to the ones in self-supervised learning nowadays.)
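The TS loss described above can be sketched as a sum of squared l2 distances over all pairs of passes for one sample. This is an illustrative plain-Python sketch rather than the authors' implementation; the function name `ts_loss` and the example prediction vectors are made up.

```python
def ts_loss(passes):
    """Sum of squared l2 distances over all pairs of passes for one sample.

    passes: list of n prediction vectors, one per pass through the network.
    """
    n = len(passes)
    loss = 0.0
    for j in range(n - 1):
        for k in range(j + 1, n):
            loss += sum((a - b) ** 2 for a, b in zip(passes[j], passes[k]))
    return loss

# Four passes that agree exactly: nothing to penalize.
identical = [[0.7, 0.2, 0.1]] * 4
# Four passes where one prediction disagrees with the others.
varied = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]

print(ts_loss(identical))
print(ts_loss(varied))
```

The loss is zero only when all passes produce the same prediction vector, which is exactly the stability the method asks for.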

2. Mutual-Exclusivity Loss Function

  • The mutual-exclusivity loss function of [30] forces the classifier’s prediction vector to have only one non-zero element. This loss function naturally complements the transformation/stability loss function.
  • In supervised learning, each element of the prediction vector is pushed towards zero or one depending on the corresponding element in the label vector.
  • The TS loss proposed in Section 1 minimizes the l2-norm of the difference between the predictions of multiple transformed versions of a sample, but it does not impose any restriction on the individual elements of a single prediction vector. As a result, without labels, each prediction vector might be a trivial solution instead of a valid prediction.
  • (This collapsing issue also happens in self-supervised learning.)

The mutual-exclusivity loss function forces each prediction vector to be valid and prevents trivial solutions.

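  • Let fjk(xi) be the k-th element of the prediction vector fj(xi). The so-called ME loss (ME stands for mutual-exclusivity) for the training sample xi is defined as:
    $$l_U^{ME}(x_i) = \sum_{j=1}^{n} \left( -\sum_{k=1}^{C} f_k^j(x_i) \prod_{l=1,\, l \neq k}^{C} \left( 1 - f_l^j(x_i) \right) \right)$$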
  • In the experiments, the authors show that combining both loss functions further improves the accuracy of the models.
  • The combination of both loss functions, transformation/stability plus mutual-exclusivity, is:
    $$l_U = \lambda_1 l_U^{ME} + \lambda_2 l_U^{TS}$$
  • where λ1=0.1 weights the ME loss and λ2=1 weights the TS loss.
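As a rough illustration of how the two unsupervised terms interact, the sketch below computes both for one sample. It assumes the ME term has the product form -Σ_k f_k Π_{l≠k} (1 - f_l) summed over passes, and that λ1=0.1 is the ME weight while λ2=1 is the TS weight. The function names and example vectors are hypothetical.

```python
def ts_loss(passes):
    """Squared l2 distance summed over all pairs of passes."""
    n = len(passes)
    return sum(
        sum((a - b) ** 2 for a, b in zip(passes[j], passes[k]))
        for j in range(n - 1)
        for k in range(j + 1, n)
    )

def me_loss(passes):
    """Mutual-exclusivity term (assumed form), summed over passes."""
    total = 0.0
    for f in passes:
        for k, fk in enumerate(f):
            # Product of (1 - f_l) over every element except the k-th.
            prod = 1.0
            for l, fl in enumerate(f):
                if l != k:
                    prod *= 1.0 - fl
            total -= fk * prod
    return total

def unsupervised_loss(passes, lambda_me=0.1, lambda_ts=1.0):
    """Weighted sum of the ME and TS terms (weights as assumed above)."""
    return lambda_me * me_loss(passes) + lambda_ts * ts_loss(passes)

confident = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]

print(unsupervised_loss(confident))
print(unsupervised_loss(uniform))
```

Both examples have zero TS loss because their passes agree, but the one-hot vectors give a lower ME term than the uniform ones. This is how the ME loss discourages stable yet uninformative predictions.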

3. Total Loss for Semi-Supervised Learning

  • The paper says that the TS and ME losses are used along with a supervised loss, but it does not spell out how. It seems to assume that readers know how to train with both labeled and unlabeled samples.
  • I later found in the author’s presentation slides that the total loss is the sum of the supervised loss and the unsupervised loss:
    $$l = l_L + l_U$$
  • where lL is the supervised loss and lU is the unsupervised loss.
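A minimal sketch of the summed objective is shown below. Several details are assumptions, since the post does not specify them: cross-entropy as the supervised loss, averaging the supervised loss over the passes of a labeled sample, applying the unsupervised term to labeled as well as unlabeled samples, and using only the TS term as the unsupervised part to keep the example short.

```python
import math

def cross_entropy(prediction, label):
    """Supervised loss for one prediction vector (assumed choice)."""
    return -math.log(max(prediction[label], 1e-12))

def ts_loss(passes):
    """Unsupervised stand-in: squared l2 distance over all pairs of passes."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(passes[j], passes[k]))
        for j in range(len(passes) - 1)
        for k in range(j + 1, len(passes))
    )

def total_loss(labeled, unlabeled, unsupervised_fn=ts_loss):
    """Supervised loss plus unsupervised loss.

    labeled:   list of (passes, label) pairs
    unlabeled: list of passes (no labels available)
    """
    supervised = sum(
        sum(cross_entropy(p, label) for p in passes) / len(passes)
        for passes, label in labeled
    )
    # The unsupervised term needs no labels, so every sample contributes.
    all_passes = [passes for passes, _ in labeled] + list(unlabeled)
    unsupervised = sum(unsupervised_fn(passes) for passes in all_passes)
    return supervised + unsupervised

labeled = [([[0.8, 0.2], [0.7, 0.3]], 0)]
unlabeled = [[[0.4, 0.6], [0.5, 0.5]]]
print(total_loss(labeled, unlabeled))
```

Unlabeled samples contribute only through the unsupervised term, which is what lets them influence training at all.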

4. Experimental Results

  • Two kinds of models are used: a convnet model, such as AlexNet, and a sparse convolutional network [39].
  • n=4 for the convnet and n=5 for the sparse convolutional network [39].
  • (For the sparse convolutional network architecture, please feel free to read the paper [39] directly. I don’t go into it because the focus of this paper is the losses rather than the model architecture.)

4.1. MNIST

Error rates (%) on test set for MNIST (mean % ± std)
  • A sparse convolutional network [39] is used.
  • No data augmentation is used for this task. In other words, Tj(xi) is the identity function. Only Dropout is used.
  • First, a model is trained on the labeled set of 100 samples only. Then, models are trained with the unsupervised loss functions added.
  • In separate experiments, the transformation/stability loss function, the mutual-exclusivity loss function, and the combination of both are added.
  • The state-of-the-art Maxout [40] obtains an error rate of 0.24% on MNIST using all training data without data augmentation.

Combining both loss functions reduces the error rate to 0.55%, which was the state of the art for MNIST with 100 labeled samples at the time. This is close to the 0.24% obtained with all training data, even though only 100 labeled samples are used.

4.2. SVHN and NORB

Semi-supervised learning vs. training with labeled data only. Left: SVHN, Right: NORB
  • NORB is a collection of stereo images in six classes.
  • A convnet model is used, which consists of two convolutional layers with 64 maps and a kernel size of 5, and two locally connected layers with 32 maps and a kernel size of 3.
  • Different ratios of labeled to unlabeled data are used: 1%, 5%, 10%, 20% and 100% of the training samples are used as labeled data.
  • Random crop and random rotation are used as data augmentation.
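The labeled/unlabeled ratios above could be produced with a split like the following. This is only a sketch: the uniform random split, the helper name `split_labeled`, and the dataset size of 1000 are assumptions, and the post does not say how the authors selected the labeled subset.

```python
import random

def split_labeled(indices, labeled_fraction, rng):
    """Randomly split sample indices into a labeled and an unlabeled part."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_labeled = max(1, round(labeled_fraction * len(shuffled)))
    return shuffled[:n_labeled], shuffled[n_labeled:]

rng = random.Random(0)
for fraction in (0.01, 0.05, 0.10, 0.20, 1.00):
    labeled_idx, unlabeled_idx = split_labeled(range(1000), fraction, rng)
    print(fraction, len(labeled_idx), len(unlabeled_idx))
```

The labels of the unlabeled part are simply ignored during training; those samples are only used by the unsupervised loss functions.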

When the number of labeled samples is small, the improvement is more significant.

For example, when only 1% of the data is labeled, using the unsupervised loss functions improves the accuracy by a factor of about 2.5.

As more labeled samples are added, the difference in accuracy between semi-supervised and supervised approaches becomes smaller.

Error on test data for SVHN and NORB with 1% and 100% of data (mean % ± std)
  • Dropout and random max-pooling are the only sources of variation in the above table.

The unsupervised loss functions significantly improve the accuracy of the classifier by minimizing the variation in the network’s predictions.

  • In addition, for the NORB dataset, using only 1% of the labeled data together with the unsupervised loss functions gives an accuracy close to that obtained with 100% of the labeled data.

4.3. CIFAR10

Error rates on test data for CIFAR10 with 4000 labeled samples (mean % ± std)
  • A sparse convolutional network is used.
  • The first set of models is trained on labeled data only. The other set is trained on both the labeled set and the unlabeled set, using a combination of both unsupervised loss functions.
  • A randomized mix of translations, rotations, flipping, stretching and shearing operations is used as data augmentation.
  • Dropout is also used.

The combination of unsupervised loss functions on unlabeled data improves the accuracy of the models.

  • When all of the labeled and unlabeled data are used with the transformation/stability plus mutual-exclusivity loss function, an error rate of 3.18% is obtained, while the state-of-the-art error rate for this dataset is 3.47% [39].
  • With a larger model (160n vs. 96n), an error rate of 3.00% is achieved.

4.4. CIFAR100

  • The state-of-the-art error rate for this dataset is 23.82% [39].
  • A sparse convolutional network is used.

The proposed loss function minimizes the randomness effect due to Dropout and max-pooling, and achieves an error rate of 21.43%.

4.5. ImageNet

Error rates (%) on ImageNet validation set (Top-5)
  • AlexNet is used.
  • 5 labeled datasets (rep 1 to rep 5) are created from the available training samples. Each dataset consists of 10% of the training data.
  • Random translations, flipping, and color noise are used.
  • Two models are trained. One model is trained using labeled data only. The other model is trained on both the labeled and unlabeled sets using the transformation/stability plus mutual-exclusivity loss function.
  • At each iteration, four different transformed versions of each unlabeled sample are generated.
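Generating four transformed versions of one sample might look like the sketch below. It is a toy stand-in, not the authors' pipeline: the image is a small grayscale grid stored as nested lists, the translation is a horizontal shift with zero padding, and Gaussian intensity noise plays the role of color noise. All function names and parameter values are hypothetical.

```python
import random

def augment(image, rng, max_shift=1, noise_std=0.05):
    """Randomly translate, flip and add noise to a 2-D grayscale image."""
    width = len(image[0])
    shift = rng.randint(-max_shift, max_shift)
    flip = rng.random() < 0.5
    out = []
    for row in image:
        # Horizontal translation with zero padding keeps the row length fixed.
        if shift > 0:
            row = [0.0] * shift + row[:width - shift]
        elif shift < 0:
            row = row[-shift:] + [0.0] * (-shift)
        else:
            row = list(row)
        if flip:
            row = row[::-1]
        # Additive Gaussian noise as a stand-in for color noise.
        out.append([v + rng.gauss(0.0, noise_std) for v in row])
    return out

def make_versions(image, rng, n_versions=4):
    """Draw n_versions independent augmentations of the same image."""
    return [augment(image, rng) for _ in range(n_versions)]

rng = random.Random(1)
image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
versions = make_versions(image, rng)
print(len(versions), len(versions[0]), len(versions[0][0]))
```

The four versions would then be passed through the network, and the unsupervised loss would compare their predictions.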

When 10% of the training data is used as the labeled set, the network converges in 20 epochs instead of the standard 90 epochs of the AlexNet model.

  • Even for a large dataset with many categories, the proposed unsupervised loss function improves the classification accuracy. For reference, the error rate of a single AlexNet model on the ImageNet validation set using all training data is 18.2%.


