Review: Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning

Semi-Supervised Learning Using Different Augmented Versions of the Same Sample and Different Perturbations of the Network

  • An unsupervised loss function is proposed to minimize the difference between the predictions of multiple passes of a training sample.
  • Multiple passes of a training sample consist of different augmented versions of the same sample and also different perturbations of the network.


  1. Proposed Unsupervised Loss Function
  2. Mutual-Exclusivity Loss Function
  3. Total Loss for Semi-Supervised Learning
  4. Experimental Results

1. Proposed Unsupervised Loss Function

  • Given any training sample, a model’s prediction should be the same under any random transformation of the data and perturbations to the model.
Transformation Using Data Augmentation (Figure from Author Slide, link at the end)
Disturbance Using Dropout (Figure from Author Slide, link at the end)
  • In each pass, each sample can be randomly transformed or the hidden nodes can be randomly activated.
  • As a result, the network’s prediction can be different for multiple passes of the same training sample.
  • The network’s prediction is expected to be the same despite transformations and disturbances.
  • The proposed loss function is completely unsupervised and can be used along with supervised training as a semi-supervised learning method.
  • A dataset with N training samples and C classes is used.
  • f^j(x_i) is the classifier’s prediction vector on the i-th training sample during the j-th pass through the network.
  • Each training sample is passed n times through the network.
  • T^j(x_i) is a random linear or non-linear transformation applied to the training sample x_i before the j-th pass through the network.
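  • The proposed loss function, the so-called TS loss, sums the squared differences between every pair of passes of each data sample:
      l_U^{TS} = \sum_{i=1}^{N} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} \| f^j(T^j(x_i)) - f^k(T^k(x_i)) \|_2^2
  • where TS stands for transformation/stability.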
  • (The concept of transforming the same sample with different data augmentations is similar to the ones in self-supervised learning nowadays.)
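
To make the TS loss concrete for a single sample, here is a minimal pure-Python sketch. It assumes the n prediction vectors (one per pass) have already been computed, and it sums the squared L2 distance over every unordered pair of passes. The function name `ts_loss` and the example numbers are made up for illustration and do not come from the paper.

```python
from itertools import combinations

def ts_loss(predictions):
    """Transformation/stability loss for one sample.

    predictions: list of n prediction vectors (one per pass),
    each a list of C floats.
    """
    total = 0.0
    # Every unordered pair (j, k) with j < k contributes its squared L2 distance.
    for p, q in combinations(predictions, 2):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total

# Hypothetical predictions from three passes of the same sample (C = 2).
passes = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]]
print(ts_loss(passes))
```

The loss is zero only when all passes agree, which is exactly the stability the bullet points above ask for.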

2. Mutual-Exclusivity Loss Function

  • The mutual-exclusivity loss function of [30] forces the classifier’s prediction vector to have only one non-zero element. This loss function naturally complements the transformation/stability loss function.
  • In supervised learning, each element of the prediction vector is pushed towards zero or one depending on the corresponding element in the label vector.
  • The TS loss proposed in Section 1 minimizes the l2-norm of the difference between predictions of multiple transformed versions of a sample, but it does not impose any restrictions on the individual elements of a single prediction vector. As a result, without labels, each prediction vector might be a trivial solution instead of a valid prediction.
  • (This collapsing issue also happens in self-supervised learning.)
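  • Suppose f^j_k(x_i) is the k-th element of the prediction vector f^j(x_i). The so-called ME loss accumulates the following term over all passes of each training sample x_i:
      l_U^{ME} = \sum_{i=1}^{N} \sum_{j=1}^{n} \left( -\sum_{k=1}^{C} f^j_k(x_i) \prod_{l=1, l \neq k}^{C} (1 - f^j_l(x_i)) \right)
  • where ME stands for mutual-exclusivity.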
  • In the experiments, the authors show that the combination of both loss functions leads to further improvements in the accuracy of the models.
  • The two loss functions are combined as the transformation/stability plus mutual-exclusivity loss function:
      l_U = \lambda_1 l_U^{ME} + \lambda_2 l_U^{TS}
  • where λ1=0.1 and λ2=1.
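
The ME loss and the weighted combination can be sketched in a few lines of pure Python. This is only an illustration for one sample: it assumes λ1 weights the ME term and λ2 weights the TS term (my reading of the paper), and the names `me_loss`, `ts_loss` and `unsupervised_loss` are hypothetical.

```python
import math
from itertools import combinations

def me_loss(prediction):
    """Mutual-exclusivity loss for one prediction vector of length C."""
    total = 0.0
    for k, f_k in enumerate(prediction):
        # Product of (1 - f_l) over every other element l != k.
        others = math.prod(1.0 - f_l for l, f_l in enumerate(prediction) if l != k)
        total += f_k * others
    return -total

def ts_loss(predictions):
    """Sum of squared L2 distances over all pairs of passes."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(predictions, 2)
    )

def unsupervised_loss(predictions, lambda_me=0.1, lambda_ts=1.0):
    """Weighted sum for one sample; the weight assignment is an assumption."""
    me = sum(me_loss(p) for p in predictions)  # ME is summed over the n passes
    return lambda_me * me + lambda_ts * ts_loss(predictions)

print(me_loss([1.0, 0.0, 0.0]))  # confident one-hot prediction
print(me_loss([0.5, 0.5]))       # undecided prediction
print(me_loss([0.0, 0.0]))       # degenerate all-zero prediction
```

A one-hot vector gives the lowest value of this term, while an undecided or all-zero vector scores worse, which is how the ME term discourages the trivial solutions that the TS loss alone would accept.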

3. Total Loss for Semi-Supervised Learning

  • The paper says that the TS loss and ME loss are used along with a supervised loss, but it does not clearly explain how. It seems to assume readers already know how to train with both labeled and unlabeled samples.
  • I later found in the authors’ presentation slides that the total loss is the sum of the supervised loss and the unsupervised loss:
      l = l_L + l_U
  • where l_L is the supervised loss and l_U is the unsupervised loss.
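
One way to read "sum of supervised and unsupervised loss" over a mini-batch is sketched below. This is a guess at the bookkeeping, not the authors' code: it assumes cross-entropy as the supervised loss, applies it to the first pass of each labeled sample only, and applies the unsupervised term to every sample. `total_loss` and `stand_in_unsup` are hypothetical names, and the stand-in unsupervised term is just the squared distance between the first two passes.

```python
import math

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def total_loss(batch, unsup_loss_fn):
    """batch: list of (passes, label) pairs; label is None when unlabeled."""
    supervised = 0.0
    unsupervised = 0.0
    for passes, label in batch:
        # Assumption: the unsupervised term uses every sample, labeled or not.
        unsupervised += unsup_loss_fn(passes)
        if label is not None:
            # Assumption: cross-entropy on the first pass of labeled samples.
            supervised += -math.log(passes[0][label])
    return supervised + unsupervised

def stand_in_unsup(passes):
    # Simplified stand-in for the unsupervised loss (two passes only).
    return squared_distance(passes[0], passes[1])

batch = [
    ([[0.5, 0.5], [0.5, 0.5]], 0),     # labeled sample, class 0
    ([[0.9, 0.1], [0.8, 0.2]], None),  # unlabeled sample
]
print(total_loss(batch, stand_in_unsup))
```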

4. Experimental Results

  • Two kinds of models are used: a convnet model, such as AlexNet, and a sparse convolutional network [39].
  • n=4 for the convnet and n=5 for the sparse convolutional network [39].
  • (For the sparse convolutional network architecture, please feel free to read the paper [39] directly. I don’t go into it here because the focus of this paper is the losses rather than the model architecture.)

4.1. MNIST

Error rates (%) on test set for MNIST (mean % ± std)
  • Sparse convolutional network [39] is used.
  • No data augmentation is used for this task. In other words, T^j(x_i) is the identity function. Only Dropout is used.
  • First, a model is trained on the labeled set only. Then, models are trained with the unsupervised loss functions added.
  • In separate experiments, the transformation/stability loss function, the mutual-exclusivity loss function, and the combination of both are added.
  • The state-of-the-art Maxout [40] obtains an error rate of 0.24% on MNIST using all training data without data augmentation.
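
To see why Dropout alone is enough to make repeated passes disagree, here is a toy pure-Python sketch. The tiny two-layer network, its weights, and the helper names are all invented for illustration (the paper uses a sparse convolutional network); the input is passed unchanged, so only the random Dropout mask varies between the n=5 passes.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_with_dropout(x, w_hidden, w_out, rng, drop_prob=0.5):
    # ReLU hidden layer.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    # Inverted dropout: zero a unit with probability drop_prob, rescale the rest.
    hidden = [0.0 if rng.random() < drop_prob else h / (1.0 - drop_prob)
              for h in hidden]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)

# Hypothetical toy weights: 3 inputs, 4 hidden units, 2 classes.
W_HIDDEN = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.4, 0.9], [0.2, 0.2, 0.2]]
W_OUT = [[1.0, -1.0, 0.5, 0.3], [-0.7, 0.6, 0.2, -0.4]]

rng = random.Random(0)
x = [1.0, 0.5, -0.5]  # same input every pass, i.e. the transformation is the identity
passes = [forward_with_dropout(x, W_HIDDEN, W_OUT, rng) for _ in range(5)]
for p in passes:
    print([round(v, 3) for v in p])
```

The resulting list of prediction vectors is what the unsupervised losses would then compare.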

4.2. SVHN and NORB

Semi-supervised learning vs. training with labeled data only. Left: SVHN, Right: NORB
  • NORB is a collection of stereo images in six classes.
  • A convnet model is used, consisting of two convolutional layers with 64 maps and kernel size 5, and two locally connected layers with 32 maps and kernel size 3.
  • Different ratios of labeled and unlabeled data are used. 1%, 5%, 10%, 20% and 100% of training samples are used as labeled data.
  • Random crop and random rotation are used as data augmentation.
Error on test data for SVHN and NORB with 1% and 100% of data (mean % ± std)
  • Dropout and random max-pooling are the only sources of variation in the above table.
  • In addition, on the NORB dataset, training with only 1% of the data labeled plus the unsupervised loss functions gives an accuracy close to that obtained when 100% of the data is labeled.

4.3. CIFAR10

Error rates on test data for CIFAR10 with 4000 labeled samples (mean % ± std)
  • Sparse convolutional network is used.
  • The first set of models is trained on labeled data only. The second set is trained on both the labeled and unlabeled sets, using the combination of both unsupervised loss functions.
  • A randomized mix of translations, rotations, flipping, stretching and shearing operations is used.
  • Dropout is also used.
  • When all of the labeled and unlabeled data are used, with transformation/stability plus the mutual-exclusivity loss function, 3.18% error rate is obtained, while state-of-the-art error rate for this dataset is 3.47% [39].
  • With a larger model (160n vs. 96n), 3.00% error rate is achieved.
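
The randomized transformations play the role of T^j: each pass sees a differently augmented copy of the same image. Below is a small pure-Python sketch covering only two of the listed operations, horizontal flipping and integer translation with zero padding, on an image stored as a list of rows. The function names and the `max_shift` parameter are hypothetical, and the paper's actual augmentation pipeline is richer than this.

```python
import random

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def translate(img, dx, dy, fill=0):
    """Shift the image by (dx, dy) pixels, padding uncovered pixels with fill."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out

def random_transform(img, rng, max_shift=1):
    """One random draw of the transformation: optional flip, then a small shift."""
    if rng.random() < 0.5:
        img = hflip(img)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(img, dx, dy)

rng = random.Random(0)
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
views = [random_transform(image, rng) for _ in range(4)]  # one view per pass
for v in views:
    print(v)
```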

4.4. CIFAR100

  • The state-of-the-art error rate for this dataset is 23.82% [39].
  • Sparse convolutional network is used.

4.5. ImageNet

Error rates (%) on ImageNet validation set (Top-5)
  • AlexNet is used.
  • Five labeled datasets (rep 1 to rep 5) are created from the available training samples. Each dataset consists of 10% of the training data.
  • Random translations, flipping, and color noise are used.
  • Two models are trained. One model is trained using labeled data only. The other model is trained on both the labeled and unlabeled sets using the transformation/stability plus mutual-exclusivity loss function.
  • At each iteration, four different transformed versions of each unlabeled sample are generated.
  • Even for a large dataset with many categories, the proposed unsupervised loss function improves the classification accuracy. The error rate of a single AlexNet model on ImageNet validation set using all training data is 18.2%.
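
The five 10% labeled subsets could be drawn along these lines. This is only a sketch under assumptions the review does not confirm: each rep is sampled independently and uniformly at random (so reps may overlap, and classes are not balanced explicitly). `make_labeled_subsets` is a hypothetical helper, and the dataset size in the example is a placeholder.

```python
import random

def make_labeled_subsets(num_samples, fraction=0.10, num_reps=5, seed=0):
    """Return num_reps sets of sample indices to treat as labeled."""
    rng = random.Random(seed)
    subset_size = round(num_samples * fraction)
    subsets = []
    for _ in range(num_reps):
        # Assumption: each rep is an independent draw without replacement.
        subsets.append(set(rng.sample(range(num_samples), subset_size)))
    return subsets

reps = make_labeled_subsets(num_samples=1000)  # placeholder dataset size
for i, labeled in enumerate(reps, start=1):
    unlabeled_count = 1000 - len(labeled)
    print(f"rep {i}: {len(labeled)} labeled, {unlabeled_count} unlabeled")
```

Within each rep, the indices outside the labeled set would be treated as unlabeled and only contribute to the unsupervised loss.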


