# Review: Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning

## Semi-Supervised Learning Using Different Augmented Versions of the Same Sample and Different Perturbations of the Network

Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning, Sajjadi NIPS’16, by University of Utah. 2016 NIPS, Over 500 Citations (Sik-Ho Tsang @ Medium)

Semi-Supervised Learning, Image Classification

- **An unsupervised loss function** is proposed to **minimize the difference between the predictions of multiple passes of a training sample**.
- Multiple passes of a training sample consist of **different augmented versions of the same sample** and also **different perturbations of the network.**

# Outline

1. **Proposed Unsupervised Loss Function**
2. **Mutual-Exclusivity Loss Function**
3. **Total Loss for Semi-Supervised Learning**
4. **Experimental Results**

# 1. Proposed Unsupervised Loss Function

- Given any training sample, **a model’s prediction should be the same under any random transformation of the data and perturbations to the model.**

The **transformations** can be any linear and non-linear **data augmentation** being used to extend the training data.

The **disturbances** include **Dropout** techniques and **randomized pooling** schemes.

- In each pass, each sample can be randomly transformed or the hidden nodes can be randomly activated.
- As a result, the network’s prediction can be different for multiple passes of the same training sample.
- **The network’s prediction is expected to be the same despite transformations and disturbances.**
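To make the point about multiple passes concrete, here is a minimal sketch in plain Python. It is not the paper's network: the toy linear classifier, the weights, and the helper names `softmax` and `stochastic_pass` are all assumptions made for illustration. It only shows that dropout alone is enough to make repeated passes of one sample disagree.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_pass(x, weights, rng, drop_prob=0.5):
    """One forward pass of a toy linear classifier with input dropout.

    Each input feature is zeroed with probability drop_prob and the
    survivors are rescaled (inverted dropout), so repeated calls on the
    same x generally give different prediction vectors.
    """
    keep = 1.0 - drop_prob
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * d for w, d in zip(row, dropped)) for row in weights]
    return softmax(logits)


rng = random.Random(0)
x = [1.0, -2.0, 0.5, 3.0]            # one toy sample (hypothetical values)
weights = [[0.2, -0.1, 0.4, 0.3],    # hypothetical 3-class weights
           [-0.3, 0.5, 0.1, -0.2],
           [0.1, 0.1, -0.4, 0.2]]

# n = 4 passes of the same sample through the perturbed model.
passes = [stochastic_pass(x, weights, rng) for _ in range(4)]
for j, p in enumerate(passes, start=1):
    print(f"pass {j}:", [round(v, 3) for v in p])
```

With `drop_prob=0.0` every pass returns the same vector, which is the behaviour the unsupervised loss pushes the perturbed network towards.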

An unsupervised loss function is introduced that **minimizes the mean squared differences between different passes of an individual training sample through the network.**

- The proposed loss function is completely unsupervised and can be used along with supervised training as a semi-supervised learning method.
- A dataset with *N* training samples and *C* classes is used.
- $f^j(x_i)$ is the **classifier’s prediction vector** on the *i*-th training sample during the *j*-th pass through the network.
- Each training sample is passed *n* times through the network.
- $T^j(x_i)$ is a **random linear or non-linear transformation** on the training sample $x_i$ before the *j*-th pass through the network.
- The proposed loss function, so called **TS loss**, compares every pair of passes of each data sample:

$$l_U^{TS} = \sum_{i=1}^{N} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} \left\| f^j(T^j(x_i)) - f^k(T^k(x_i)) \right\|_2^2$$

- where *TS* stands for **transformation/stability**.
- (The concept of transforming the same sample with different data augmentations is similar to that used in self-supervised learning nowadays.)
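As a rough illustration of the TS loss described above, the sketch below computes it for a single training sample: the squared l2 distance between the prediction vectors of every pair of passes, summed over all pairs. The function name `ts_loss` and the example prediction vectors are assumptions for illustration, not taken from the paper's code.

```python
from itertools import combinations


def ts_loss(predictions):
    """Transformation/stability loss for ONE training sample.

    predictions: list of n prediction vectors (lists of floats), one per
    stochastic pass. Returns the sum over all pairs (j, k), j < k, of the
    squared l2 distance between the two predictions.
    """
    total = 0.0
    for p, q in combinations(predictions, 2):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total


# Four identical passes: the loss vanishes.
identical = [[0.7, 0.2, 0.1]] * 4
print(ts_loss(identical))  # 0.0

# Four passes that disagree slightly: the loss is positive.
noisy = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
print(ts_loss(noisy))
```

With *n* passes there are *n*(*n*-1)/2 pairs, so *n*=4 gives 6 pairwise terms per sample.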

# 2. Mutual-Exclusivity Loss Function

- **Mutual-exclusivity loss function of [30]** forces the classifier’s prediction vector to have only one non-zero element. This loss function naturally complements the transformation/stability loss function.
- In supervised learning, each element of the prediction vector is pushed towards zero or one depending on the corresponding element in the label vector.
- **The proposed TS loss in Section 1** minimizes the l2-norm of the difference between predictions of multiple transformed versions of a sample, but **it does not impose any restrictions** on the individual elements of a single prediction vector. As a result, each prediction vector might be a **trivial solution** instead of a valid prediction due to lack of labels.
- (This collapsing issue also happens in self-supervised learning.)

Mutual-exclusivity loss function forces each prediction vector to be valid and prevents trivial solutions.

- Suppose $f_k^j(x_i)$ is the *k*-th element of prediction vector $f^j(x_i)$. This loss function for the training sample $x_i$, so called **ME loss**, is defined as follows:

$$l_U^{ME} = \sum_{i=1}^{N} \sum_{j=1}^{n} \left( -\sum_{k=1}^{C} f_k^j(x_i) \prod_{l=1,\, l \neq k}^{C} \left(1 - f_l^j(x_i)\right) \right)$$

- where **ME** stands for **mutual-exclusivity**.
- In the experiments, the authors show that the combination of both loss functions leads to further improvements in the accuracy of the models.
- **The combination of both loss functions** is the transformation/stability plus mutual-exclusivity loss function:

$$l_U = \lambda_1 l_U^{ME} + \lambda_2 l_U^{TS}$$

- where *λ*1=0.1 and *λ*2=1.
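The sketch below illustrates why the ME term complements the TS term. It assumes the ME loss for one prediction vector is the negative sum, over classes *k*, of the *k*-th element times the product of (1 minus every other element), and it weights the two terms with *λ*1=0.1 and *λ*2=1 as stated above. The function names and example vectors are hypothetical, and everything is computed for a single sample.

```python
import math
from itertools import combinations


def ts_loss(passes):
    """Sum of squared l2 distances between every pair of passes."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in combinations(passes, 2)
    )


def me_loss(prediction):
    """Mutual-exclusivity loss for ONE prediction vector.

    Computes -sum_k f_k * prod_{l != k} (1 - f_l). The value is -1 for a
    one-hot vector and closer to 0 for a flat, uncertain vector.
    """
    total = 0.0
    for k, f_k in enumerate(prediction):
        others = math.prod(
            1.0 - f_l for l, f_l in enumerate(prediction) if l != k
        )
        total += f_k * others
    return -total


def unsupervised_loss(passes, lam1=0.1, lam2=1.0):
    """lam1 * ME (summed over passes) + lam2 * TS, for one sample."""
    return lam1 * sum(me_loss(p) for p in passes) + lam2 * ts_loss(passes)


flat = [[1 / 3, 1 / 3, 1 / 3]] * 4   # constant but uninformative passes
one_hot = [[1.0, 0.0, 0.0]] * 4      # constant and decisive passes

# TS alone is zero in both cases, so it cannot rule out the flat solution.
print(ts_loss(flat), ts_loss(one_hot))
# Adding ME makes the decisive prediction strictly better (lower loss).
print(unsupervised_loss(flat), unsupervised_loss(one_hot))
```

Both toy inputs are perfectly stable across passes, so only the ME term separates them, which is the trivial-solution problem the section describes.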

# 3. Total Loss for Semi-Supervised Learning

- Although the paper says that TS loss and ME loss are used along with a supervised loss, it does not clearly describe how this is done. It seems to assume that readers know how a model is trained with both labeled and unlabeled samples.
- I later found in the author’s presentation slides that **the total loss is the sum of supervised loss and unsupervised loss:**

$$l = l_L + l_U$$

- where $l_L$ is the **supervised loss**.
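Since the paper leaves the details open, the following is only one plausible reading of "supervised loss plus unsupervised loss" on a mixed mini-batch. The choices below are assumptions, not the authors' stated procedure: cross-entropy as the supervised loss, applying the unsupervised term to every sample (labeled or not), and the helper names `cross_entropy`, `total_loss`, and `pairwise_sq_diff`.

```python
import math


def cross_entropy(prediction, label):
    """Supervised loss for one labeled sample (assumed: cross-entropy)."""
    return -math.log(max(prediction[label], 1e-12))


def total_loss(batch, unsup_fn):
    """l = l_L + l_U over a mixed mini-batch.

    batch: list of (passes, label) pairs, where passes is a list of
    prediction vectors for one sample and label is an int or None
    (None marks an unlabeled sample).
    unsup_fn: callable mapping a list of passes to a scalar l_U term.
    """
    l_L = 0.0
    l_U = 0.0
    for passes, label in batch:
        if label is not None:
            # Supervised term: only labeled samples contribute.
            l_L += sum(cross_entropy(p, label) for p in passes)
        # Unsupervised term: assumed here to apply to every sample.
        l_U += unsup_fn(passes)
    return l_L + l_U


def pairwise_sq_diff(passes):
    """Stand-in l_U: summed squared differences between all pass pairs."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(passes[j], passes[k]))
        for j in range(len(passes))
        for k in range(j + 1, len(passes))
    )


batch = [
    ([[0.9, 0.1], [0.8, 0.2]], 0),     # labeled sample, class 0
    ([[0.4, 0.6], [0.7, 0.3]], None),  # unlabeled sample
]
print(total_loss(batch, pairwise_sq_diff))
```

Passing the unsupervised term in as a callable keeps the sketch independent of which combination of TS and ME is used.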

# 4. Experimental Results

- Two kinds of models are used: one is a convnet model, such as AlexNet, and the other is the sparse convolutional network [39].
- *n*=4 for the convnet and *n*=5 for the sparse convolutional network [39].
- (For the sparse convolutional network architecture, please feel free to read the paper [39] directly. I don’t go into it because, for this paper, the focus should be on the losses rather than the model architecture.)

## 4.1. MNIST

- Sparse convolutional network [39] is used.
- **No data augmentation** for this task. In other words, $T^j(x_i)$ is the identity function. **Only Dropout is used.**
- First, a model is trained on the labeled set only (100 labeled samples). Then, models are trained with the unsupervised loss functions added.
- In separate experiments, the transformation/stability loss function, the mutual-exclusivity loss function, and the combination of both are added.
- The state-of-the-art **Maxout [40]** obtains an **error rate of 0.24%** on MNIST using all training data without data augmentation.

The combination of both loss functions reduces the error rate to **0.55%**, which was the state of the art for MNIST with **100 labeled samples** at that moment. It can be seen that **an accuracy close to the fully supervised result is achieved** by using only 100 labeled samples.

## 4.2. SVHN and NORB

- NORB is a collection of stereo images in six classes.
- Convnet model is used, which consists of **two convolutional layers** with 64 maps and kernel size of 5, and **two locally connected layers** with 32 maps and kernel size of 3.
- Different ratios of labeled and unlabeled data are used. 1%, 5%, 10%, 20% and 100% of training samples are used as labeled data.
- **Random crop** and **random rotation** are used as data augmentation.

When the number of labeled samples is small, the improvement is more significant. For example, when only 1% of the training samples are labeled, accuracy improves by about 2.5 times with the unsupervised loss functions.

As more labeled samples are added, the difference in accuracy between semi-supervised and supervised approaches becomes smaller.

**Dropout** and **random max-pooling** are the only sources of variation in the above table.

By using unsupervised loss functions, the accuracy of the classifier is significantly improved by trying to minimize the variation in prediction of the network.

- In addition, for NORB dataset, by using only 1% of labeled data and applying unsupervised loss functions, the accuracy obtained is close to the case when 100% of labeled data is used.

## 4.3. CIFAR10

- Sparse convolutional network is used.
- The first set of models is trained on labeled data only, and the other set of models is trained on the unlabeled set using a combination of both unsupervised loss functions in addition to the labeled set.
- A randomized mix of **translations, rotations, flipping, stretching and shearing operations** is used. **Dropout** is also used.

The combination of unsupervised loss functions on unlabeled data improves the accuracy of the models.

- When all of the labeled and unlabeled data are used, with transformation/stability plus the mutual-exclusivity loss function, 3.18% error rate is obtained, while state-of-the-art error rate for this dataset is 3.47% [39].
- With a larger model (160*n* vs. 96*n*), a 3.00% error rate is achieved.

## 4.4. CIFAR100

- The state-of-the-art error rate for this dataset is 23.82% [39].
- Sparse convolutional network is used.

The proposed loss function minimizes the randomness effect due to Dropout and max-pooling, and achieves a **21.43% error rate**.

## 4.5. ImageNet

- AlexNet is used.
- 5 labeled datasets (rep 1 to rep 5) are created from available training samples. Each dataset consists of 10% of training data.
- **Random translations, flipping, and color noise** are used.
- Two models are trained. One model is trained using labeled data only. The other model is trained on both labeled and unlabeled set using the transformation/stability plus mutual-exclusivity loss function.
- At each iteration, four different transformed versions of each unlabeled sample are generated.
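Generating several transformed versions of one unlabeled sample per iteration can be sketched as follows. This is a toy illustration only: images are plain nested lists, only flipping and horizontal translation are shown (color noise is omitted), and the helper names `random_flip`, `random_translate`, and `make_views` are hypothetical.

```python
import random


def random_flip(image, rng):
    """Mirror the image left-right with probability 0.5 (returns a copy)."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def random_translate(image, rng, max_shift=1):
    """Shift the image horizontally by up to max_shift pixels, zero-padded."""
    shift = rng.randint(-max_shift, max_shift)
    width = len(image[0])
    out = []
    for row in image:
        if shift > 0:
            out.append([0] * shift + row[:width - shift])
        elif shift < 0:
            out.append(row[-shift:] + [0] * (-shift))
        else:
            out.append(row[:])
    return out


def make_views(image, rng, n_views=4):
    """Return n_views independently transformed copies of one sample."""
    return [random_translate(random_flip(image, rng), rng)
            for _ in range(n_views)]


rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
views = make_views(image, rng)
for v in views:
    print(v)
```

Each of the four views would then be passed through the network, and their predictions compared by the unsupervised loss.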

When 10% of training data is used as labeled set, the network converges in 20 epochs instead of the standard 90 epochs of the AlexNet model.

- Even for a large dataset with many categories, the proposed unsupervised loss function improves the classification accuracy. The error rate of a single AlexNet model on ImageNet validation set using all training data is 18.2%.

## References

[2016 NIPS] [Sajjadi NIPS’16]

Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning

## Pretraining or Weakly/Semi-Supervised Learning

**2013** [Pseudo-Label (PL)] **2016** [Sajjadi NIPS’16] **2017** [Mean Teacher] **2018** [WSL] **2019** [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] **2020** [BiT] [Noisy Student] [SimCLRv2]