# Review — UDA: Unsupervised Data Augmentation for Consistency Training

## Unsupervised Data Augmentation (UDA) Constrains Model Predictions to Be **Invariant to Input Noise**


**Unsupervised Data Augmentation for Consistency Training** (UDA), by Google Research, Brain Team, and Carnegie Mellon University, 2020 NeurIPS, Over 800 Citations

Semi Supervised Learning, Data Augmentation, CNN, Image Classification, BERT, NLP, Language Model, Text Classification

- The use of **consistency training** on a large amount of unlabeled data constrains model predictions to be **invariant to input noise**. **Advanced data augmentation methods** play a crucial role in semi-supervised learning. **By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation**, UDA brings substantial improvements across six language and three vision tasks.

# Outline

1. **Preliminaries**
2. **Unsupervised Data Augmentation (UDA)**
3. **Experimental Results**

# 1. Supervised Data Augmentation

- Let *x* be the **input** and *y* be its **ground-truth prediction target**. A model *pθ*(*y*|*x*) is trained to predict *y* based on the input *x*, where *θ* denotes the **model parameters**. *pL*(*x*) and *pU*(*x*) are the **distributions of labeled and unlabeled examples** respectively, and *f** is the **perfect classifier** that we hope to learn.
- **Data augmentation** aims at creating novel and realistic-looking training data by **applying a transformation to an example without changing its label**.
- Formally, let *q*(^*x*|*x*) be the **augmentation transformation** from which one can **draw augmented examples** ^*x* based on an original example *x*.
- For an augmentation transformation to be valid, it is required that **any example** ^*x* ~ *q*(^*x*|*x*) **drawn from the distribution shares the same ground-truth label as** *x*. Given a valid augmentation transformation, we can simply **minimize the negative log-likelihood** on augmented examples.
- In supervised learning, data augmentation is mostly regarded as the “cherry on the cake” which provides a steady but limited performance boost.
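The supervised augmentation objective above can be sketched as follows, assuming a generic label-preserving `augment` function and a toy linear softmax classifier (all names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    # Toy label-preserving transformation: add small Gaussian jitter.
    return x + 0.01 * rng.standard_normal(x.shape)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll_on_augmented(theta, x, y, rng):
    # Negative log-likelihood -log p_theta(y | x_hat) on augmented inputs.
    x_hat = augment(x, rng)
    probs = softmax(x_hat @ theta)      # toy linear "model" p_theta(y|x)
    return -np.log(probs[np.arange(len(y)), y]).mean()

x = rng.standard_normal((4, 3))         # 4 examples, 3 features
y = np.array([0, 1, 1, 0])
theta = rng.standard_normal((3, 2))     # 2 classes
loss = nll_on_augmented(theta, x, y, rng)
```

Minimizing this loss over *θ* is exactly standard supervised training, just on ^*x* instead of *x*.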

# 2. Unsupervised Data Augmentation (UDA)

## 2.1. Loss

- The general form of these works, which utilize unlabeled examples to enforce smoothness of the model, can be summarized as follows:

- Given an input *x*, compute the **output distribution** *p*(*y*|*x*) given *x* and **a noised version** *p*(*y*|*x*, *ε*) by injecting a small noise *ε*. The noise can be applied to *x* or to hidden states.
- **Minimize a divergence metric between the two distributions** *D*(*p*(*y*|*x*) || *p*(*y*|*x*, *ε*)).
- This procedure enforces the model to be **insensitive to the noise** and hence **smoother with respect to changes in the input (or hidden) space**.

- In this work, a particular setting is focused on, i.e. ^*x* = *q*(*x*, *ε*), where the form or “quality” of the noising operation *q* can influence the performance of this consistency training framework.

- When jointly trained with labeled examples, **a weighting factor** *λ* (*λ*=1 mostly) is utilized to balance the **supervised cross-entropy loss** and the **unsupervised consistency training loss**, which is illustrated in the above figure.
- Formally, the **full objective** can be written as follows:

- where **CE** denotes **cross entropy**, *q*(^*x*|*x*) is a **data augmentation transformation**, and ~*θ* is a **fixed copy of the current parameters**, indicating that the gradient is not propagated through ~*θ*, as suggested by VAT.
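A minimal numpy sketch of this full objective, combining the supervised cross entropy with a *λ*-weighted KL consistency term on unlabeled data (the toy linear `model` and the stand-in augmentation are illustrative assumptions, not the paper's setup; in a real autodiff framework the clean prediction would be detached to mimic the fixed copy ~*θ*):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def model(theta, x):
    return softmax(x @ theta)            # toy linear classifier p_theta(y|x)

def uda_objective(theta, x_l, y_l, x_u, x_u_aug, lam=1.0):
    # Supervised term: cross entropy on labeled examples.
    p_l = model(theta, x_l)
    sup = -np.log(p_l[np.arange(len(y_l)), y_l]).mean()
    # Consistency term: KL(p_theta~(y|x) || p_theta(y|x_hat)).
    p_clean = model(theta, x_u)          # target; would be detached in autodiff code
    p_aug = model(theta, x_u_aug)
    kl = (p_clean * (np.log(p_clean) - np.log(p_aug))).sum(-1).mean()
    return sup + lam * kl

x_l = rng.standard_normal((4, 3))
y_l = np.array([0, 1, 0, 1])
x_u = rng.standard_normal((8, 3))
x_u_aug = x_u + 0.05 * rng.standard_normal(x_u.shape)  # stand-in augmentation
theta = rng.standard_normal((3, 2))
loss = uda_objective(x_l=x_l, y_l=y_l, x_u=x_u, x_u_aug=x_u_aug, theta=theta)
```

With *λ*=1, both terms contribute equally, matching the default weighting described above.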

## 2.2. Augmentation Strategies for Different Tasks

- **RandAugment for Image Classification**: In RandAugment, **augmentation methods are uniformly sampled from the same set of augmentation transformations in PIL**. Compared with AutoAugment, RandAugment is simpler and requires no labeled data as there is no need to search for optimal policies.
- **Back-Translation for Text Classification**: **Back-translation** refers to the procedure of **translating an existing example** *x* **in language A into another language B** and **then translating it back into A** to **obtain an augmented example** ^*x*. It is observed that back-translation can generate **diverse paraphrases** while preserving the semantics of the original sentences. Random sampling with a tunable temperature is used instead of beam search for the generation.
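The key idea of RandAugment, uniform sampling with no learned policy, can be sketched with toy stand-in operations (real RandAugment applies PIL transformations such as Rotate, Posterize and Solarize; the functions below are illustrative substitutes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the PIL transformation set.
def identity(img): return img
def flip_lr(img): return img[:, ::-1]
def invert(img): return 255 - img
def rot90(img): return np.rot90(img)

OPS = [identity, flip_lr, invert, rot90]

def rand_augment(img, n_ops=2, rng=rng):
    # Uniformly sample n_ops transformations and apply them in sequence.
    # No search over policies is needed (the simplification over AutoAugment).
    for op in rng.choice(OPS, size=n_ops):
        img = op(img)
    return img

img = rng.integers(0, 256, size=(8, 8), dtype=np.int64)
aug = rand_augment(img)
```

Because sampling is uniform and label-free, the same procedure can be applied to unlabeled data directly.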

## 2.3. Confidence-Based Masking

- It is helpful to mask out examples that the current model is not confident about.
- Specifically, in each minibatch, the consistency loss term is computed only on examples whose **highest probability among classification categories is greater than a threshold** *β*. *β*=0.8 for CIFAR-10 and SVHN, and *β*=0.5 for ImageNet.
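This masking step amounts to a simple boolean filter over the minibatch, sketched here with numpy (the logits are made-up illustrative values):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_mask(logits, beta=0.8):
    # Keep only examples whose top predicted probability exceeds beta
    # (beta = 0.8 for CIFAR-10/SVHN, 0.5 for ImageNet, per the paper).
    probs = softmax(logits)
    return probs.max(axis=-1) > beta

logits = np.array([[4.0, 0.0],    # confident prediction  -> kept
                   [0.2, 0.0]])   # unconfident prediction -> masked out
mask = confidence_mask(logits, beta=0.8)
```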

## 2.4. Sharpening Predictions

- Since regularizing the predictions to have low entropy has been shown to be beneficial, **predictions are sharpened** when computing the target distribution on unlabeled examples by **using a low Softmax temperature** *τ*.
- When combined with confidence-based masking, the loss on unlabeled examples in a minibatch *B* is:

- where *I*(·) is the indicator function and *zy* is the **logit** of label *y* for example *x*. *τ*=0.4 for CIFAR-10, SVHN and ImageNet.
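Putting masking and sharpening together, a sketch of the unlabeled-loss computation might look like the following (the target uses a temperature-*τ* softmax over the logits, i.e. probabilities proportional to exp(*zy*/*τ*); the logit values and the exact per-batch normalization are illustrative assumptions):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def unlabeled_loss(logits_clean, logits_aug, beta=0.8, tau=0.4):
    p_clean = softmax(logits_clean)             # p_theta~(y|x), fixed target
    mask = p_clean.max(axis=-1) > beta          # indicator I(max_y p > beta)
    target = softmax(logits_clean, tau=tau)     # sharpened: prop. to exp(z_y / tau)
    p_aug = softmax(logits_aug)                 # p_theta(y|x_hat)
    ce = -(target * np.log(p_aug)).sum(axis=-1) # cross entropy per example
    return (mask * ce).mean()                   # averaged over minibatch B

logits_clean = np.array([[4.0, 0.0], [0.2, 0.0]])
logits_aug = np.array([[3.0, 0.5], [0.1, 0.1]])
loss = unlabeled_loss(logits_clean, logits_aug)
```

A low *τ* (0.4) pushes the target distribution toward a near one-hot vector, which is what encourages low-entropy predictions.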

## 2.5. Domain-relevance Data Filtering

- A model is trained on the in-domain data to infer the labels of data in a large out-of-domain dataset and to pick out the examples that the model is most confident about.
- Specifically, for each category, all examples are sorted based on the classified probabilities of being in that category and the examples with the highest probabilities are selected.
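The per-category selection described above can be sketched as follows (the probability matrix and the `per_class` budget are illustrative; the paper selects enough examples per category to match the labeled data distribution):

```python
import numpy as np

def filter_by_confidence(probs, per_class=2):
    # probs: (N, C) predictions of the in-domain model on out-of-domain data.
    # For each category, sort examples by the predicted probability of that
    # category and keep the indices of the top `per_class` examples.
    keep = set()
    for c in range(probs.shape[1]):
        top = np.argsort(probs[:, c])[::-1][:per_class]
        keep.update(top.tolist())
    return sorted(keep)

probs = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.3, 0.7],
                  [0.4, 0.6],
                  [0.5, 0.5]])
selected = filter_by_confidence(probs, per_class=2)
```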

- Finally, the above figure shows the conceptual results of (e) UDA.

# 3. Experimental Results

## 3.1. Datasets

- For **language**, **six text classification benchmark datasets** are evaluated, including **IMDb**, **Yelp-2**, **Yelp-5**, **Amazon-2** and **Amazon-5** sentiment classification, and **DBPedia** topic classification.
- For **vision**, **two smaller datasets, CIFAR-10 and SVHN**, are employed to compare semi-supervised algorithms, as well as **ImageNet** at a larger scale to test the scalability of UDA.

## 3.2. Correlation between Supervised and Semi-supervised Performances

- The above 2 tables validate the idea that stronger data augmentations found in supervised learning can consistently lead to more gains when applied to semi-supervised learning settings.

## 3.3. Vary the Size of Labeled Data

- First, UDA consistently outperforms the two baselines given different sizes of labeled data.
- Moreover, the performance difference between UDA and VAT shows the superiority of data-augmentation-based noise.
- The difference between UDA and VAT is essentially the noise process. While **the noise produced by VAT often contains high-frequency artifacts that do not exist in real images**, data augmentation mostly generates diverse and realistic images.

## 3.4. Vary Model Architecture

- Given the same architecture, UDA outperforms all published results by significant margins and nearly matches the fully supervised performance, which uses 10× more labeled examples.

## 3.5. Text Classification Datasets

- In order to test whether UDA can be combined with the success of unsupervised representation learning, such as BERT, **four initialization schemes are further considered: (a) random Transformer; (b) BERT_BASE; (c) BERT_LARGE; (d) BERT_FINETUNE**: BERT_LARGE fine-tuned on in-domain unlabeled data.

- First, even with very few labeled examples, UDA can offer decent or even competitive performance compared to the SOTA model trained with the full supervised data.

- Particularly, on binary sentiment analysis tasks, with only 20 supervised examples, UDA outperforms the previous SOTA trained with full supervised data on IMDb and is competitive on Yelp-2 and Amazon-2.

- Second, UDA is complementary to transfer learning / representation learning.

- As we can see, when initialized with BERT and further finetuned on in-domain data, UDA can still significantly reduce the error rate from 6.50 to 4.20 on IMDb.

- Finally, it is noted that for five-category sentiment classification tasks, there still exists a clear gap between UDA with 500 labeled examples per class and BERT trained on the entire supervised set.

- Intuitively, five-category sentiment classifications are much more difficult than their binary counterparts.

## 3.6. Scalability Test on the ImageNet Dataset

- ResNet-50 is used.
- In both 10% and the full data settings,
**UDA consistently brings significant gains compared to the supervised baseline.**

- This shows UDA is not only able to scale but also able to utilize out-of-domain unlabeled examples to improve model performance.

## References

- [2020 NeurIPS] [UDA] Unsupervised Data Augmentation for Consistency Training
- [Google AI Blog] https://ai.googleblog.com/2019/07/advancing-semi-supervised-learning-with.html

## Pretraining or Weakly/Semi-Supervised Learning

**2004 … 2019** [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] **2020** [BiT] [Noisy Student] [SimCLRv2] [UDA]