Review — UDA: Unsupervised Data Augmentation for Consistency Training
Unsupervised Data Augmentation (UDA) Constrains Model Predictions to Be Invariant to Input Noise
Unsupervised Data Augmentation for Consistency Training
UDA, by Google Research, Brain Team, and Carnegie Mellon University
2020 NeurIPS, Over 800 Citations (Sik-Ho Tsang @ Medium)
Semi Supervised Learning, Data Augmentation, CNN, Image Classification, BERT, NLP, Language Model, Text Classification
- Consistency training on a large amount of unlabeled data constrains model predictions to be invariant to input noise. Advanced data augmentation methods play a crucial role in this form of semi-supervised learning.
- By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, UDA brings substantial improvements across six language and three vision tasks.
- Supervised Data Augmentation
- Unsupervised Data Augmentation (UDA)
- Experimental Results
1. Supervised Data Augmentation
- Let x be the input and y be its ground-truth prediction target.
- A model pθ(y|x) is trained to predict y based on the input x, where θ denotes the model parameters.
- pL(x) and pU(x) are the distributions of labeled and unlabeled examples respectively and f* is the perfect classifier that we hope to learn.
- Data augmentation aims at creating novel and realistic-looking training data by applying a transformation to an example, without changing its label.
- Formally, let q(^x|x) be the augmentation transformation from which one can draw augmented examples ^x based on an original example x.
- For an augmentation transformation to be valid, it is required that any example ^x~q(^x|x) drawn from the distribution shares the same ground-truth label as x. Given a valid augmentation transformation, we can simply minimize the negative log-likelihood on augmented examples.
- In supervised learning, data augmentation is mostly regarded as the “cherry on the cake” which provides a steady but limited performance boost.
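One way to write the negative log-likelihood objective on augmented examples, using the notation of this section (pL, f*, q and pθ), is sketched below. The exact form is an assumption made for illustration and is not quoted from the paper:

```latex
\min_{\theta} \; \mathbb{E}_{x \sim p_L(x)} \, \mathbb{E}_{\hat{x} \sim q(\hat{x} \mid x)}
  \left[ -\log p_{\theta}\big(f^{*}(x) \mid \hat{x}\big) \right]
```

The label f*(x) is that of the original example, which is only correct when the augmentation is valid in the sense described above.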
2. Unsupervised Data Augmentation (UDA)
- The general form of prior works that use unlabeled examples to enforce smoothness of the model can be summarized as follows:
- Given an input x, compute the output distribution p(y|x) and a noised version p(y|x, ε) obtained by injecting a small noise ε. The noise can be applied to x or to the hidden states.
- Minimize a divergence metric between the two distributions D(p(y|x) || p(y|x, ε)).
This procedure forces the model to be insensitive to the noise and hence smoother with respect to changes in the input (or hidden) space.
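The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the linear softmax classifier `toy_model`, the Gaussian input noise and the choice of KL divergence as D are all assumptions made for the example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_model(x, weights):
    """Hypothetical linear classifier: one weight row per class."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def consistency_loss(x, weights, noise_std, rng):
    """D(p(y|x) || p(y|x, eps)) with Gaussian noise eps added to the input."""
    clean = toy_model(x, weights)
    noised_x = [xi + rng.gauss(0.0, noise_std) for xi in x]
    noised = toy_model(noised_x, weights)
    return kl_divergence(clean, noised)


rng = random.Random(0)
weights = [[1.0, -0.5], [-1.0, 0.5], [0.2, 0.3]]  # 3 classes, 2 features
x = [0.7, -1.2]
loss = consistency_loss(x, weights, noise_std=0.1, rng=rng)
print(f"consistency loss: {loss:.6f}")
```

With `noise_std=0` the two distributions coincide and the loss is zero. UDA keeps this structure but replaces the simple noise with a data augmentation.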
- This work focuses on a particular setting in which the noise is injected into the input x, i.e. ^x=q(x, ε).
It studies how the form or “quality” of the noising operation q influences the performance of this consistency training framework.
- When jointly trained with labeled examples, a weighting factor λ (mostly λ=1) balances the supervised cross entropy and the unsupervised consistency training loss, as illustrated in the above figure.
- Formally, the full objective can be written as follows:
- where CE denotes cross entropy, q(^x|x) is a data augmentation transformation and ~θ is a fixed copy of the current parameters indicating that the gradient is not propagated through ~θ, as suggested by VAT.
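Using the notation defined so far (pL, pU, f*, q, λ, CE and ~θ), the full objective can be reconstructed as below. Treat it as a sketch assembled from the surrounding definitions and check it against the paper before relying on it:

```latex
\min_{\theta} \; \mathcal{J}(\theta) =
  \mathbb{E}_{x_1 \sim p_L(x)} \left[ -\log p_{\theta}\big(f^{*}(x_1) \mid x_1\big) \right]
  + \lambda \, \mathbb{E}_{x_2 \sim p_U(x)} \, \mathbb{E}_{\hat{x} \sim q(\hat{x} \mid x_2)}
    \left[ \mathrm{CE}\big( p_{\tilde{\theta}}(y \mid x_2) \,\|\, p_{\theta}(y \mid \hat{x}) \big) \right]
```

The first term is the supervised cross entropy on labeled examples. The second is the consistency loss on unlabeled examples, in which the prediction on the original input acts as a fixed target for the prediction on the augmented input.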
2.2. Augmentation Strategies for Different Tasks
- RandAugment for Image Classification: In RandAugment, augmentation methods are uniformly sampled from the same set of augmentation transformations in PIL. Compared with AutoAugment, RandAugment is simpler and requires no labeled data as there is no need to search for optimal policies.
- Back-Translation for Text Classification: Back-translation refers to the procedure of translating an existing example x in language A into another language B and then translating it back into A to obtain an augmented example ^x. As observed in prior work, back-translation can generate diverse paraphrases while preserving the semantics of the original sentences. Random sampling with a tunable temperature is used instead of beam search for the generation.
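The "random sampling with a tunable temperature" step in back-translation can be illustrated as follows. This is a toy sketch under stated assumptions: in a real system the logits come from the translation model's decoder at each step, whereas here the vocabulary, the logits and the helper names `temperature_probs` and `sample_token` are hypothetical.

```python
import math
import random


def temperature_probs(logits, temperature):
    """Softmax of logits / temperature; lower temperature gives a peakier distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical next-token logits over a 4-word vocabulary.
vocab = ["good", "great", "fine", "bad"]
logits = [2.0, 1.5, 0.5, -1.0]
rng = random.Random(0)
for t in (0.3, 0.9):
    draws = [vocab[sample_token(logits, t, rng)] for _ in range(10)]
    print(t, draws)
```

A temperature close to zero approaches greedy decoding, while a higher temperature yields more diverse (and potentially less faithful) paraphrases, which is why the temperature is treated as tunable.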
2.3. Confidence-Based Masking
- It is helpful to mask out examples that the current model is not confident about.
- Specifically, in each minibatch, the consistency loss term is computed only on examples whose highest probability among classification categories is greater than a threshold β. β=0.8 for CIFAR-10 and SVHN and β=0.5 for ImageNet.
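The masking rule above amounts to a per-example indicator. A minimal sketch, where `confidence_mask` is a hypothetical helper name and the probabilities are made up for illustration:

```python
def confidence_mask(prob_batch, beta):
    """Return 1.0 for examples whose highest class probability exceeds beta, else 0.0."""
    return [1.0 if max(probs) > beta else 0.0 for probs in prob_batch]


# Predicted class probabilities for three unlabeled examples.
batch = [
    [0.92, 0.05, 0.03],
    [0.40, 0.35, 0.25],
    [0.81, 0.10, 0.09],
]
mask = confidence_mask(batch, beta=0.8)
print(mask)  # [1.0, 0.0, 1.0]
```

Only the first and third examples would contribute to the consistency loss in this minibatch.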
2.4. Sharpening Predictions
- Since regularizing the predictions to have low entropy has been shown to be beneficial, predictions are sharpened when computing the target distribution on unlabeled examples by using a low Softmax temperature τ.
- When combined with confidence-based masking, the loss on unlabeled examples on a minibatch B is:
- where I(·) is the indicator function and z_y is the logit of label y for example x. τ=0.4 for CIFAR-10, SVHN and ImageNet.
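The combination of sharpening and confidence-based masking can be sketched as below. This is an illustrative reading of the description in this section, not the authors' code: the function names and the example logits are assumptions, and a real implementation would compute the target without propagating gradients through it.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (temperature < 1 sharpens the distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred):
    """CE(target || pred) for two discrete distributions."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0.0)


def uda_unsup_loss(orig_logits, aug_logits, beta, tau):
    """Masked, sharpened consistency loss averaged over the whole minibatch."""
    total = 0.0
    for z_orig, z_aug in zip(orig_logits, aug_logits):
        # Indicator: keep only examples the model is confident about.
        if max(softmax(z_orig)) <= beta:
            continue
        target = softmax(z_orig, temperature=tau)  # sharpened, treated as fixed
        pred = softmax(z_aug)                      # prediction on the augmented input
        total += cross_entropy(target, pred)
    return total / len(orig_logits)


# Logits for two unlabeled examples and their augmented versions.
orig_logits = [[4.0, 0.5, 0.0], [0.6, 0.5, 0.4]]
aug_logits = [[3.0, 1.0, 0.0], [0.2, 0.9, 0.1]]
loss = uda_unsup_loss(orig_logits, aug_logits, beta=0.8, tau=0.4)
print(f"unsupervised loss: {loss:.4f}")
```

In this example the second input is masked out because its highest probability is below β, so only the first contributes, yet the sum is still divided by the full minibatch size.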
2.5. Domain-relevance Data Filtering
- A model trained on the in-domain data is used to infer the labels of data in a large out-of-domain dataset, and the examples that the model is most confident about are picked out.
- Specifically, for each category, all examples are sorted based on the classified probabilities of being in that category and the examples with the highest probabilities are selected.
- Finally, the above figure shows the conceptual results of (e) UDA.
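The per-category selection used in domain-relevance data filtering can be sketched as follows. The helper name `select_in_domain`, the `per_class` budget and the probabilities are hypothetical. In practice the probabilities would come from the model trained on in-domain data.

```python
def select_in_domain(prob_batch, per_class):
    """For each class, return the indices of the `per_class` examples
    with the highest predicted probability of belonging to that class."""
    num_classes = len(prob_batch[0])
    selected = {}
    for c in range(num_classes):
        ranked = sorted(
            range(len(prob_batch)),
            key=lambda i: prob_batch[i][c],
            reverse=True,
        )
        selected[c] = ranked[:per_class]
    return selected


# Predicted probabilities for five out-of-domain examples, two classes.
probs = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.55, 0.45],
    [0.05, 0.95],
    [0.70, 0.30],
]
chosen = select_in_domain(probs, per_class=2)
print(chosen)  # {0: [0, 4], 1: [3, 1]}
```

Example 2, on which the model is least certain, is selected for neither class and would be dropped from the unlabeled set.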
3. Experimental Results
- For language, six text classification benchmark datasets are evaluated: IMDb, Yelp-2, Yelp-5, Amazon-2 and Amazon-5 for sentiment classification, and DBPedia for topic classification.
- For vision, two smaller datasets, CIFAR-10 and SVHN, are used to compare semi-supervised algorithms, and the larger-scale ImageNet is used to test the scalability of UDA.
3.2. Correlation between Supervised and Semi-supervised Performances
- The above 2 tables exhibit a strong correlation of an augmentation’s effectiveness between supervised and semi-supervised settings.
This validates the idea that stronger data augmentations found in supervised learning can always lead to more gains when applied to semi-supervised learning settings.
3.3. Vary the Size of Labeled Data
First, UDA consistently outperforms the two baselines given different sizes of labeled data.
Moreover, the performance difference between UDA and VAT shows the superiority of data-augmentation-based noise.
- The difference between UDA and VAT is essentially the noise process. While the noise produced by VAT often contains high-frequency artifacts that do not exist in real images, data augmentation mostly generates diverse and realistic images.
3.4. Vary Model Architecture
Given the same architecture, UDA outperforms all published results by significant margins and nearly matches the fully supervised performance, which uses 10× more labeled examples.
3.5. Text Classification Datasets
- In order to test whether UDA can be combined with the success of unsupervised representation learning, such as BERT, four initialization schemes are further considered: (a) a randomly initialized Transformer; (b) BERT_BASE; (c) BERT_LARGE; (d) BERT_FINETUNE: BERT_LARGE fine-tuned on in-domain unlabeled data.
First, even with very few labeled examples, UDA can offer decent or even competitive performances compared to the SOTA model trained with full supervised data.
- Particularly, on binary sentiment analysis tasks, with only 20 supervised examples, UDA outperforms the previous SOTA trained with full supervised data on IMDb and is competitive on Yelp-2 and Amazon-2.
Second, UDA is complementary to transfer learning / representation learning.
- When initialized with BERT and further fine-tuned on in-domain data, UDA can still significantly reduce the error rate on IMDb from 6.50 to 4.20, a relative reduction of about 35%.
Finally, it is noted that for five-category sentiment classification tasks, there still exists a clear gap between UDA with 500 labeled examples per class and BERT trained on the entire supervised set.
- Intuitively, five-category sentiment classifications are much more difficult than their binary counterparts.
3.6. Scalability Test on the ImageNet Dataset
- ResNet-50 is used.
- In both the 10% and the full data settings, UDA consistently brings significant gains compared to the supervised baseline.
This shows UDA is not only able to scale but also able to utilize out-of-domain unlabeled examples to improve model performance.
[2020 NeurIPS] [UDA]
Unsupervised Data Augmentation for Consistency Training
[Google AI Blog]
Pretraining or Weakly/Semi-Supervised Learning
2004 … 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] [SWA & Fast SWA] [S⁴L] 2020 [BiT] [Noisy Student] [SimCLRv2] [UDA]