Review — MixMatch: A Holistic Approach to Semi-Supervised Learning
Data Augmentation Using mixup for Semi-Supervised Learning
MixMatch, by Google Research
2019 NeurIPS, Over 1200 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification
- MixMatch is proposed, which guesses low-entropy labels for data-augmented unlabeled examples, and mixes labeled and unlabeled data using mixup.
Outline
- MixMatch
- Experimental Results
1. MixMatch
1.1. Combined Loss
- MixMatch is a “holistic” approach that incorporates ideas and components from the dominant paradigms of prior SSL work.
- Given a batch X of labeled examples with one-hot targets (representing one of L possible labels) and an equally-sized batch U of unlabeled examples:
MixMatch produces a processed batch of augmented labeled examples X0 and a batch of augmented unlabeled examples with “guessed” labels U0.
- U0 and X0 are then used in computing separate labeled and unlabeled loss terms.
- More formally, the combined loss L for semi-supervised learning is defined as:

L = LX + λU·LU,
LX = 1/|X’| · Σ over (x, p) in X’ of H(p, p_model(y | x; θ)),
LU = 1/(L·|U’|) · Σ over (u, q) in U’ of ||q − p_model(y | u; θ)||²₂

- where H(p, q) is the cross-entropy between distributions p and q.
- Thus, cross-entropy loss is used for the labeled set, and squared L2 loss is used between the predictions on unlabeled data and their guessed labels.
- T=0.5, K=2, α=0.75, and λU are hyperparameters described below.
- λU has different values for different datasets.
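The combined loss above can be sketched in NumPy. The function name, the small ε for numerical stability, and the λU = 75 default are illustrative choices for this sketch, not taken from the paper’s code:

```python
import numpy as np

def mixmatch_loss(labels, probs_x, guesses, probs_u, lambda_u=75.0):
    """Combined MixMatch loss: cross-entropy on the labeled batch plus a
    weighted squared-L2 (Brier-style) term on the unlabeled batch.
    `labels`/`guesses` are target distributions; `probs_x`/`probs_u` are
    model predictions. All arrays have shape (batch, num_classes)."""
    # Supervised term: H(p, q) averaged over the labeled batch.
    l_x = -np.mean(np.sum(labels * np.log(probs_x + 1e-12), axis=1))
    # Unsupervised term: squared L2 distance, divided by the number of
    # classes L as in the paper's L_U definition.
    num_classes = probs_u.shape[1]
    l_u = np.mean(np.sum((guesses - probs_u) ** 2, axis=1)) / num_classes
    return l_x + lambda_u * l_u
```

Because the unsupervised term is a bounded squared error rather than a cross-entropy, it is less sensitive to completely wrong guessed labels.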
1.2. Data Augmentation
- For each xb in the batch of labeled data X, a transformed version is generated: ^xb = Augment(xb) (algorithm 1, line 3).
- For each ub in the batch of unlabeled data U, K augmentations are generated: ^ub,k = Augment(ub) for k = 1, …, K (algorithm 1, line 5). These individual augmentations are used to generate a “guessed label” qb for each ub.
1.3. Label Guessing
- For each unlabeled example in U, MixMatch produces a “guess” for the example’s label using the model’s predictions. This guess is later used in the unsupervised loss term.
- To do so, the average of the model’s predicted class distributions across all K augmentations of ub is computed:

q̄b = 1/K · Σ over k = 1, …, K of p_model(y | ^ub,k; θ)

- Using data augmentation to obtain an artificial target for an unlabeled example is common in consistency regularization methods.
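The averaging step can be sketched as follows; `model_predict` and `augment` are placeholders for the reader’s own model and stochastic augmentation pipeline:

```python
import numpy as np

def guess_label(model_predict, u_b, augment, K=2):
    """Guess a label for unlabeled example u_b by averaging the model's
    predicted class distributions over K random augmentations.
    `model_predict` maps an input to a probability vector; `augment` is a
    stochastic transform (both are assumed, not from the paper's code)."""
    preds = [model_predict(augment(u_b)) for _ in range(K)]
    return np.mean(preds, axis=0)
```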
1.4. Sharpening
- Given the average prediction over augmentations q̄b, a sharpening function is applied to reduce the entropy of the label distribution. In practice, the common choice of sharpening function adjusts the “temperature” T of the categorical distribution:

Sharpen(p, T)_i = p_i^(1/T) / Σ over j = 1, …, L of p_j^(1/T)

- where p is some input categorical distribution (specifically in MixMatch, p is the average class prediction over augmentations), and T is a temperature hyperparameter.
- Lowering the temperature encourages the model to produce lower-entropy predictions; as T → 0, the output approaches a one-hot distribution.
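The sharpening formula translates directly to code:

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature sharpening: Sharpen(p, T)_i = p_i^(1/T) / sum_j p_j^(1/T).
    With T < 1 the distribution becomes peakier; as T -> 0 it approaches
    a one-hot distribution."""
    p = np.asarray(p, dtype=np.float64) ** (1.0 / T)
    return p / p.sum()
```

For example, with T = 0.5 the input [0.6, 0.4] is squared element-wise and renormalized, pushing more mass onto the dominant class.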
1.5. mixup
- A slightly modified version of mixup is used. mixup is a data augmentation technique originally proposed for supervised learning.
- For a pair of examples with their corresponding label probabilities (x1, p1) and (x2, p2), the mixed pair (x’, p’) is computed by:

λ ~ Beta(α, α)
λ’ = max(λ, 1 − λ)
x’ = λ’·x1 + (1 − λ’)·x2
p’ = λ’·p1 + (1 − λ’)·p2

- where α is the hyperparameter of the Beta distribution. Taking λ’ = max(λ, 1 − λ) is the modification: it keeps the mixed example closer to x1, so the output batches retain their original labeled/unlabeled ordering.
- (Please feel free to read mixup if interested.)
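MixMatch’s modified mixup can be sketched as:

```python
import numpy as np

def mixup(x1, p1, x2, p2, alpha=0.75, rng=None):
    """MixMatch's modified mixup: sample lam ~ Beta(alpha, alpha), then take
    lam' = max(lam, 1 - lam) so the mixed example stays closer to x1."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)            # the modification vs. vanilla mixup
    x_mix = lam * x1 + (1.0 - lam) * x2
    p_mix = lam * p1 + (1.0 - lam) * p2
    return x_mix, p_mix
```

Since λ’ ≥ 0.5, the first argument always dominates the mix, which is what makes it safe to mix labeled examples with unlabeled ones.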
- To apply mixup, all augmented labeled examples with their labels and all unlabeled examples with their guessed labels are first collected into:

^X = ((^xb, pb); b ∈ (1, …, B))
^U = ((^ub,k, qb); b ∈ (1, …, B), k ∈ (1, …, K))

- Then, these collections are concatenated and shuffled to form W, which serves as the data source for mixup:

W = Shuffle(Concat(^X, ^U))
- For the i-th example-label pair in ^X, mixup is applied with the i-th entry of W, and the result is added to the collection X’. The remaining entries of W are paired with ^U, where mixup is applied and the results are added to the collection U’.
Thus, MixMatch transforms X into X’, a collection of labeled examples which have had data augmentation and mixup (potentially mixed with an unlabeled example) applied.
Similarly, U is transformed into U’, a collection of multiple augmentations of each unlabeled example with corresponding label guesses.
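The shuffle-and-mix step can be sketched as below; `mixup_fn` is assumed to be a (x1, p1, x2, p2) → (x’, p’) function like the modified mixup above, and the list-based batch representation is an illustrative simplification:

```python
import numpy as np

def mixmatch_mix(xs, ps, us, qs, mixup_fn, rng=None):
    """Concatenate augmented labeled pairs (xs, ps) and unlabeled pairs
    (us, qs), shuffle them into W, then mix the labeled batch with the
    first |X| entries of W and the unlabeled batch with the remainder."""
    if rng is None:
        rng = np.random.default_rng()
    pool = list(zip(xs, ps)) + list(zip(us, qs))
    w = [pool[i] for i in rng.permutation(len(pool))]   # W = Shuffle(Concat)
    n = len(xs)
    x_prime = [mixup_fn(x, p, *w[i]) for i, (x, p) in enumerate(zip(xs, ps))]
    u_prime = [mixup_fn(u, q, *w[n + i]) for i, (u, q) in enumerate(zip(us, qs))]
    return x_prime, u_prime
```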
2. Experimental Results
- Wide ResNet WRN-28 is used as the network model.
CIFAR-10 (Left): MixMatch outperforms all other methods by a significant margin, for example reaching an error rate of 6.24% with 4000 labels.
SVHN (Right): MixMatch’s performance is relatively constant (and better than all other methods) across all amounts of labeled data.
In general, MixMatch matches or outperforms the best results from SWA [2].
On both training sets, MixMatch approaches the fully-supervised performance on the same training set almost immediately.
With 1000 examples, MixMatch surpasses both the state-of-the-art for 1000 examples as well as the state-of-the-art using all 5000 labeled examples.
- Using a similar EMA from Mean Teacher hurts MixMatch’s performance slightly.
Each component contributes to MixMatch’s performance.
- Dr. Ian Goodfellow is one of the authors of this paper.
Reference
[2019 NeurIPS] [MixMatch]
MixMatch: A Holistic Approach to Semi-Supervised Learning
Pretraining or Weakly/Semi-Supervised Learning
2004 [Entropy Minimization, EntMin] 2013 [Pseudo-Label (PL)] 2015 [Ladder Network, Γ-Model] 2016 [Sajjadi NIPS’16] 2017 [Mean Teacher] [PATE & PATE-G] [Π-Model, Temporal Ensembling] 2018 [WSL] [Oliver NeurIPS’18] 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] 2020 [BiT] [Noisy Student] [SimCLRv2]