Review — MixMatch: A Holistic Approach to Semi-Supervised Learning
Data Augmentation Using mixup for Semi-Supervised Learning
MixMatch, by Google Research
2019 NeurIPS, Over 1200 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification
- MixMatch is proposed, which guesses low-entropy labels for data-augmented unlabeled examples, and mixes labeled and unlabeled data using mixup.
Outline
- MixMatch
- Experimental Results
1. MixMatch
1.1. Combined Loss
- MixMatch is a “holistic” approach that incorporates ideas and components from the dominant paradigms of prior SSL work.
- Given a batch X of labeled examples with one-hot targets (representing one of L possible labels) and an equally-sized batch U of unlabeled examples:
MixMatch produces a processed batch of augmented labeled examples X0 and a batch of augmented unlabeled examples with “guessed” labels U0.
- U0 and X0 are then used in computing separate labeled and unlabeled loss terms.
- More formally, the combined loss L for semi-supervised learning is defined as:

L = LX + λU·LU,
LX = 1/|X’| · Σ over (x, p) in X’ of H(p, p_model(y | x; θ)),
LU = 1/(L·|U’|) · Σ over (u, q) in U’ of ||q − p_model(y | u; θ)||²₂

- where H(p, q) is the cross-entropy between distributions p and q.
- Thus, cross-entropy loss is used for the labeled set, and squared L2 loss is used between the predictions on unlabeled data and their guessed labels.
- T=0.5, K=2, α=0.75, and λU are hyperparameters described below.
- λU has different values for different datasets.
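The combined loss above can be sketched in NumPy. The function name, the small ε for numerical stability, and the λU = 75 default are illustrative choices for this sketch, not taken from the paper’s code:

```python
import numpy as np

def mixmatch_loss(labels, probs_x, guesses, probs_u, lambda_u=75.0):
    """Combined MixMatch loss: cross-entropy on the labeled batch plus a
    weighted squared-L2 (Brier-style) term on the unlabeled batch.
    `labels`/`guesses` are target distributions; `probs_x`/`probs_u` are
    model predictions. All arrays have shape (batch, num_classes)."""
    # Supervised term: H(p, q) averaged over the labeled batch.
    l_x = -np.mean(np.sum(labels * np.log(probs_x + 1e-12), axis=1))
    # Unsupervised term: squared L2 distance, divided by the number of
    # classes L as in the paper's L_U definition.
    num_classes = probs_u.shape[1]
    l_u = np.mean(np.sum((guesses - probs_u) ** 2, axis=1)) / num_classes
    return l_x + lambda_u * l_u
```

Because the unsupervised term is a bounded squared error rather than a cross-entropy, it is less sensitive to completely wrong guessed labels.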
1.2. Data Augmentation
- For each xb in the batch of labeled data X, a transformed version is generated: ^xb = Augment(xb) (algorithm 1, line 3).
- For each ub in the batch of unlabeled data U, K augmentations are generated: ^ub,k = Augment(ub) for k = 1, …, K (algorithm 1, line 5). These individual augmentations are used to generate a “guessed label” qb for each ub.
1.3. Label Guessing
- For each unlabeled example in U, MixMatch produces a “guess” for the example’s label using the model’s predictions. This guess is later used in the unsupervised loss term.
- To do so, the average of the model’s predicted class distributions across all K augmentations of ub is computed:

q̄b = 1/K · Σ over k = 1, …, K of p_model(y | ^ub,k; θ)

- Using data augmentation to obtain an artificial target for an unlabeled example is common in consistency regularization methods.
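The averaging step can be sketched as follows; `model_predict` and `augment` are placeholders for the reader’s own model and stochastic augmentation pipeline:

```python
import numpy as np

def guess_label(model_predict, u_b, augment, K=2):
    """Guess a label for unlabeled example u_b by averaging the model's
    predicted class distributions over K random augmentations.
    `model_predict` maps an input to a probability vector; `augment` is a
    stochastic transform (both are assumed, not from the paper's code)."""
    preds = [model_predict(augment(u_b)) for _ in range(K)]
    return np.mean(preds, axis=0)
```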
1.4. Sharpening
- Given the average prediction over augmentations q̄b, a sharpening function is applied to reduce the entropy of the label distribution. In practice, the common choice of sharpening function adjusts the “temperature” T of the categorical distribution:

Sharpen(p, T)_i = p_i^(1/T) / Σ over j = 1, …, L of p_j^(1/T)

- where p is some input categorical distribution (specifically in MixMatch, p is the average class prediction over augmentations), and T is a temperature hyperparameter.
- Lowering the temperature encourages the model to produce lower-entropy predictions; as T → 0, the output approaches a one-hot distribution.
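The sharpening formula translates directly to code:

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature sharpening: Sharpen(p, T)_i = p_i^(1/T) / sum_j p_j^(1/T).
    With T < 1 the distribution becomes peakier; as T -> 0 it approaches
    a one-hot distribution."""
    p = np.asarray(p, dtype=np.float64) ** (1.0 / T)
    return p / p.sum()
```

For example, with T = 0.5 the input [0.6, 0.4] is squared element-wise and renormalized, pushing more mass onto the dominant class.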
1.5. mixup
- A slightly modified version of mixup is used. mixup is a data augmentation technique originally proposed for supervised learning.
- For a pair of examples with their corresponding label probabilities (x1, p1) and (x2, p2), the mixed pair (x’, p’) is computed by:

λ ~ Beta(α, α)
λ’ = max(λ, 1 − λ)
x’ = λ’·x1 + (1 − λ’)·x2
p’ = λ’·p1 + (1 − λ’)·p2

- where α is the hyperparameter of the Beta distribution. Taking λ’ = max(λ, 1 − λ) is the modification: it keeps the mixed example closer to x1, so the output batches retain their original labeled/unlabeled ordering.
- (Please feel free to read mixup if interested.)
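MixMatch’s modified mixup can be sketched as:

```python
import numpy as np

def mixup(x1, p1, x2, p2, alpha=0.75, rng=None):
    """MixMatch's modified mixup: sample lam ~ Beta(alpha, alpha), then take
    lam' = max(lam, 1 - lam) so the mixed example stays closer to x1."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)            # the modification vs. vanilla mixup
    x_mix = lam * x1 + (1.0 - lam) * x2
    p_mix = lam * p1 + (1.0 - lam) * p2
    return x_mix, p_mix
```

Since λ’ ≥ 0.5, the first argument always dominates the mix, which is what makes it safe to mix labeled examples with unlabeled ones.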
- To apply mixup, all augmented labeled examples with their labels and all unlabeled examples with their guessed labels are first collected into:

^X = ((^xb, pb); b ∈ (1, …, B))
^U = ((^ub,k, qb); b ∈ (1, …, B), k ∈ (1, …, K))

- Then, these collections are concatenated and shuffled to form W, which serves as the data source for mixup:

W = Shuffle(Concat(^X, ^U))
- For the i-th example-label pair in ^X, mixup is applied with the i-th entry of W, and the result is added to the collection X’. The remaining entries of W are paired with ^U, where mixup is applied and the results are added to the collection U’.
Thus, MixMatch transforms X into X’, a collection of labeled examples which have had data augmentation and mixup (potentially mixed with an unlabeled example) applied.
Similarly, U is transformed into U’, a collection of multiple augmentations of each unlabeled example with corresponding label guesses.
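The shuffle-and-mix step can be sketched as below; `mixup_fn` is assumed to be a (x1, p1, x2, p2) → (x’, p’) function like the modified mixup above, and the list-based batch representation is an illustrative simplification:

```python
import numpy as np

def mixmatch_mix(xs, ps, us, qs, mixup_fn, rng=None):
    """Concatenate augmented labeled pairs (xs, ps) and unlabeled pairs
    (us, qs), shuffle them into W, then mix the labeled batch with the
    first |X| entries of W and the unlabeled batch with the remainder."""
    if rng is None:
        rng = np.random.default_rng()
    pool = list(zip(xs, ps)) + list(zip(us, qs))
    w = [pool[i] for i in rng.permutation(len(pool))]   # W = Shuffle(Concat)
    n = len(xs)
    x_prime = [mixup_fn(x, p, *w[i]) for i, (x, p) in enumerate(zip(xs, ps))]
    u_prime = [mixup_fn(u, q, *w[n + i]) for i, (u, q) in enumerate(zip(us, qs))]
    return x_prime, u_prime
```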
2. Experimental Results
- Wide ResNet WRN-28 is used as the network model.
CIFAR-10 (Left): MixMatch outperforms all other methods by a significant margin, for example reaching an error rate of 6.24% with 4000 labels.
SVHN (Right): MixMatch’s performance is relatively constant (and better than all other methods) across all amounts of labeled data.
In general, MixMatch matches or outperforms the best results from SWA [2].
On both training sets, MixMatch approaches the fully-supervised performance on the same training set almost immediately.
With 1000 examples, MixMatch surpasses both the state-of-the-art for 1000 examples as well as the state-of-the-art using all 5000 labeled examples.
- Using a similar EMA from Mean Teacher hurts MixMatch’s performance slightly.
Each component contributes to MixMatch’s performance.
- Dr. Ian Goodfellow is one of the authors of this paper.
Reference
[2019 NeurIPS] [MixMatch]
MixMatch: A Holistic Approach to Semi-Supervised Learning
Pretraining or Weakly/Semi-Supervised Learning
2004 [Entropy Minimization, EntMin] 2013 [Pseudo-Label (PL)] 2015 [Ladder Network, Γ-Model] 2016 [Sajjadi NIPS’16] 2017 [Mean Teacher] [PATE & PATE-G] [Π-Model, Temporal Ensembling] 2018 [WSL] [Oliver NeurIPS’18] 2019 [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] 2020 [BiT] [Noisy Student] [SimCLRv2]