# Review — MixMatch: A Holistic Approach to Semi-Supervised Learning

## Data Augmentation Using mixup for Semi-Supervised Learning


MixMatch: A Holistic Approach to Semi-Supervised Learning, by Google Research

MixMatch, 2019 NeurIPS, Over 1200 Citations (Sik-Ho Tsang @ Medium)

Semi-Supervised Learning, Image Classification

- MixMatch is proposed, which **guesses low-entropy labels** for **data-augmented unlabeled examples**, and **mixes labeled and unlabeled data using mixup**.

# Outline

1. **MixMatch**
2. **Experimental Results**

# 1. MixMatch

## 1.1. Combined Loss

- MixMatch is a “holistic” approach which incorporates ideas and components from the dominant paradigms for prior SSL.
- Given **a batch *X* of labeled examples with one-hot targets** (representing one of *L* possible labels) and **an equally-sized batch *U* of unlabeled examples**:

MixMatch produces a processed batch of augmented labeled examples *X*’ and a batch of augmented unlabeled examples with “guessed” labels *U*’.

- *X*’ and *U*’ are then used in **computing separate labeled and unlabeled loss terms.**
- More formally, the **combined loss** *ℒ* for semi-supervised learning is defined:

- where *H*(*p*, *q*) is the **cross-entropy between distributions *p* and *q***.
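The combined loss referred to above, as given in the MixMatch paper, can be written as follows. Here $p_{\text{model}}(y \mid x; \theta)$ denotes the model's predicted class distribution for input $x$ with parameters $\theta$, and $L$ is the number of possible labels:

```latex
\begin{aligned}
\mathcal{X}', \mathcal{U}' &= \operatorname{MixMatch}(\mathcal{X}, \mathcal{U}, T, K, \alpha) \\
\mathcal{L}_{\mathcal{X}} &= \frac{1}{|\mathcal{X}'|} \sum_{x, p \in \mathcal{X}'} H\big(p,\, p_{\text{model}}(y \mid x; \theta)\big) \\
\mathcal{L}_{\mathcal{U}} &= \frac{1}{L\,|\mathcal{U}'|} \sum_{u, q \in \mathcal{U}'} \big\| q - p_{\text{model}}(y \mid u; \theta) \big\|_2^2 \\
\mathcal{L} &= \mathcal{L}_{\mathcal{X}} + \lambda_{\mathcal{U}}\, \mathcal{L}_{\mathcal{U}}
\end{aligned}
```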

Thus, **cross-entropy loss** is used for the **labeled set**, and the **squared L2 loss** between the model's **predictions and guessed labels** is used for the unlabeled set.

*T*, *K*, *α*, and *λU* are hyperparameters described below. *T* = 0.5, *K* = 2, and *α* = 0.75 are used, while *λU* has different values for different datasets.
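As a rough illustration of how the two loss terms combine, here is a minimal pure-Python sketch. The function names (`cross_entropy`, `squared_l2`, `combined_loss`), the toy batch, and the value `lambda_u=100.0` are assumptions made for this example only; they are not taken from the paper's code.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def squared_l2(q, pred):
    """Squared L2 distance between a guessed label and a prediction."""
    return sum((qi - pi) ** 2 for qi, pi in zip(q, pred))

def combined_loss(labeled, unlabeled, lambda_u):
    """labeled: list of (target p, prediction); unlabeled: list of (guess q, prediction)."""
    num_classes = len(labeled[0][0])
    # Labeled term: mean cross-entropy over the labeled batch.
    loss_x = sum(cross_entropy(p, pred) for p, pred in labeled) / len(labeled)
    # Unlabeled term: squared L2, normalized by batch size and number of classes.
    loss_u = sum(squared_l2(q, pred) for q, pred in unlabeled)
    loss_u /= num_classes * len(unlabeled)
    return loss_x + lambda_u * loss_u, loss_x, loss_u

# Toy two-class batch; every number below is made up for illustration.
labeled = [([1.0, 0.0], [0.5, 0.5])]
unlabeled = [([0.8, 0.2], [0.6, 0.4])]
total, loss_x, loss_u = combined_loss(labeled, unlabeled, lambda_u=100.0)
```

Unlike cross-entropy, the squared L2 term is bounded, which makes it less sensitive to confidently wrong guessed labels.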

## 1.2. Data Augmentation

- For each *xb* in the batch of labeled data *X*, a transformed version is generated: *x̂b* = Augment(*xb*) (Algorithm 1, line 3).

For each *ub* in the batch of unlabeled data *U*, *K* augmentations are generated: *ûb,k* = Augment(*ub*), where *k* is from 1 to *K* (Algorithm 1, line 5). These individual augmentations are used to **generate a “guessed label” *qb*** for each *ub*.
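The augmentation step can be sketched as below. This is only an illustration: `augment` is a hypothetical stand-in (a random horizontal flip on a small 2-D list), not the augmentation pipeline used in the paper, and `augment_batches` is a name chosen for this example.

```python
import random

def augment(image, rng):
    """Hypothetical stand-in for Augment(.): random horizontal flip of a 2-D image."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

def augment_batches(labeled_x, unlabeled_u, K, rng):
    """One augmentation per labeled example, K augmentations per unlabeled example."""
    x_hat = [augment(x, rng) for x in labeled_x]
    u_hat = [[augment(u, rng) for _ in range(K)] for u in unlabeled_u]
    return x_hat, u_hat

rng = random.Random(0)
labeled_x = [[[1, 2], [3, 4]]]
unlabeled_u = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]
x_hat, u_hat = augment_batches(labeled_x, unlabeled_u, K=2, rng=rng)
```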

## 1.3. Label Guessing

- For each unlabeled example in *U*, **MixMatch produces a “guess” for the example’s label** using the model’s predictions. This guess is later used in the unsupervised loss term.
- To do so, **the average of the model’s predicted class distributions across all the *K* augmentations of *ub*** is computed by:
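The averaging step described above, as given in the MixMatch paper, is shown below. The paper writes the averaged prediction as $\bar{q}_b$ and reserves $q_b$ for the guessed label obtained after sharpening:

```latex
\bar{q}_b = \frac{1}{K} \sum_{k=1}^{K} p_{\text{model}}\big(y \mid \hat{u}_{b,k}; \theta\big)
```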

Using data augmentation to obtain an artificial target for an unlabeled example is common in consistency regularization methods.

## 1.4. Sharpening

- Given the average prediction over augmentations *q̄b*, **a sharpening function is applied to reduce the entropy of the label distribution.** In practice, the sharpening function uses the common approach of adjusting the **“temperature” *T*** of this categorical distribution:

- where *p* is some input categorical distribution (specifically in MixMatch, *p* is the **average class prediction over augmentations**), and *T* is a **temperature** hyperparameter.
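The sharpening function referred to above, as given in the MixMatch paper, is shown below, with $L$ the number of classes:

```latex
\operatorname{Sharpen}(p, T)_i := \frac{p_i^{1/T}}{\sum_{j=1}^{L} p_j^{1/T}}
```

As $T \to 0$, the output approaches a one-hot distribution.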

Lowering the temperature encourages the model to produce lower-entropy predictions.
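To make label guessing and sharpening concrete, here is a small pure-Python sketch. The helper names (`guess_label`, `sharpen`, `entropy`) and the two prediction vectors are assumptions chosen for illustration; in practice the predictions would come from the model.

```python
import math

def guess_label(predictions):
    """Average K predicted class distributions (one per augmentation)."""
    k = len(predictions)
    return [sum(col) / k for col in zip(*predictions)]

def sharpen(p, T):
    """Raise each probability to the power 1/T and renormalize."""
    powered = [pi ** (1.0 / T) for pi in p]
    total = sum(powered)
    return [v / total for v in powered]

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Made-up predictions for K = 2 augmentations of one unlabeled example.
preds = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
q_bar = guess_label(preds)   # approximately [0.7, 0.2, 0.1]
q = sharpen(q_bar, T=0.5)    # approximately [0.907, 0.074, 0.019]
```

With *T* = 0.5 each probability is squared before renormalizing, so the dominant class gains mass and the entropy of `q` is lower than that of `q_bar`.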

## 1.5. mixup

- A slightly modified version of mixup is used. mixup is a data augmentation technique originally used in supervised learning.
- For a pair of two examples with their corresponding label probabilities (*x*1, *p*1), (*x*2, *p*2), (*x*’, *p*’) is computed by:

- where *α* is a hyperparameter for the beta distribution.
- (Please feel free to read mixup if interested.)
- To apply mixup, all augmented labeled examples with their labels and all unlabeled examples with their guessed labels are first collected into:

- Then, these collections are concatenated and shuffled to form *W*, which will serve as a data source for mixup:

- For each *i*-th example-label pair in *X̂*, mixup is applied using *W* and the result is added to the collection *X*’. The remainder of *W* is used for *Û*, where mixup is applied and the result is added to the collection *U*’.
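The formulas for the steps above, as given in the MixMatch paper, are collected below. Here $B$ is the batch size, and the "slight modification" to mixup is the $\max$ in the second line, which keeps each mixed example closer to its first argument than to its second:

```latex
\begin{aligned}
\lambda &\sim \operatorname{Beta}(\alpha, \alpha) \\
\lambda' &= \max(\lambda,\, 1 - \lambda) \\
x' &= \lambda' x_1 + (1 - \lambda')\, x_2 \\
p' &= \lambda' p_1 + (1 - \lambda')\, p_2 \\[6pt]
\hat{\mathcal{X}} &= \big((\hat{x}_b, p_b);\ b \in (1, \ldots, B)\big) \\
\hat{\mathcal{U}} &= \big((\hat{u}_{b,k}, q_b);\ b \in (1, \ldots, B),\ k \in (1, \ldots, K)\big) \\
\mathcal{W} &= \operatorname{Shuffle}\big(\operatorname{Concat}(\hat{\mathcal{X}}, \hat{\mathcal{U}})\big) \\
\mathcal{X}' &= \big(\operatorname{MixUp}(\hat{\mathcal{X}}_i, \mathcal{W}_i);\ i \in (1, \ldots, |\hat{\mathcal{X}}|)\big) \\
\mathcal{U}' &= \big(\operatorname{MixUp}(\hat{\mathcal{U}}_i, \mathcal{W}_{i + |\hat{\mathcal{X}}|});\ i \in (1, \ldots, |\hat{\mathcal{U}}|)\big)
\end{aligned}
```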

Thus, MixMatch transforms *X* into *X*’, a collection of labeled examples which have had data augmentation and mixup (potentially mixed with an unlabeled example) applied.

Similarly, *U* is transformed into *U*’, a collection of multiple augmentations of each unlabeled example with corresponding label guesses.
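The mixing step can be sketched in pure Python as follows. This is an illustrative sketch under simplifying assumptions: inputs are flat lists of floats rather than images, and the names `mixup` and `mix_collections` as well as the toy data are invented for this example.

```python
import random

def mixup(pair1, pair2, alpha, rng):
    """Mix two (input, label-distribution) pairs; inputs are flat lists of floats."""
    (x1, p1), (x2, p2) = pair1, pair2
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # MixMatch's tweak: stay closer to the first pair
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    p = [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]
    return x, p

def mix_collections(x_hat, u_hat, alpha, rng):
    """Build W by concatenating and shuffling, then mix it into x_hat and u_hat."""
    w = x_hat + u_hat          # new list, so the inputs are left untouched
    rng.shuffle(w)
    # The first len(x_hat) entries of W are mixed with the labeled pairs ...
    x_prime = [mixup(pair, w[i], alpha, rng) for i, pair in enumerate(x_hat)]
    # ... and the remainder of W is mixed with the unlabeled pairs.
    u_prime = [mixup(pair, w[len(x_hat) + i], alpha, rng)
               for i, pair in enumerate(u_hat)]
    return x_prime, u_prime

rng = random.Random(0)
# Two labeled pairs and four unlabeled pairs (two examples, K = 2); all values made up.
x_hat = [([0.0, 0.0], [1.0, 0.0]), ([1.0, 1.0], [0.0, 1.0])]
u_hat = [([0.2, 0.4], [0.9, 0.1]), ([0.3, 0.5], [0.9, 0.1]),
         ([0.6, 0.8], [0.2, 0.8]), ([0.7, 0.9], [0.2, 0.8])]
x_prime, u_prime = mix_collections(x_hat, u_hat, alpha=0.75, rng=rng)
```

Because `lam` is at least 0.5 after the `max`, each entry of `x_prime` stays dominated by its labeled example and each entry of `u_prime` by its unlabeled example, so the two loss terms can still be computed separately.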

# 2. Experimental Results

- Wide ResNet WRN-28 is used as the network model.

**CIFAR-10 (Left)**: **MixMatch outperforms all other methods** by a significant margin, for example reaching **an error rate of 6.24% with 4000 labels**.

**SVHN (Right)**: **MixMatch’s performance is relatively constant** (and better than all other methods) across all amounts of labeled data.

In general, on CIFAR-10 and CIFAR-100 with a larger model, MixMatch matches or outperforms the best results from SWA [2].

On both training sets (SVHN and SVHN+Extra), MixMatch nearly matches the fully-supervised performance on the same training set almost immediately.

On STL-10 with 1000 examples, MixMatch surpasses both the state-of-the-art for 1000 examples as well as the state-of-the-art using all 5000 labeled examples.

- Using a similar EMA from Mean Teacher hurts MixMatch’s performance slightly.

Each component contributes to MixMatch’s performance.

- Dr. Ian Goodfellow is one of the authors for this paper.

## Reference

[2019 NeurIPS] [MixMatch]

MixMatch: A Holistic Approach to Semi-Supervised Learning

## Pretraining or Weakly/Semi-Supervised Learning

**2004** [Entropy Minimization, EntMin] **2013** [Pseudo-Label (PL)] **2015** [Ladder Network, Γ-Model] **2016** [Sajjadi NIPS’16] **2017** [Mean Teacher] [PATE & PATE-G] [Π-Model, Temporal Ensembling] **2018** [WSL] [Oliver NeurIPS’18] **2019** [VAT] [Billion-Scale] [Label Propagation] [Rethinking ImageNet Pre-training] [MixMatch] **2020** [BiT] [Noisy Student] [SimCLRv2]