Review — MixMatch: A Holistic Approach to Semi-Supervised Learning

Data Augmentation Using mixup for Semi-Supervised Learning

Sik-Ho Tsang
5 min read · May 2, 2022

MixMatch: A Holistic Approach to Semi-Supervised Learning, by Google Research
2019 NeurIPS, Over 1200 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Image Classification

  • MixMatch is proposed, which guesses low-entropy labels for data-augmented unlabeled examples, and mixes labeled and unlabeled data using mixup.


  1. MixMatch
  2. Experimental Results

1. MixMatch

1.1. Combined Loss

MixMatch: Augmentation, Averaging and Sharpening
  • MixMatch is a “holistic” approach which incorporates ideas and components from the dominant paradigms of prior SSL work.
  • Given a batch X of labeled examples with one-hot targets (representing one of L possible labels) and an equally-sized batch U of unlabeled examples:

MixMatch produces a processed batch of augmented labeled examples X′ and a batch of augmented unlabeled examples with “guessed” labels U′.

  • U′ and X′ are then used in computing separate labeled and unlabeled loss terms.
  • More formally, the combined loss L for semi-supervised learning is defined as: L = L_X + λ_U·L_U, with L_X = (1/|X′|) Σ_{(x,p)∈X′} H(p, p_model(y | x; θ)) and L_U = (1/(L·|U′|)) Σ_{(u,q)∈U′} ||q − p_model(y | u; θ)||²,
  • where H(p, q) is the cross-entropy between distributions p and q.

Thus, cross-entropy loss is used for the labeled set, and squared L2 loss is used between predictions and guessed labels for the unlabeled set.

  • T=0.5, K=2, α=0.75, and λU are hyperparameters described below.
  • λU has different values for different datasets.
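The combined loss above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation; the function and argument names are my own, and `preds_x`/`preds_u` stand for the model's predicted class distributions on the mixed labeled and unlabeled batches:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q): cross-entropy between target distribution p and prediction q."""
    return -np.sum(p * np.log(q + eps), axis=-1)

def mixmatch_loss(labels, preds_x, guesses, preds_u, lambda_u, num_classes):
    """Combined loss: cross-entropy on the labeled batch X' plus a
    lambda_u-weighted squared-L2 loss on the unlabeled batch U'.
    The 1/L factor (L = num_classes) follows the definition of L_U."""
    loss_x = np.mean(cross_entropy(labels, preds_x))
    loss_u = np.mean(np.sum((guesses - preds_u) ** 2, axis=-1)) / num_classes
    return loss_x + lambda_u * loss_u
```

Because the unlabeled term is a bounded squared loss rather than cross-entropy, it is less sensitive to wrong guessed labels, which is why a separate weight λ_U controls its contribution.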
MixMatch Algorithm

1.2. Data Augmentation

  • For each xb in the batch of labeled data X, a transformed version is generated: ^xb = Augment(xb) (algorithm 1, line 3).

For each ub in the batch of unlabeled data U, K augmentations are generated: ^ub,k = Augment(ub) for k = 1, …, K (algorithm 1, line 5). These individual augmentations are used to generate a “guessed label” qb for each ub.
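The two augmentation steps can be sketched as below. `augment` here is only a stand-in stochastic transform (additive noise); the paper uses standard image augmentations such as random crops and horizontal flips:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    # Placeholder stochastic augmentation; the paper uses random
    # crops and flips for images. Here: small additive noise.
    return x + rng.normal(scale=0.1, size=x.shape)

def augment_batches(x_batch, u_batch, K=2):
    # One augmentation per labeled example xb ...
    x_hat = [augment(x) for x in x_batch]
    # ... and K augmentations per unlabeled example ub.
    u_hat = [[augment(u) for _ in range(K)] for u in u_batch]
    return x_hat, u_hat
```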

1.3. Label Guessing

  • For each unlabeled example in U, MixMatch produces a “guess” for the example’s label using the model’s predictions. This guess is later used in the unsupervised loss term.
  • To do so, the average of the model’s predicted class distributions across the K augmentations of ub is computed: ¯qb = (1/K) Σ_{k=1..K} p_model(y | ^ub,k; θ).

Using data augmentation to obtain an artificial target for an unlabeled example is common in consistency regularization methods.
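Label guessing is just an average of the model's predictions over the K augmented views. A small sketch, where `model_predict` is a placeholder for any function mapping an input to a class distribution:

```python
import numpy as np

def guess_label(model_predict, u_augs):
    """Average the model's predicted class distributions over the
    K augmentations of one unlabeled example (the guessed label q̄b)."""
    preds = np.stack([model_predict(u) for u in u_augs])  # shape (K, num_classes)
    return preds.mean(axis=0)
```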

1.4. Sharpening

  • Given the average prediction over augmentations ¯qb, a sharpening function is applied to reduce the entropy of the label distribution. In practice, the common approach is to adjust the “temperature” T of this categorical distribution: Sharpen(p, T)_i = p_i^(1/T) / Σ_{j=1..L} p_j^(1/T),
  • where p is some input categorical distribution (specifically in MixMatch, p is the average class prediction over augmentations), and T is a temperature hyperparameter. As T → 0, the output approaches a one-hot distribution.

Lowering the temperature encourages the model to produce lower-entropy predictions.
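The sharpening step is a one-liner in NumPy; with the paper's T = 0.5 it squares each probability before renormalizing, pushing mass toward the largest class:

```python
import numpy as np

def sharpen(p, T=0.5):
    """Sharpen(p, T)_i = p_i^(1/T) / sum_j p_j^(1/T).
    Smaller T produces a lower-entropy (more peaked) distribution."""
    p = np.asarray(p) ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)
```

For example, sharpen([0.6, 0.4]) with T = 0.5 yields [0.36, 0.16] / 0.52 ≈ [0.692, 0.308], noticeably more peaked than the input.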

1.5. mixup

  • A slightly modified version of mixup is used. mixup is a data augmentation technique originally proposed for supervised learning.
  • For a pair of two examples with their corresponding label probabilities (x1, p1), (x2, p2), the mixed pair (x′, p′) is computed by: λ ~ Beta(α, α), λ′ = max(λ, 1 − λ), x′ = λ′x1 + (1 − λ′)x2, p′ = λ′p1 + (1 − λ′)p2,
  • where α is the hyperparameter of the Beta distribution. Taking λ′ = max(λ, 1 − λ) is the modification: it ensures x′ is closer to x1 than to x2.
  • (Please feel free to read mixup if interested.)
  • To apply mixup, all augmented labeled examples with their labels and all unlabeled examples with their guessed labels are first collected into:
  • Then, these collections are concatenated and shuffled to form W which will serve as a data source for mixup:
  • For the i-th example-label pair in ^X, mixup is applied with the corresponding entry of W and the result is added to the collection X′. The remainder of W is used with ^U, where mixup is applied and the results are added to the collection U′.
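The modified mixup step above can be sketched as follows (a minimal illustration with my own function names, using the paper's α = 0.75):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, p1, x2, p2, alpha=0.75):
    """MixMatch's modified mixup: lam' = max(lam, 1 - lam) keeps the
    mixed example closer to the first argument, so labeled (resp.
    unlabeled) entries stay 'mostly' labeled (resp. unlabeled)."""
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # the modification vs. vanilla mixup
    x = lam * x1 + (1.0 - lam) * x2
    p = lam * p1 + (1.0 - lam) * p2
    return x, p
```

Keeping x′ nearer to x1 matters here because the labeled and unlabeled entries feed different loss terms, so each mixed example should stay dominated by its original batch.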

Thus, MixMatch transforms X into X′, a collection of labeled examples to which data augmentation and mixup (potentially mixing with an unlabeled example) have been applied.

Similarly, U is transformed into U′, a collection of multiple augmentations of each unlabeled example with corresponding label guesses.

2. Experimental Results

  • Wide ResNet WRN-28 is used as the network model.
Error rate (%) on CIFAR-10 (left) and SVHN (right)

CIFAR-10 (Left): MixMatch outperforms all other methods by a significant margin, for example reaching an error rate of 6.24% with 4000 labels.

SVHN (Right): MixMatch’s performance is relatively constant (and better than all other methods) across all amounts of labeled data.

CIFAR-10 and CIFAR-100 error rate

In general, MixMatch matches or outperforms the best results from SWA [2].

Comparison of error rates for SVHN and SVHN+Extra for MixMatch

On both training sets, MixMatch nearly matches the fully-supervised performance on the same training set almost immediately.

STL-10 error rate using 1000-label splits or the entire 5000-label training set

With 1000 examples, MixMatch surpasses both the state-of-the-art for 1000 examples as well as the state-of-the-art using all 5000 labeled examples.

Ablation study results. All values are error rates on CIFAR-10 with 250 or 4000 labels
  • Using a parameter EMA as in Mean Teacher slightly hurts MixMatch’s performance.

Each component contributes to MixMatch’s performance.

  • Dr. Ian Goodfellow is one of the authors of this paper.


