Review — MixMatch: A Holistic Approach to Semi-Supervised Learning

Data Augmentation Using mixup for Semi-Supervised Learning

  • MixMatch is proposed, which guesses low-entropy labels for data-augmented unlabeled examples, and mixes labeled and unlabeled data using mixup.


  1. MixMatch
  2. Experimental Results

1. MixMatch

1.1. Combined Loss

MixMatch: Augmentation, Averaging and Sharpening
  • MixMatch is a “holistic” approach that incorporates ideas and components from the dominant paradigms of prior semi-supervised learning (SSL) work.
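  • Given a batch X of labeled examples with one-hot targets (representing one of L possible labels) and an equally-sized batch U of unlabeled examples, MixMatch produces a processed batch of augmented labeled examples X’ and a batch of augmented unlabeled examples with “guessed” labels U’:
      $$\mathcal{X}', \mathcal{U}' = \mathrm{MixMatch}(\mathcal{X}, \mathcal{U}, T, K, \alpha)$$
  • U’ and X’ are then used in computing separate labeled and unlabeled loss terms.
  • More formally, the combined loss for semi-supervised learning is defined as:
      $$\mathcal{L}_\mathcal{X} = \frac{1}{|\mathcal{X}'|} \sum_{x, p \in \mathcal{X}'} H(p, p_{\text{model}}(y \mid x; \theta))$$
      $$\mathcal{L}_\mathcal{U} = \frac{1}{L|\mathcal{U}'|} \sum_{u, q \in \mathcal{U}'} \| q - p_{\text{model}}(y \mid u; \theta) \|_2^2$$
      $$\mathcal{L} = \mathcal{L}_\mathcal{X} + \lambda_\mathcal{U} \mathcal{L}_\mathcal{U}$$
  • where H(p, q) is the cross-entropy between distributions p and q.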
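  • The sharpening temperature T=0.5, the number of augmentations K=2, the mixup Beta-distribution parameter α=0.75, and the unsupervised loss weight λU are hyperparameters described below.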
  • λU has different values for different datasets.
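
As a rough illustration of the combined loss, here is a minimal pure-Python sketch. It assumes, following the paper, that the labeled term is the mean cross-entropy over X’ and that the unlabeled term is the squared L2 distance between guessed labels and model predictions over U’, divided by the number of classes. The function names (cross_entropy, squared_l2, mixmatch_loss) and the list-of-lists data layout are hypothetical choices made for this example, not the authors' implementation.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i) for two categorical distributions."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))


def squared_l2(q, p):
    """Squared L2 distance between two probability vectors."""
    return sum((qi - pi) ** 2 for qi, pi in zip(q, p))


def mixmatch_loss(labeled_targets, labeled_preds,
                  guessed_labels, unlabeled_preds, lambda_u):
    """Combined loss L = L_X + lambda_u * L_U (illustrative sketch).

    labeled_targets / labeled_preds: targets and predictions for X'.
    guessed_labels / unlabeled_preds: guessed labels and predictions for U'.
    Every entry is a list of class probabilities.
    """
    num_classes = len(labeled_targets[0])
    # Labeled term: mean cross-entropy over X'.
    loss_x = sum(
        cross_entropy(p, q) for p, q in zip(labeled_targets, labeled_preds)
    ) / len(labeled_targets)
    # Unlabeled term: squared L2 over U', normalized by the class count.
    loss_u = sum(
        squared_l2(q, p) for q, p in zip(guessed_labels, unlabeled_preds)
    ) / (num_classes * len(guessed_labels))
    return loss_x + lambda_u * loss_u


# Toy usage with two classes; lambda_u=2.0 is an arbitrary value for the demo.
example_loss = mixmatch_loss(
    labeled_targets=[[1.0, 0.0]],
    labeled_preds=[[0.5, 0.5]],
    guessed_labels=[[1.0, 0.0]],
    unlabeled_preds=[[0.0, 1.0]],
    lambda_u=2.0,
)
print(example_loss)
```

The squared L2 term is bounded, unlike cross-entropy, so a confidently wrong guessed label cannot blow up the unlabeled loss in this sketch.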
MixMatch Algorithm

1.2. Data Augmentation

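  • For each xb in the batch of labeled data X, a transformed version is generated: x̂b = Augment(xb) (Algorithm 1, line 3).
  • For each ub in the batch of unlabeled data U, K augmentations are generated: ûb,k = Augment(ub), k ∈ (1, …, K).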

1.3. Label Guessing

  • For each unlabeled example in U, MixMatch produces a “guess” for the example’s label using the model’s predictions. This guess is later used in the unsupervised loss term.
  • To do so, the average of the model’s predicted class distributions across all K augmentations of ub is computed by:
      $$\bar{q}_b = \frac{1}{K} \sum_{k=1}^{K} p_{\text{model}}(y \mid \hat{u}_{b,k}; \theta)$$
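
The averaging step can be sketched in a few lines of pure Python. This is only an illustration: guess_label is a hypothetical helper, and toy_model and toy_augment are made-up stand-ins for a real classifier and a real stochastic augmentation.

```python
import math
import random


def guess_label(model, augment, u, k=2):
    """Average the model's class distributions over k augmentations of u."""
    preds = [model(augment(u)) for _ in range(k)]
    num_classes = len(preds[0])
    return [sum(p[c] for p in preds) / k for c in range(num_classes)]


def toy_augment(u):
    """Hypothetical augmentation: add small Gaussian noise to each feature."""
    return [v + random.gauss(0.0, 0.1) for v in u]


def toy_model(x):
    """Hypothetical two-class model: a sigmoid over the feature sum."""
    s = 1.0 / (1.0 + math.exp(-sum(x)))
    return [s, 1.0 - s]


q_bar = guess_label(toy_model, toy_augment, [0.2, -0.1], k=2)
print(q_bar)  # a two-element distribution; values vary from run to run
```

Because each augmentation is random, the average differs between runs, but it is always a valid probability distribution over the classes.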

1.4. Sharpening

  • Given the average prediction over augmentations q̄b, a sharpening function is applied to reduce the entropy of the label distribution. In practice, the sharpening function follows the common approach of adjusting the “temperature” T of this categorical distribution:
      $$\mathrm{Sharpen}(p, T)_i := p_i^{1/T} \Big/ \sum_{j=1}^{L} p_j^{1/T}$$
  • where p is some input categorical distribution (specifically in MixMatch, p is the average class prediction over augmentations), and T is a temperature hyperparameter. As T approaches 0, the output approaches a one-hot distribution.
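
To make the effect of the temperature concrete, here is a minimal sketch of sharpening, assuming the paper's form: raise each probability to the power 1/T, then renormalize. The function name sharpen is a choice made for this example.

```python
def sharpen(p, t=0.5):
    """Temperature sharpening: p_i ** (1 / t), renormalized to sum to 1."""
    powered = [pi ** (1.0 / t) for pi in p]
    total = sum(powered)
    return [v / total for v in powered]


# With t=0.5 each probability is squared before renormalizing:
# [0.6, 0.4] -> [0.36, 0.16] / 0.52, so the larger class gains mass.
sharpened = sharpen([0.6, 0.4], t=0.5)
print(sharpened)
```

With t=1 the distribution is returned unchanged, and smaller values of t move more probability mass onto the most likely class.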

1.5. mixup

  • A slightly modified version of mixup is used. mixup is a data augmentation technique originally proposed for supervised learning.
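  • For a pair of examples with their corresponding label probabilities (x1, p1), (x2, p2), the mixed pair (x’, p’) is computed by:
      $$\lambda \sim \mathrm{Beta}(\alpha, \alpha)$$
      $$\lambda' = \max(\lambda, 1 - \lambda)$$
      $$x' = \lambda' x_1 + (1 - \lambda') x_2$$
      $$p' = \lambda' p_1 + (1 - \lambda') p_2$$
  • where α is the hyperparameter of the Beta distribution. Taking λ’ = max(λ, 1 − λ) is the modification to standard mixup: it ensures that x’ is closer to x1 than to x2.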
  • (Please feel free to read mixup if interested.)
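  • To apply mixup, all augmented labeled examples with their labels and all augmented unlabeled examples with their guessed labels are first collected into (where B is the batch size):
      $$\hat{\mathcal{X}} = ((\hat{x}_b, p_b);\ b \in (1, \ldots, B))$$
      $$\hat{\mathcal{U}} = ((\hat{u}_{b,k}, q_b);\ b \in (1, \ldots, B),\ k \in (1, \ldots, K))$$
  • Then, these collections are concatenated and shuffled to form W, which serves as the data source for mixup:
      $$\mathcal{W} = \mathrm{Shuffle}(\mathrm{Concat}(\hat{\mathcal{X}}, \hat{\mathcal{U}}))$$
  • For the i-th example-label pair in X̂, mixup is applied with the i-th entry of W, and the result is added to the collection X’. The remainder of W is used in the same way for Û, and the results are added to the collection U’:
      $$\mathcal{X}' = (\mathrm{MixUp}(\hat{\mathcal{X}}_i, \mathcal{W}_i);\ i \in (1, \ldots, |\hat{\mathcal{X}}|))$$
      $$\mathcal{U}' = (\mathrm{MixUp}(\hat{\mathcal{U}}_i, \mathcal{W}_{i + |\hat{\mathcal{X}}|});\ i \in (1, \ldots, |\hat{\mathcal{U}}|))$$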
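
The mixing procedure above can be sketched as follows, assuming the paper's modified mixup, in which the sampled weight λ is replaced by max(λ, 1 − λ) so that each mixed example stays closer to its first argument. The names mixup_pair and mix_batches, and the representation of each example as a (features, label distribution) tuple of plain lists, are hypothetical choices for this illustration.

```python
import random


def mixup_pair(x1, p1, x2, p2, alpha=0.75, rng=random):
    """Modified mixup: the result stays closer to (x1, p1) than to (x2, p2)."""
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # the modification relative to standard mixup
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    p = [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]
    return x, p


def mix_batches(x_hat, u_hat, alpha=0.75, rng=random):
    """Build X' and U' from the collections X-hat and U-hat.

    x_hat, u_hat: lists of (example, label distribution) pairs.
    """
    # W = Shuffle(Concat(X-hat, U-hat)); copy so the inputs are not mutated.
    w = list(x_hat) + list(u_hat)
    rng.shuffle(w)
    # The first |X-hat| entries of W are mixed into the labeled examples.
    x_prime = [
        mixup_pair(x, p, *w[i], alpha=alpha, rng=rng)
        for i, (x, p) in enumerate(x_hat)
    ]
    # The remaining entries of W are mixed into the unlabeled examples.
    offset = len(x_hat)
    u_prime = [
        mixup_pair(u, q, *w[offset + i], alpha=alpha, rng=rng)
        for i, (u, q) in enumerate(u_hat)
    ]
    return x_prime, u_prime


# Toy usage: 2 labeled pairs and 4 unlabeled pairs (e.g. 2 examples with K=2).
demo_rng = random.Random(0)
demo_x_hat = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
demo_u_hat = [([0.5, 0.5], [0.5, 0.5]) for _ in range(4)]
demo_x_prime, demo_u_prime = mix_batches(demo_x_hat, demo_u_hat, rng=demo_rng)
print(len(demo_x_prime), len(demo_u_prime))
```

Because W contains both labeled and unlabeled entries, a labeled example may be mixed with an unlabeled one and vice versa. Keeping the larger weight on the first argument is what lets the outputs still be sorted into X’ and U’.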

2. Experimental Results

  • Wide ResNet WRN-28 is used as the network model.
Error rate (%) on CIFAR-10 (left) and SVHN (right)
CIFAR-10 and CIFAR-100 error rate
Comparison of error rates for SVHN and SVHN+Extra for MixMatch
STL-10 error rate using 1000-label splits or the entire 5000-label training set
Ablation study results. All values are error rates on CIFAR-10 with 250 or 4000 labels
  • Using an exponential moving average (EMA) of model parameters, similar to Mean Teacher, hurts MixMatch’s performance slightly.
  • Dr. Ian Goodfellow is one of the authors of this paper.


