Review — There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

SWA & Fast SWA, Improving Π-Model & Mean Teacher

Sik-Ho Tsang
May 15, 2022

SWA & Fast SWA, by Cornell University
2019 ICLR, Over 100 Citations
Semi-Supervised Learning, Image Classification

  • Stochastic Weight Averaging (SWA), which averages weights along the trajectory of SGD with a modified learning rate schedule, is applied to the Π-Model and Mean Teacher.
  • Fast-SWA is further proposed, which accelerates convergence by averaging multiple points within each cycle of a cyclical learning rate schedule.


  1. Conventional Π-Model & Mean Teacher (MT)
  2. Proposed Stochastic Weight Averaging (SWA) & Fast SWA
  3. Experimental Results

1. Conventional Π-Model & Mean Teacher (MT)

1.1. Consistency Loss

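  • In the semi-supervised setting, we have access to labeled data $D_L = \{(x_i^L, y_i^L)\}_{i=1}^{N_L}$ and unlabeled data $D_U = \{x_i^U\}_{i=1}^{N_U}$.
  • Given two perturbed inputs $x'$, $x''$ of $x$ and the perturbed weights $w_f$ and $w_g$, the consistency loss penalizes the difference between the student's predicted probabilities $f(x', w_f)$ and the teacher's $g(x'', w_g)$.
  • This loss is typically the mean squared error or the KL divergence:
    $$\ell_{cons}^{MSE}(w_f, x) = \| f(x', w_f) - g(x'', w_g) \|^2 \quad \text{or} \quad \ell_{cons}^{KL}(w_f, x) = KL\big(f(x', w_f) \,\|\, g(x'', w_g)\big)$$
  • The total loss used to train the model can be written as:
    $$L(w_f) = \underbrace{\sum_{(x, y) \in D_L} \ell_{CE}(w_f, x, y)}_{L_{CE}} + \lambda \underbrace{\sum_{x \in D_L \cup D_U} \ell_{cons}(w_f, x)}_{L_{cons}}$$
  • where, for classification, $L_{CE}$ is the cross entropy between the model predictions and the supervised training labels. The parameter λ>0 controls the relative importance of the consistency term in the overall loss.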
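
To make the two loss terms concrete, here is a minimal sketch in plain Python. It assumes the consistency loss is either the squared L2 distance or KL(student || teacher) between probability vectors, and that the consistency term is summed over labeled and unlabeled examples alike. The function names (`mse_consistency`, `kl_consistency`, `total_loss`) are hypothetical and chosen only for illustration.

```python
import math

def mse_consistency(p_student, p_teacher):
    """Squared L2 distance between two probability vectors."""
    return sum((ps - pt) ** 2 for ps, pt in zip(p_student, p_teacher))

def kl_consistency(p_student, p_teacher, eps=1e-12):
    """KL(p_student || p_teacher); eps guards against log(0). Direction is an assumption."""
    return sum(ps * math.log((ps + eps) / (pt + eps))
               for ps, pt in zip(p_student, p_teacher))

def cross_entropy(p_student, label, eps=1e-12):
    """Cross entropy of a predicted distribution against an integer class label."""
    return -math.log(p_student[label] + eps)

def total_loss(labeled, unlabeled, lam, cons_fn=mse_consistency):
    """labeled: list of (p_student, p_teacher, label).
    unlabeled: list of (p_student, p_teacher).
    Cross entropy uses labeled examples only; consistency uses all examples.
    """
    ce = sum(cross_entropy(ps, y) for ps, _, y in labeled)
    cons = sum(cons_fn(ps, pt) for ps, pt, _ in labeled)
    cons += sum(cons_fn(ps, pt) for ps, pt in unlabeled)
    return ce + lam * cons
```

In a real training loop the teacher's output is usually treated as a constant target, so gradients flow only through the student's prediction.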

1.2. Π-Model

  • Π-Model uses the student model f as its own teacher.
  • The input is perturbed by random translations, crops, flips and additive Gaussian noise, and binary dropout is used for weight perturbation.
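
The idea of a model acting as its own teacher can be sketched with a toy linear classifier: the same weights are run twice with fresh noise, and the two outputs are compared. This is only an illustration under simplifying assumptions. It uses Gaussian input noise and inverted binary dropout, omits translations, crops and flips, and the names `stochastic_forward` and `pi_model_consistency` as well as the default noise levels are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def stochastic_forward(x, weights, rng, noise_std=0.15, drop_p=0.5):
    """One noisy pass of a toy linear classifier (weights: one row per class)."""
    # Input perturbation: additive Gaussian noise.
    noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
    # Inverted binary dropout on input units; for a linear layer this is
    # equivalent to zeroing (and rescaling) columns of the weight matrix.
    mask = [0.0 if rng.random() < drop_p else 1.0 / (1.0 - drop_p) for _ in x]
    dropped = [xi * mi for xi, mi in zip(noisy, mask)]
    logits = [sum(w * d for w, d in zip(row, dropped)) for row in weights]
    return softmax(logits)

def pi_model_consistency(x, weights, rng):
    """MSE between two stochastic passes that share the same weights."""
    student = stochastic_forward(x, weights, rng)
    teacher = stochastic_forward(x, weights, rng)  # same weights, fresh noise
    return sum((s - t) ** 2 for s, t in zip(student, teacher))
```

With both perturbations switched off (`noise_std=0.0`, `drop_p=0.0`) the two passes coincide and the consistency loss is zero, which is why the perturbations are essential to the method.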

1.3. Mean Teacher (MT)

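  • The teacher weights $w_g$ are the exponential moving average (EMA) of the student weights $w_f$:
    $$w_g^k = \alpha \cdot w_g^{k-1} + (1 - \alpha) \cdot w_f^k$$
  • where $k$ indexes the training step and the decay rate α is usually set between 0.9 and 0.999.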
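
The EMA update is a single line per weight. The sketch below assumes weights are stored as flat lists of floats and uses the standard EMA form (new teacher = α · old teacher + (1 − α) · student). The function name `ema_update` is hypothetical.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return [alpha * wg + (1.0 - alpha) * wf for wg, wf in zip(teacher, student)]

# Usage: call once after every student optimizer step.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
```

A larger α makes the teacher change more slowly, so it averages over a longer window of past student weights.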

2. Proposed Stochastic Weight Averaging (SWA) & Fast SWA

Left: Cyclical cosine learning rate schedule and SWA and fast-SWA averaging strategies. Middle: solutions explored by the cyclical cosine annealing schedule on an error surface. Right: Fast-SWA averages more points but the errors of the averaged points, as indicated by the heat color, are higher.

2.1. Cyclical Schedule

  • Stochastic Weight Averaging (SWA) is a recent approach by Izmailov et al., 2018, which is based on averaging weights traversed by SGD with a modified learning rate schedule.
  • For the first l ≤ l0 epochs, the network is pre-trained using the cosine annealing schedule.
  • After l epochs, a cyclical schedule is used, repeating the learning rates from epochs [l-c, l], where c is the cycle length.
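
The schedule described above can be sketched as a single function. This is an assumption-laden illustration: it uses the standard cosine annealing form η(i) = 0.5 · η0 · (1 + cos(π · i / l0)), which the review does not spell out, and the function name `learning_rate` and the example values are hypothetical.

```python
import math

def learning_rate(epoch, eta0, l0, l, c):
    """Cosine annealing for epochs before l, then repeat the segment
    starting at l - c every c epochs. `epoch` may be fractional."""
    if epoch >= l:
        # Map the epoch back into the repeated window [l - c, l).
        epoch = l - c + (epoch - l) % c
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * epoch / l0))

# Illustrative values only: eta0=0.1, l0=210, l=180, c=30.
schedule = [learning_rate(e, 0.1, 210, 180, 30) for e in range(0, 241)]
```

Within each cycle the learning rate decays from its value at epoch l-c and then jumps back up when the next cycle starts, which is what lets SGD keep exploring new solutions.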

2.2. SWA

  • Left (green dot): SWA collects the networks corresponding to the minimum values of the learning rate and averages their weights. The model with the averaged weights $w_{SWA}$ is then used to make predictions.
  • SWA is applied to the student network both for the Π-Model and Mean Teacher model.
  • However, SWA updates the average weights only once per cycle, which means that many additional training epochs are needed in order to collect enough weights for averaging.
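
A minimal sketch of the averaging step follows, assuming weights are flat lists of floats and that the learning-rate minima fall at epochs l, l+c, l+2c, and so on. The class `WeightAverager` and the helper `swa_collection_epochs` are hypothetical names used only for illustration.

```python
class WeightAverager:
    """Keeps an equal-weight running mean of the weight snapshots it is given."""

    def __init__(self):
        self.avg = None
        self.n = 0

    def update(self, weights):
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # Incremental mean: avg += (w - avg) / n
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]

def swa_collection_epochs(l, c, total_epochs):
    """Epochs at which a cycle ends (assumed learning-rate minima)."""
    return list(range(l, total_epochs + 1, c))

# Usage: one snapshot per cycle.
averager = WeightAverager()
for snapshot in ([1.0, 2.0], [3.0, 6.0]):
    averager.update(snapshot)
```

Under these assumptions, with l=180, c=30 and training up to epoch 240, only three snapshots (epochs 180, 210 and 240) enter the average, which illustrates why SWA needs many extra epochs.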

2.3. Fast SWA

  • Left (red dot): Fast-SWA is a modification of SWA that averages the networks at every k<c epochs, starting from epoch l-c. Multiple weights can even be averaged within a single epoch by setting k<1.
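
The denser sampling of fast-SWA can be sketched as follows. The code assumes snapshots are taken at times l-c, l-c+k, l-c+2k, and so on (in epochs, possibly fractional) and averaged with equal weight. The names `fast_swa_points` and `running_average` are hypothetical.

```python
def fast_swa_points(l, c, k, total_epochs):
    """Training times (in epochs) at which fast-SWA samples the weights."""
    points = []
    i = 0
    while l - c + i * k <= total_epochs:
        points.append(l - c + i * k)
        i += 1
    return points

def running_average(snapshots):
    """Equal-weight mean of a sequence of weight vectors."""
    avg = None
    for n, w in enumerate(snapshots, start=1):
        avg = list(w) if avg is None else [a + (x - a) / n for a, x in zip(avg, w)]
    return avg

# Illustrative values only: l=180, c=30, k=3, training up to epoch 240.
points = fast_swa_points(180, 30, 3, 240)
```

With these illustrative values fast-SWA collects 31 snapshots over the same span in which once-per-cycle sampling at epochs 180, 210 and 240 would collect 3. Setting k=0.5 samples twice per epoch.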

3. Experimental Results

Prediction errors of Π and MT models with and without fast-SWA

For all quantities of labeled data, fast-SWA substantially improves test accuracy in both architectures.

Test errors against current state-of-the-art semi-supervised results

The table above shows that fast-SWA significantly improves the performance of both the Π-Model and the Mean Teacher model.

Please feel free to read the paper for more detailed results if interested.


