Review — Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results

Mean Teacher, Teacher Student Approach, for Semi-Supervised Learning

Sik-Ho Tsang
Sep 25, 2021
Teacher Student Approach (Image from Pixabay)

In this story, Mean Teachers are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results, by The Curious AI Company, and Aalto University, is reviewed. In this paper:

  • Mean Teacher is proposed, which averages model weights, whereas Temporal Ensembling [13] averages label predictions.
  • Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling [13].
  • Without changing the network architecture, Mean Teacher achieves a lower error rate.

This is a paper in 2017 NeurIPS with over 1300 citations. The teacher-student approach has other uses as well, such as knowledge distillation. This is one of the beginner papers for the teacher-student approach in semi-supervised learning.


  1. Conceptual Idea of Applying Noise & Ensembling
  2. Mean Teacher
  3. Experimental Results

1. Conceptual Idea of Applying Noise & Ensembling

  • A sketch of a binary classification task with two labeled examples (large blue dots) and one unlabeled example, demonstrating how the choice of the unlabeled target (black circle) affects the fitted function (gray curve).
  • (a): A model with no regularization is free to fit any function that predicts the labeled training examples well.
  • (b): A model trained with noisy labeled data (small dots) learns to give consistent predictions around labeled data points.
  • (c): Consistency to noise around unlabeled examples provides additional smoothing. For the clarity of illustration, the teacher model (gray curve) is first fitted to the labeled examples, and then left unchanged during the training of the student model. Also for clarity, the small dots in figures (d) and (e) are omitted.
  • (d): Noise on the teacher model reduces the bias of the targets without additional training. The expected direction of stochastic gradient descent is towards the mean (large blue circle) of individual noisy targets (small blue circles).
  • (e): An ensemble of models gives an even better expected target. Both Temporal Ensembling [13] and the Mean Teacher method use this approach.

Thus, applying noise and using model ensembling are the keys in this paper.
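
As a toy illustration of panels (d) and (e), the sketch below compares a single noisy target with the mean of several noisy targets. The Gaussian noise, the clean value 0.7, and the helper names `noisy_target` and `ensemble_target` are assumptions made for this example only; they do not come from the paper.

```python
import random
import statistics


def noisy_target(clean, sigma, rng):
    """One teacher target: the clean prediction plus Gaussian noise (assumed)."""
    return clean + rng.gauss(0.0, sigma)


def ensemble_target(clean, sigma, n_members, rng):
    """Mean of several independent noisy targets, as in panel (e)."""
    return statistics.fmean(
        noisy_target(clean, sigma, rng) for _ in range(n_members)
    )


rng = random.Random(0)  # fixed seed so the comparison is repeatable
clean = 0.7             # hypothetical noise-free teacher prediction
trials = 2000

# Average absolute deviation from the clean prediction, single vs. ensemble.
single_err = statistics.fmean(
    abs(noisy_target(clean, 0.2, rng) - clean) for _ in range(trials)
)
ensemble_err = statistics.fmean(
    abs(ensemble_target(clean, 0.2, 10, rng) - clean) for _ in range(trials)
)

print(f"single target error:   {single_err:.3f}")
print(f"ensemble target error: {ensemble_err:.3f}")
```

The averaged target stays closer to the clean prediction, which is the intuition for using an ensemble as the teacher.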

2. Mean Teacher

Mean Teacher

2.1. Applying Noise

  • Both the student and the teacher model evaluate the input with noise (η, η’) applied within their computation.

2.2. Model Ensembling Using EMA

  • The softmax output of the student model is compared with the one-hot label using classification cost and with the teacher output using consistency cost.
  • After the weights of the student model have been updated with gradient descent, the teacher model weights are updated as an exponential moving average (EMA) of the student weights.
  • Both model outputs can be used for prediction, but at the end of the training the teacher prediction is more likely to be correct.
  • A training step with an unlabeled example is similar, except that no classification cost is applied, so only the consistency cost drives the update. (This is how unlabeled data is used in the semi-supervised setting that the paper focuses on.)
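
The per-example cost described in the list above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names and the `consistency_weight` parameter are hypothetical, and the logits are made-up numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Classification cost against a one-hot label given as a class index."""
    return -math.log(probs[label])


def mse(p, q):
    """Mean squared difference between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def step_cost(student_logits, teacher_logits, label=None, consistency_weight=1.0):
    """Consistency cost always; classification cost only when a label exists."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)  # treated as a constant target
    cost = consistency_weight * mse(student, teacher)
    if label is not None:
        cost += cross_entropy(student, label)
    return cost


labeled_cost = step_cost([2.0, 0.5, -1.0], [1.5, 0.7, -0.8], label=0)
unlabeled_cost = step_cost([2.0, 0.5, -1.0], [1.5, 0.7, -0.8])
```

In this sketch an unlabeled example simply passes `label=None`, so it contributes only the consistency term.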

2.3. Consistency Cost

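  • Specifically, the consistency cost J is defined as the expected distance between the prediction of the student model (with weights θ and noise η) and the prediction of the teacher model (with weights θ’ and noise η’):
    $$J(\theta) = \mathbb{E}_{x, \eta', \eta}\left[ \lVert f(x; \theta'; \eta') - f(x; \theta; \eta) \rVert^2 \right]$$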
  • The difference between the Π model, Temporal Ensembling, and Mean teacher is how the teacher predictions are generated.
  • Whereas the Π model uses θ’=θ, and Temporal Ensembling approximates f(x; θ’; η’) with a weighted average of successive predictions, Mean Teacher defines θ’t at training step t as the EMA of successive weights:
    $$\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha) \theta_t$$
  • where α is a smoothing coefficient hyperparameter.
  • An additional difference between the three algorithms is that the Π model applies training to θ’ whereas Temporal Ensembling and Mean Teacher treat it as a constant with regard to optimization.
  • Mean squared error (MSE) is used as the consistency cost.
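
The EMA update of the teacher weights can be sketched in a few lines. This is a minimal sketch that treats the weights as a flat list of floats; the function name `ema_update` and the numbers are assumptions for illustration, and a real implementation would apply the same rule to every parameter tensor.

```python
def ema_update(teacher_weights, student_weights, alpha):
    """Return alpha * teacher + (1 - alpha) * student, element by element."""
    return [
        alpha * t + (1.0 - alpha) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


teacher = [0.0, 0.0]
student = [1.0, 1.0]  # held fixed here; in real training it changes every step

# Three EMA steps with a smoothing coefficient of 0.9.
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)

# With alpha = 0 the teacher is just a copy of the student (theta' = theta).
pi_like = ema_update([0.0, 0.0], student, alpha=0.0)
```

A larger α makes the teacher change more slowly, so it averages over more past student weights. With α = 0 the update copies the student weights, which matches the θ’=θ case noted for the Π model.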

3. Experimental Results

3.1. SVHN & CIFAR-10

Error rate percentage on SVHN over 10 runs (4 runs when using all labels).
Error rate percentage on CIFAR-10 over 10 runs (4 runs when using all labels).
  • All the methods in the comparison use a similar 13-layer ConvNet architecture.
  • Mean Teacher improves test accuracy over the Π model and Temporal Ensembling on semi-supervised SVHN tasks.
  • Mean Teacher also improves results on CIFAR-10 over the baseline Π model.
  • Virtual Adversarial Training (VAT) performs even better than Mean Teacher on the 1000-label SVHN and the 4000-label CIFAR-10. Yet, VAT and Mean Teacher are complementary approaches.

3.2. SVHN with Extra Unlabeled Data

Error percentage over 10 runs on SVHN with extra unlabeled training data
  • Besides the primary training data, SVHN also includes an extra dataset of 531131 examples. 500 samples are picked from the primary training set as the labeled training examples.
  • The rest of the primary training set, together with the extra training set, is used as unlabeled examples.
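
The split described above can be sketched as follows. This is a hedged illustration only: the function name `split_labeled_unlabeled`, the uniform random sampling, and the tiny placeholder datasets are assumptions, and the paper's exact sampling procedure may differ.

```python
import random


def split_labeled_unlabeled(primary, extra, n_labeled, seed=0):
    """Keep labels for n_labeled primary examples; drop labels everywhere else.

    `primary` and `extra` are lists of (input, label) pairs.
    """
    rng = random.Random(seed)
    indices = list(range(len(primary)))
    rng.shuffle(indices)  # uniform random choice (an assumption)

    labeled = [primary[i] for i in indices[:n_labeled]]
    # The remaining primary examples and all extra examples lose their labels.
    unlabeled = [primary[i][0] for i in indices[n_labeled:]]
    unlabeled += [x for x, _ in extra]
    return labeled, unlabeled


# Tiny placeholder data standing in for the real SVHN sets.
primary = [(f"primary_{i}", i % 10) for i in range(20)]
extra = [(f"extra_{i}", i % 10) for i in range(30)]
labeled, unlabeled = split_labeled_unlabeled(primary, extra, n_labeled=5)
```

For the experiment above, `n_labeled` would be 500 and `extra` would hold the 531131 extra examples.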

Mean Teacher again outperforms the Π model.

3.3. Analysis of the Training Curves

Smoothened classification cost (top) and classification error (bottom) of Mean Teacher and baseline Π model on SVHN over the first 100000 training steps
  • As expected, the EMA-weighted models (blue and dark gray curves in the bottom row) give more accurate predictions than the bare student models (orange and light gray) after an initial period.

Using the EMA-weighted model as the teacher improves results in the semi-supervised settings.

Mean Teacher helps when labels are scarce.

  • When using 500 labels (middle column) Mean Teacher learns faster, and continues training after the Π model stops improving.
  • On the other hand, in the all-labeled case (left column), Mean Teacher and the Π model behave virtually identically.

Mean Teacher uses unlabeled training data more efficiently than the Π model.

3.4. Mean Teacher with ResNet on CIFAR-10 and ImageNet

Error rate percentage of ResNet Mean Teacher compared to the state of the art
  • Experiments are run using a 12-block (26-layer) Residual Network [8] (ResNet) with Shake-Shake regularization [5] on CIFAR-10.
  • A 50-block (152-layer) ResNeXt architecture is used on ImageNet using 10% of the labels.
  • The results improve remarkably with the better network architecture.

The paper also contains ablation experiments (e.g. the effect of applying noise) and an appendix (e.g. experimental settings). If interested, please feel free to read the paper.


