Review — Noisy Student: Self-training with Noisy Student improves ImageNet classification

Semi-Supervised Learning Using Noisy Student Training, Where Noise Is Added to the Student

Sik-Ho Tsang
7 min read, Jan 28, 2022
Noisy Student Training

Self-training with Noisy Student improves ImageNet classification (Noisy Student), by Google Research, Brain Team, and Carnegie Mellon University
2020 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Teacher Student Model, Pseudo Label, Semi-Supervised Learning, Image Classification

  • Noisy Student Training extends the idea of self-training and distillation by using equal-or-larger student models and by adding noise (Dropout, Stochastic Depth, and data augmentation) to the student during learning. The procedure is:
  1. An EfficientNet model is first trained on labeled images as a teacher to generate pseudo labels for 300M unlabeled images.
  2. Then a larger EfficientNet is trained as a student model on the combination of labeled and pseudo labeled images.
  3. This process is iterated by putting back the student as the teacher.
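
The three steps above can be sketched with a toy example. This is only an illustration under loud assumptions: the "model" is a 1-D nearest-centroid classifier, the noise is Gaussian jitter on the input, and the names train_centroids, predict_soft, harden and noisy_student are made up here. The real method uses EfficientNets, RandAugment, Dropout and Stochastic Depth, and a student that can be larger than the teacher.

```python
import math
import random


def train_centroids(examples):
    """Toy 'model': one centroid per class, fitted on (x, soft_label) pairs."""
    sums, weights = [0.0, 0.0], [0.0, 0.0]
    for x, probs in examples:
        for c in (0, 1):
            sums[c] += probs[c] * x
            weights[c] += probs[c]
    # Assumes every class receives some weight, which holds for the toy data.
    return [sums[c] / weights[c] for c in (0, 1)]


def predict_soft(model, x):
    """Soft pseudo label: softmax over negative squared distances."""
    scores = [-(x - centroid) ** 2 for centroid in model]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def harden(probs):
    """Hard pseudo label: one-hot vector at the most likely class."""
    best = probs.index(max(probs))
    return [1.0 if c == best else 0.0 for c in range(len(probs))]


def noisy_student(labeled, unlabeled, iterations, rng, noise_scale=0.3):
    # Train the teacher on labeled data only.
    teacher = train_centroids(labeled)
    for _ in range(iterations):
        # The un-noised teacher labels the clean unlabeled inputs.
        pseudo = [(x, predict_soft(teacher, x)) for x in unlabeled]
        # The student trains on labeled + pseudo-labeled data,
        # with input noise applied only on the student side.
        noised = [(x + rng.gauss(0.0, noise_scale), y)
                  for x, y in labeled + pseudo]
        student = train_centroids(noised)
        # The student is put back as the teacher for the next round.
        teacher = student
    return teacher
```

To try hard pseudo labels instead of soft ones in this toy, wrap the teacher's output as harden(predict_soft(teacher, x)).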


  1. Noisy Student Training Procedures
  2. Noise Added to Student
  3. Other Details
  4. Experimental Results

1. Noisy Student Training Procedures

  • Labeled images {(x1, y1), (x2, y2), …, (xn, yn)}
  • Unlabeled images {x̃1, x̃2, …, x̃m}
  • Step 1: Learn a teacher model θt* that minimizes the cross-entropy loss on labeled images.
  • Step 2: Use a normal (i.e., not noised) teacher model to generate soft or hard pseudo labels for clean (i.e., not distorted) unlabeled images.
  • Soft pseudo labels work better.
  • Step 3: Learn an equal-or-larger student model θs* that minimizes the cross-entropy loss on labeled and pseudo-labeled images, with noise added to the student model.
  • Step 4: Iterate by using the student as the new teacher and going back to Step 2.
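
The formulas for Steps 1 to 3 are not reproduced in the list above. Written out from the paper's algorithm as I understand it, they look roughly as follows. Here ℓ is the cross-entropy loss, f is the network, f^noised is the network with noise applied, and ỹ denotes a pseudo label. Treat the exact notation as my transcription rather than a verbatim copy.

```latex
% Step 1: teacher objective on labeled images
\theta^{t}_{*} = \arg\min_{\theta^{t}} \frac{1}{n} \sum_{i=1}^{n}
    \ell\big(y_i,\, f^{noised}(x_i, \theta^{t})\big)

% Step 2: pseudo labels from the un-noised teacher on clean unlabeled images
\tilde{y}_i = f(\tilde{x}_i, \theta^{t}_{*}), \quad \forall i = 1, \dots, m

% Step 3: student objective on labeled and pseudo-labeled images
\theta^{s}_{*} = \arg\min_{\theta^{s}} \frac{1}{n} \sum_{i=1}^{n}
    \ell\big(y_i,\, f^{noised}(x_i, \theta^{s})\big)
  + \frac{1}{m} \sum_{i=1}^{m}
    \ell\big(\tilde{y}_i,\, f^{noised}(\tilde{x}_i, \theta^{s})\big)
```

Note that the teacher in Step 2 is evaluated without noise, while both loss terms in Step 3 use the noised student.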

2. Noise Added to Student

  • Two types of noise: input noise and model noise.
  1. For input noise, data augmentation with RandAugment [18] is used. In brief, RandAugment applies randomly selected image transformations such as Brightness, Contrast and Sharpness.
  2. For model noise, Dropout [76] and Stochastic Depth [37] are used.
  • Noise has an important benefit of enforcing invariances in the decision function on both labeled and unlabeled data.
  • Data augmentation is an important noising method in Noisy Student Training because it forces the student to ensure prediction consistency across augmented versions of an image.
  • The teacher produces high-quality pseudo labels by reading in clean images, while the student is required to reproduce those labels with augmented images as input.
  • When Dropout and Stochastic Depth are used as noise, the teacher behaves like an ensemble at inference time (when it generates pseudo labels), whereas the student behaves like a single model. In other words, the student is forced to mimic a more powerful ensemble model.
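
To make the two kinds of model noise concrete, here is a minimal sketch on plain Python lists. It assumes inverted Dropout (survivors are rescaled during training) and the original Stochastic Depth formulation (the residual branch is dropped at random during training and scaled by its survival probability at inference). The function names and the list-based "layers" are hypothetical, not EfficientNet code.

```python
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` and
    rescale the survivors so the expected activation is unchanged."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def residual_block(values, layer_fn, survival_prob, rng, training=True):
    """Stochastic depth around a residual branch `layer_fn`.

    Training: keep the branch with probability `survival_prob`,
    otherwise pass the input through unchanged (identity only).
    Inference: always use the branch, scaled by `survival_prob`.
    """
    if training:
        if rng.random() < survival_prob:
            return [v + r for v, r in zip(values, layer_fn(values))]
        return list(values)
    return [v + survival_prob * r for v, r in zip(values, layer_fn(values))]
```

With training=False both functions are deterministic, which corresponds to the teacher generating pseudo labels. With training=True each call samples a different sub-network, which is the situation the student is trained in.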

3. Other Details

3.1. Other Techniques

  • Noisy Student Training also works better with an additional trick: data filtering and balancing.
  • Images on which the teacher model has low confidence (<0.3) are filtered out, since they are usually out-of-domain images.
  • The number of unlabeled images for each class needs to be balanced, as all classes in ImageNet have a similar number of labeled images. For this purpose, images in classes that do not have enough images are duplicated. For classes that have too many images, only the images with the highest confidence are kept (at most 130K per class).
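
The filtering and balancing trick can be sketched as below. This is a simplified illustration: filter_and_balance and its arguments are names I made up, duplication is done by cycling through the kept images (the review does not say how duplicates are chosen), and images with confidence exactly at the threshold are kept.

```python
from collections import defaultdict


def filter_and_balance(predictions, per_class, min_confidence=0.3):
    """Filter low-confidence pseudo labels, then balance the classes.

    predictions: list of (image_id, class_id, confidence) from the teacher.
    Returns {class_id: [image_id, ...]} with exactly `per_class` entries
    for every class that has at least one image left after filtering.
    """
    by_class = defaultdict(list)
    for image_id, class_id, confidence in predictions:
        # Drop images the teacher is unsure about (likely out-of-domain).
        if confidence >= min_confidence:
            by_class[class_id].append((confidence, image_id))

    balanced = {}
    for class_id, items in by_class.items():
        # Highest-confidence images first.
        items.sort(key=lambda pair: pair[0], reverse=True)
        ids = [image_id for _, image_id in items]
        if len(ids) >= per_class:
            # Too many images: keep only the most confident ones.
            balanced[class_id] = ids[:per_class]
        else:
            # Too few images: duplicate by cycling (an assumption).
            balanced[class_id] = [ids[i % len(ids)] for i in range(per_class)]
    return balanced
```

In the setting described above, per_class would be 130K and min_confidence 0.3.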

3.2. Some Training Details

  • EfficientNet-L2 is used, which is wider and deeper than EfficientNet-B7 but uses a lower resolution.
  • FixRes is used.
  • JFT is used as the unlabeled dataset. The total number of images for training a student model is 130M (with some duplicated images). Due to duplications, there are only 81M unique images among these 130M images.
  • The largest model, EfficientNet-L2, needs to be trained for 6 days on a Cloud TPU v3 Pod (2048 cores) when the unlabeled batch size is 14× the labeled batch size.

4. Experimental Results

4.1. SOTA Comparison

Top-1 and Top-5 Accuracy of Noisy Student Training and previous state-of-the-art methods on ImageNet.
  • EfficientNet-L2 with Noisy Student Training achieves 88.4% top-1 accuracy, which is significantly better than the best previously reported EfficientNet accuracy of 85.0%. The total gain of 3.4% comes from two sources: making the model larger (+0.5%) and Noisy Student Training (+2.9%).

In other words, Noisy Student Training makes a much larger impact on the accuracy than changing the architecture.

  • Noisy Student Training outperforms the previous state-of-the-art accuracy of 86.4% achieved by FixRes ResNeXt-101 WSL [55, 86], which requires 3.5 billion Instagram images labeled with tags.
  • By comparison, Noisy Student Training only requires 300M unlabeled images, which are perhaps easier to collect. The proposed model also has roughly half as many parameters as FixRes ResNeXt-101 WSL.
  • Noisy Student Training also outperforms BiT-L while using only about half the model size.

4.2. Noisy Student Training Without Iterative Training

Noisy Student Training leads to significant improvements across all model sizes
  • Noisy Student Training leads to a consistent improvement of around 0.8% for all model sizes. Overall, EfficientNets with Noisy Student Training provide a much better tradeoff between model size and accuracy than prior works.

4.3. Robustness Benchmarks

Left: ImageNet-A, Middle: ImageNet-C, Right: ImageNet-P (Black: Prediction With Noisy Student, Red: Prediction Without Noisy Student)
  • The ImageNet-C and ImageNet-P test sets [31] include images with common corruptions and perturbations such as blurring, fogging, rotation and scaling.
  • ImageNet-A test set [32] consists of difficult images that cause significant drops in accuracy to state-of-the-art models.

The predictions of the standard model are incorrect while the predictions of the model with Noisy Student Training are correct.

Left: ImageNet-A, Middle: ImageNet-C, Right: ImageNet-P
  • On ImageNet-A, it improves the top-1 accuracy from 61.0% to 83.7%.
  • On ImageNet-C, it reduces mean corruption error (mCE) from 45.7 to 28.3.
  • On ImageNet-P, it leads to a mean flip rate (mFR) of 14.2 if we use a resolution of 224×224 (direct comparison) and 12.2 if we use a resolution of 299×299.
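
For readers who have not met mCE before, it comes from the ImageNet-C benchmark [31], not from the Noisy Student paper. As I understand the benchmark definition, the error of classifier f on each corruption type c is summed over the five severity levels s and normalized by AlexNet's error, then averaged over the set C of corruption types:

```latex
\mathrm{CE}^{f}_{c} =
  \frac{\sum_{s=1}^{5} E^{f}_{s,c}}{\sum_{s=1}^{5} E^{\mathrm{AlexNet}}_{s,c}},
\qquad
\mathrm{mCE}^{f} = \frac{1}{|C|} \sum_{c \in C} \mathrm{CE}^{f}_{c}
```

Here E denotes top-1 error, and the result is reported as a percentage, so lower is better. mFR on ImageNet-P is built in an analogous way from flip probabilities, again normalized by AlexNet.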

4.4. Adversarial Robustness Results

Noisy Student Training improves adversarial robustness against an FGSM attack
  • The performance on adversarial perturbations is studied.
  • The FGSM attack performs one gradient-sign step on the input image [25], moving each pixel by ε in the direction that increases the loss.

Under a stronger PGD attack with 10 iterations at ε = 16, Noisy Student Training improves EfficientNet-L2’s accuracy from 1.1% to 4.4%, even though the model is not optimized for adversarial robustness.
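
As a reminder of what FGSM does, here is a minimal sketch on a toy logistic-regression classifier, where the gradient of the cross-entropy loss with respect to the input has a closed form. The model, the [0, 1] pixel range and the name fgsm_logistic are assumptions for illustration only; the paper attacks EfficientNets, which requires automatic differentiation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_logistic(x, y, weights, bias, epsilon):
    """One FGSM step against p = sigmoid(w . x + b) with label y in {0, 1}.

    For the cross-entropy loss, d(loss)/d(x_i) = (p - y) * w_i, so each
    pixel moves by epsilon in the direction that increases the loss.
    The result is clipped back to the assumed valid pixel range [0, 1].
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    grad = [(p - y) * w for w in weights]
    return [min(1.0, max(0.0, xi + epsilon * sign(g)))
            for xi, g in zip(x, grad)]
```

Unlike PGD, which repeats small steps several times, FGSM takes this single step, so it is the weaker of the two attacks.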

4.5. Ablation Study

Ablation study of noising
  • Noise such as Stochastic Depth, Dropout and data augmentation plays an important role in enabling the student model to perform better than the teacher. Performance consistently drops when the noise functions are removed.
  • It is found that adding noise to the teacher model that generates pseudo labels leads to lower accuracy, which shows the importance of having a powerful unnoised teacher model.
Ablation study of iterative training
  • The model performance improves to 87.6% in the first iteration and then to 88.1% in the second iteration.
  • For the last iteration, a larger ratio between unlabeled batch size and labeled batch size is used to boost the final performance to 88.4%.
  • (More ablations are presented in the appendix of the paper. Please feel free to read if interested.)

Take Away

  1. Using a large teacher model with better performance leads to better results.
  2. A large amount of unlabeled data is necessary for better performance.
  3. Soft pseudo labels work better than hard pseudo labels for out-of-domain data in certain cases.
  4. A large student model is important to enable the student to learn a more powerful model.
  5. Data balancing is useful for small models.
  6. Joint training on labeled data and unlabeled data outperforms the pipeline that first pretrains with unlabeled data and then finetunes on labeled data.
  7. Using a large ratio between unlabeled batch size and labeled batch size enables models to train longer on unlabeled data to achieve a higher accuracy.
  8. Training the student from scratch is sometimes better than initializing the student with the teacher, and a student initialized with the teacher still requires a large number of training epochs to perform well.


