Brief Review — Rethinking Pre-training and Self-training

Self-Training, A Kind of Semi-Supervised Learning, is Revisited

Sik-Ho Tsang
5 min read · Oct 20, 2022


Rethinking Pre-training and Self-training, by Google Research, Brain Team
2020 NeurIPS, Over 300 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Self-Training, Pseudo Label, Semi-Supervised Learning

  • This paper reveals the generality and flexibility of self-training with three additional insights:
  1. Stronger data augmentation and more labeled data further diminish the value of pre-training,
  2. Unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and
  3. In the case that pre-training is helpful, self-training improves upon pre-training.


  1. Data Augmentation, Pre-Training, and Self-Training
  2. Results

1. Data Augmentation, Pre-Training, and Self-Training

Notations for data augmentations and pre-trained models used throughout this work

1.1. Data Augmentation

Four data augmentation policies are used for experimentation: FlipCrop (Augment-S1), AutoAugment (Augment-S2), AutoAugment with higher scale jittering (Augment-S3), and RandAugment with higher scale jittering (Augment-S4).

1.2. Pre-Training

  • ImageNet pre-training is studied. EfficientNet-B7 is the backbone.
  • Rand Init: Training from a random initialization.
  • ImageNet: EfficientNet-B7 checkpoint trained with AutoAugment that achieves 84.5% top-1 accuracy on ImageNet.
  • ImageNet++: EfficientNet-B7 checkpoint trained with the Noisy Student method, which uses an additional 300M unlabeled images and achieves 86.9% top-1 accuracy.

1.3. Self-Training

  • First, a teacher model is trained on the labeled data (e.g., COCO dataset).
  • Then, the teacher model generates pseudo labels on unlabeled data (e.g., ImageNet dataset).
  • Finally, a student is trained to optimize the loss on human labels and pseudo labels jointly.
  • The standard loss, L = Lh + α·Lp, is unstable,
  • because the total loss magnitude changes drastically as α is varied.
  • Loss Normalization is proposed to stabilize self-training: hat(L) = (Lh + α·(bar(Lh)/bar(Lp))·Lp) / (1 + α),
  • where Lh, Lp, bar(Lh) and bar(Lp) are the human loss, the pseudo loss and their respective moving averages over training, and α weights the pseudo loss.
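As a minimal sketch of Loss Normalization, assume the standard loss is L = Lh + α·Lp and the normalized loss is hat(L) = (Lh + α·(bar(Lh)/bar(Lp))·Lp) / (1 + α). The names MovingAverage, standard_loss and normalized_loss are hypothetical, and the exponential moving average with its decay value and the small eps term are assumptions, since the review only says "moving averages over training".

```python
class MovingAverage:
    """Exponential moving average of a scalar loss (assumed form and decay)."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.value = None

    def update(self, x):
        # Seed with the first observation, then blend in later values.
        if self.value is None:
            self.value = x
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * x
        return self.value


def standard_loss(l_h, l_p, alpha):
    """Unnormalized total loss: its magnitude grows with alpha."""
    return l_h + alpha * l_p


def normalized_loss(l_h, l_p, avg_h, avg_p, alpha, eps=1e-8):
    """Rescale the pseudo loss to the human-loss scale, then divide by 1 + alpha."""
    scale = avg_h / (avg_p + eps)  # eps guards against division by zero (assumption)
    return (l_h + alpha * scale * l_p) / (1.0 + alpha)


if __name__ == "__main__":
    l_h, l_p = 2.0, 0.5  # illustrative loss values, not from the paper
    avg_h, avg_p = MovingAverage(), MovingAverage()
    avg_h.update(l_h)
    avg_p.update(l_p)
    for alpha in (0.25, 1.0, 4.0):
        print(alpha,
              standard_loss(l_h, l_p, alpha),
              normalized_loss(l_h, l_p, avg_h.value, avg_p.value, alpha))
```

When the moving averages track the current losses, the normalized total stays close to Lh for any α, whereas the standard total scales with α. This matches the stability argument made above.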
Performance of Loss Normalization across different data augmentation strengths, training iterations and learning rates
  • Loss Normalization gets better results in almost all settings, and more importantly, helps avoid training instability when α is large.
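The three self-training steps listed at the start of this subsection can be sketched on a toy problem. This is only an illustration: a 1-D nearest-centroid classifier stands in for the real teacher and student networks, the helper names fit_centroids, predict and self_train are hypothetical, and the student here fits human and pseudo labels with equal weight rather than through the α-weighted loss.

```python
def fit_centroids(xs, ys):
    """Return the mean feature value per class (a stand-in for training a model)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def self_train(labeled_x, labeled_y, unlabeled_x):
    # Step 1: train the teacher on human-labeled data only.
    teacher = fit_centroids(labeled_x, labeled_y)
    # Step 2: the teacher generates pseudo labels for the unlabeled data.
    pseudo_y = [predict(teacher, x) for x in unlabeled_x]
    # Step 3: train the student on human labels and pseudo labels jointly.
    student = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo_y)
    return teacher, student, pseudo_y


if __name__ == "__main__":
    labeled_x, labeled_y = [0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1]
    unlabeled_x = [2.0, 3.0, 7.0, 8.0]
    teacher, student, pseudo_y = self_train(labeled_x, labeled_y, unlabeled_x)
    print("pseudo labels:", pseudo_y)
    print("teacher:", teacher)
    print("student:", student)
```

In the paper's setting, the labeled set would be, for example, COCO, the unlabeled set ImageNet, and the models detection or segmentation networks. The toy only shows the data flow between teacher and student.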

2. Results

2.1. The effects of augmentation and labeled dataset size on pre-training

The effects of data augmentation and dataset size on pre-training

Left: Pre-training hurts performance when stronger data augmentation is used.

Right: More labeled data diminishes the value of pre-training.

2.2. The effects of augmentation and labeled dataset size on self-training

In regimes where pre-training hurts, self-training with the same data source helps

Self-training helps in high data/strong augmentation regimes, even when pre-training hurts.

Self-training improves performance for all model initializations across all labeled dataset sizes

Self-training works across dataset sizes and is additive to pre-training.

2.3. Self-supervised pre-training also hurts when self-training helps in high data/strong augmentation regimes

Self-supervised pre-training (SimCLR) hurts performance on COCO just like standard supervised pre-training

The self-supervised pre-trained checkpoint hurts performance just as much as supervised pre-training on the COCO dataset.

2.4. Exploring the limits of self-training and pre-training

Comparison with the strong models on COCO object detection

COCO Object Detection: For the largest SpineNet model, self-training improves upon the best 52.8 AP SpineNet model by +1.5 AP to achieve 54.3 AP. Across all model variants, self-training obtains at least a +1.5 AP gain.

Comparison with state-of-the-art models on PASCAL VOC 2012 val/test set

PASCAL VOC Semantic Segmentation: Self-training improves state-of-the-art by a large margin.

Human labels and pseudo labels on examples selected from PASCAL aug dataset

Pseudo labels are more accurate than noisy human labels.

2.5. The benefit of joint-training

Comparison of pre-training, self-training and joint-training on COCO

Joint training, in which ImageNet classification is trained jointly with COCO object detection, improves performance.

2.6. The importance of task alignment

Performance on PASCAL VOC 2012 using train or train and aug for the labeled data
  • In [45], pre-training on Open Images was found to hurt COCO performance.

Self-training on the other hand is very general and can use Open Images successfully to improve COCO performance.

Self-training, as a kind of semi-supervised learning, is revisited in detail.


[2020 NeurIPS] [Zoph NeurIPS’20]
Rethinking Pre-training and Self-training



