Brief Review — Rethinking Pre-training and Self-training
Self-Training, A Kind of Semi-Supervised Learning, is Revisited
Rethinking Pre-training and Self-training,
Self-Training, by Google Research, Brain Team
2020 NeurIPS, Over 300 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Self-Training, Pseudo Label, Semi-Supervised Learning
- This paper reveals the generality and flexibility of self-training with three additional insights:
- Stronger data augmentation and more labeled data further diminish the value of pre-training,
- Unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and
- In the case that pre-training is helpful, self-training improves upon pre-training.
Outline
- Data Augmentation, Pre-Training, and Self-Training
- Results
1. Data Augmentation, Pre-Training, and Self-Training
1.1. Data Augmentation
- Augmentation policies based on the standard flip-and-crop augmentation in RetinaNet, on AutoAugment, and on RandAugment are used.
- Scale jittering is increased to (0.5, 2.0) in AutoAugment and RandAugment, which is found to significantly improve performance.
- For RandAugment, a magnitude of 10 is used for all models.
- Finally, four data augmentation policies are used for experimentation: FlipCrop, AutoAugment, AutoAugment with higher scale jittering, and RandAugment with higher scale jittering, denoted Augment-S1, Augment-S2, Augment-S3, and Augment-S4, respectively.
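As a rough illustration of the higher scale jittering used in Augment-S3/S4, the sketch below resizes an image by a random factor in [0.5, 2.0] and then crops or pads it back to the training resolution. The function name and output size are assumptions; a real detection pipeline would also transform the ground-truth boxes and apply the AutoAugment/RandAugment policy on top.

```python
import tensorflow as tf

def random_scale_jitter(image, out_size=640, min_scale=0.5, max_scale=2.0):
    """Minimal sketch of scale jittering in [0.5, 2.0] plus a random flip.

    Illustrative only: it ignores the box/mask transforms a detection
    pipeline needs, and AutoAugment/RandAugment would be applied on top.
    """
    # Sample a scale factor and resize the image by it.
    scale = tf.random.uniform([], min_scale, max_scale)
    new_size = tf.cast(scale * out_size, tf.int32)
    image = tf.image.resize(image, tf.stack([new_size, new_size]))
    # Crop (if enlarged) or zero-pad (if shrunk) back to the fixed resolution.
    image = tf.image.resize_with_crop_or_pad(image, out_size, out_size)
    return tf.image.random_flip_left_right(image)
```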
1.2. Pre-Training
- ImageNet pre-training is studied. EfficientNet-B7 is used as the backbone. Three initializations are compared:
- Rand Init: Training from a random initialization.
- ImageNet: EfficientNet-B7 checkpoint trained with AutoAugment that achieves 84.5% top-1 accuracy on ImageNet.
- ImageNet++: EfficientNet-B7 checkpoint trained with the Noisy Student method, which utilizes an additional 300M unlabeled images, that achieves 86.9% top-1 accuracy.
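For intuition, the snippet below contrasts random initialization with an ImageNet-pre-trained EfficientNet-B7 backbone using the stock Keras weights. Note the Keras "imagenet" weights are only a stand-in here; they are not the paper's AutoAugment (ImageNet) or Noisy Student (ImageNet++) checkpoints.

```python
import tensorflow as tf

# Rand Init: backbone weights come from a random initializer.
rand_init_backbone = tf.keras.applications.EfficientNetB7(
    include_top=False, weights=None)

# ImageNet-style pre-training: backbone weights loaded from a pre-trained
# checkpoint. The stock Keras "imagenet" weights are a stand-in; the paper's
# ImageNet / ImageNet++ checkpoints (AutoAugment / Noisy Student) are
# separate, stronger checkpoints.
pretrained_backbone = tf.keras.applications.EfficientNetB7(
    include_top=False, weights="imagenet")
```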
1.3. Self-Training
- First, a teacher model is trained on the labeled data (e.g., COCO dataset).
- Then, the teacher model generates pseudo labels on unlabeled data (e.g., ImageNet dataset).
- Finally, a student is trained to optimize the loss on human labels and pseudo labels jointly.
- The standard combined loss, Lh + α·Lp, is unstable, as the total loss magnitude changes drastically when α is varied.
- Loss Normalization is proposed to stabilize self-training: L = (Lh + α·(bar(Lh)/bar(Lp))·Lp) / (1+α),
- where Lh, Lp, bar(Lh) and bar(Lp) are the human loss, the pseudo loss, and their respective moving averages over training.
- Loss Normalization gets better results in almost all settings, and more importantly, helps avoid training instability when α is large.
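A minimal sketch of this recipe is below, with toy Keras classifiers standing in for the detector. The architecture, optimizer, moving-average decay, and tensor shapes are assumptions for illustration, not the paper's implementation; only the three-step procedure and the normalized loss follow the description above.

```python
import tensorflow as tf

def make_model(num_classes=10):
    # Toy classifier standing in for the detector backbone + head.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes),
    ])

ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# 1) Train a teacher on the labeled data (COCO in the paper; toy tensors here).
labeled_x = tf.random.normal([64, 32, 32, 3])
labeled_y = tf.random.uniform([64], maxval=10, dtype=tf.int32)
teacher = make_model()
teacher.compile(optimizer="adam", loss=ce)
teacher.fit(labeled_x, labeled_y, epochs=1, verbose=0)

# 2) The teacher generates pseudo labels on unlabeled data (ImageNet in the paper).
unlabeled_x = tf.random.normal([64, 32, 32, 3])
pseudo_y = tf.argmax(teacher.predict(unlabeled_x, verbose=0), axis=-1)

# 3) Train a student jointly on human and pseudo labels with the normalized
#    loss L = (Lh + alpha * bar(Lh)/bar(Lp) * Lp) / (1 + alpha).
alpha, decay = 1.0, 0.99
human_avg, pseudo_avg = 1.0, 1.0          # bar(Lh), bar(Lp): moving averages
student = make_model()
optimizer = tf.keras.optimizers.Adam()

for step in range(100):
    with tf.GradientTape() as tape:
        loss_h = ce(labeled_y, student(labeled_x, training=True))    # Lh
        loss_p = ce(pseudo_y, student(unlabeled_x, training=True))   # Lp
        loss = (loss_h + alpha * (human_avg / pseudo_avg) * loss_p) / (1.0 + alpha)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    # Update the moving averages of each loss term over training.
    human_avg = decay * human_avg + (1 - decay) * float(loss_h)
    pseudo_avg = decay * pseudo_avg + (1 - decay) * float(loss_p)
```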
2. Results
2.1. The effects of augmentation and labeled dataset size on pre-training
Pre-training hurts performance when stronger data augmentation is used.
More labeled data diminishes the value of pre-training.
2.2. The effects of augmentation and labeled dataset size on self-training
Self-training helps in high data/strong augmentation regimes, even when pre-training hurts.
Self-training works across dataset sizes and is additive to pre-training.
2.3. Self-supervised pre-training also hurts when self-training helps in high data/strong augmentation regimes
The self-supervised (SimCLR) pre-trained checkpoint hurts performance on COCO just as much as the supervised ImageNet pre-trained checkpoint does.
2.4. Exploring the limits of self-training and pre-training
COCO Object Detection: For the largest SpineNet model, self-training improves upon the best 52.8 AP SpineNet model by +1.5 AP to achieve 54.3 AP. Across all model variants, self-training obtains at least a +1.5 AP gain.
PASCAL VOC Semantic Segmentation: Self-training improves the state of the art by a large margin.
In some cases, the pseudo labels are even more accurate than the noisy human labels.
2.5. The benefit of joint-training
Joint training, where ImageNet classification is trained together with COCO object detection, further improves performance.
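A rough sketch of the idea, assuming a shared backbone feeding a classification head and a highly simplified stand-in for the detection head (the real detector head and losses are far more involved):

```python
import tensorflow as tf

# Shared backbone; both tasks read the same features.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights=None, pooling="avg")

images = tf.keras.Input(shape=(224, 224, 3))
features = backbone(images)

# ImageNet classification head (1000 classes).
cls_logits = tf.keras.layers.Dense(1000, name="imagenet_cls")(features)
# Highly simplified stand-in for a COCO detection head.
det_outputs = tf.keras.layers.Dense(256, name="coco_det")(features)

model = tf.keras.Model(images, [cls_logits, det_outputs])
# During joint training, the total loss is a weighted sum of the
# classification loss and the detection loss computed from these outputs.
```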
2.6. The importance of task alignment
- In [45], pre-training on Open Images hurts performance on COCO.
- Self-training, on the other hand, is very general and can use Open Images successfully to improve COCO performance.
Self-training, as a kind of semi-supervised learning, is revisited in detail.
Reference
[2020 NeurIPS] [Zoph NeurIPS’20]
Rethinking Pre-training and Self-training
1.1. Image Classification
1989 … 2020 [Self-Training] … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP]
1.3. Pretraining or Weakly/Semi-Supervised Learning
2004 … 2020 [Self-Training] … 2021 [Curriculum Labeling (CL)] [Su CVPR’21] [Exemplar-v1, Exemplar-v2] [SimPLE] [BYOL+LP]