Review — Billion-Scale Semi-Supervised Learning for Image Classification

Teacher Student Model for Semi-Supervised Learning Using 1 Billion Unlabeled Data

Sik-Ho Tsang
5 min read · Jan 22, 2022
Semi-Supervised Learning Procedures

Billion-Scale Semi-Supervised Learning for Image Classification
Billion-Scale, by Facebook AI
2019 arXiv, Over 200 Citations (Sik-Ho Tsang @ Medium)
Teacher Student Model, Semi-Supervised Learning, Image Classification, Video Classification

  • A semi-supervised learning approach based on the teacher/student paradigm is proposed, which leverages a large collection of unlabelled images (up to 1 billion). With this approach, a vanilla ResNet-50 achieves 81.2% top-1 accuracy on the ImageNet benchmark.
  • A list of recommendations is suggested for large-scale semi-supervised learning.


  1. A List of Recommendations for Semi-Supervised Learning
  2. Proposed Teacher/Student Paradigm for Semi-Supervised Learning
  3. Experimental Results

1. A List of Recommendations for Semi-Supervised Learning

  1. Train with a teacher/student paradigm: It produces a better model for a fixed complexity, even if the student and teacher have the same architecture.
  2. Fine-tune the model with true labels only.
  3. A large-scale unlabelled dataset is key to performance.
  4. Use a large number of pre-training iterations (unlike the vanilla supervised setting, where the commonly used number of epochs is sufficient).
  5. Build a balanced distribution for inferred labels.
  6. Pre-training the (high-capacity) teacher model by weak supervision (tags) further improves the results.
  • The above recommendations are made based on the findings in this paper.

2. Proposed Teacher/Student Paradigm for Semi-Supervised Learning

2.1. Procedures

Proposed Teacher/Student Paradigm for Semi-Supervised Learning
  • The semi-supervised strategy is depicted in the above figure, also in the animated GIF at the top.
  1. Train on the labeled data to get an initial teacher model.
  2. For each class/label, the predictions of this teacher model are used to rank the unlabeled images, and the top-K images are picked to construct a new training set.
  3. For each image, only the classes associated with the P highest scores are retained.
  4. This new training set is used to train a student model, which typically differs from the teacher model in that its complexity at test time is smaller.
  5. Finally, the pre-trained student model is fine-tuned on the initial labeled data to circumvent potential labeling errors.
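
The selection in steps 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `select_pseudo_labeled`, the dictionary layout, and the toy scores are assumptions made for the example, and it uses plain Python rather than the distributed pipeline a billion-image collection would need.

```python
import heapq
from collections import defaultdict


def select_pseudo_labeled(scores, P, K):
    """Build a pseudo-labeled set from teacher scores (hypothetical helper).

    scores: dict mapping image_id -> list of per-class teacher scores.
    P: number of highest-scoring classes retained per image.
    K: number of highest-scoring images retained per class.
    Returns a dict mapping class index -> list of image_ids, best first.
    """
    candidates = defaultdict(list)  # class index -> [(score, image_id), ...]
    for image_id, class_scores in scores.items():
        # Step 3: keep only the P highest-scoring classes for this image.
        top_classes = heapq.nlargest(
            P, range(len(class_scores)), key=class_scores.__getitem__
        )
        for c in top_classes:
            candidates[c].append((class_scores[c], image_id))

    selected = {}
    for c, scored in candidates.items():
        # Step 2: rank the images for class c and keep the top K.
        top = heapq.nlargest(K, scored)
        selected[c] = [image_id for _, image_id in top]
    return selected


# Toy teacher outputs for 4 unlabeled images and 3 classes (made-up numbers).
teacher_scores = {
    "img_a": [0.70, 0.20, 0.10],
    "img_b": [0.10, 0.60, 0.30],
    "img_c": [0.50, 0.05, 0.45],
    "img_d": [0.25, 0.35, 0.40],
}
pseudo = select_pseudo_labeled(teacher_scores, P=2, K=2)
```

With P = 2 and K = 2, `pseudo` maps each class to its two best-ranked images. Because P > 1, the same image can be selected for more than one class, and because every class receives at most K images, the resulting label distribution is balanced by construction.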

2.2. Some Practical Details

  • Unlabeled dataset U: 1) YFCC-100M [38] is a publicly available dataset of about 90 million images from Flickr. 2) IG-1B-Targeted: Following [27], a dataset of 1B public images with associated hashtags is collected from a social media website.
  • Labeled dataset D: The standard ImageNet with 1000 classes.
  • Models: ResNet and ResNeXt.
  • Training: 64 GPUs across 8 machines. Each GPU processes 24 images at a time. Batch normalization is applied to each conv layer on each GPU. Thus, the overall minibatch size is 64×24 = 1536.

3. Experimental Results

3.1. YFCC-100M as Unlabelled Dataset

ImageNet1k-val top-1 accuracy for student models
  • The teacher model brings a significant improvement (1.6-2.6%) over the supervised baseline for target models of various capacities.
  • Fine-tuning the model on clean labeled data is crucial to achieve good performance.
Varying the teacher capacity for training a ResNet-50 student model with our approach.
  • The accuracy of the ResNet-50 student model improves as the strength of the teacher model increases, up to ResNeXt-101 32×16. Increasing the teacher model capacity further has no effect on the performance of the student model.
  • Interestingly, even when both the teacher and student models are ResNet-50, an improvement of around 1% over the supervised baseline is obtained.
Self-training: top-1 accuracy of ResNet and ResNeXt models self-trained on the YFCC dataset.
  • Self-training is the case where the teacher and student models have the same architecture and capacity.
  • Higher capacity models have relatively higher accuracy gains.
Left: Size of the unlabeled dataset U. Middle: Effect of number of training iterations. Right: Sampling hyperparameter K.
  • Left: A fixed accuracy improvement is achieved every time the dataset size is doubled until reaching the dataset size of 25M.
  • Middle: The performance keeps on improving as the number of processed images is increased.
  • Right: The performance first improves as the value of K is increased to 8k, due to the increase in diversity as well as hardness of examples. It is stable in a broad 4k-32k regime. Increasing K further introduces a lot of labeling noise in the new training set D̂, and the accuracy drops.
Examples of images from YFCC100M collected by our procedure for the classes “Tiger shark”, “Leaf beetle” and “American black bear” for a few ranks.
  • The images at the top of the ranking are simple and clean, without much labelling noise.
  • They become progressively less obvious positives further down the ranking.

3.2. IG-1B-Targeted for Semi-Weakly-Supervised Learning

Semi-Weakly-Supervised Learning Procedures
State of the art on ImageNet with standard architectures (ResNet, ResNeXt).
  • With hashtags as labels, weakly supervised learning is performed.
  • Leveraging weakly-supervised data to pre-train the teacher model significantly improves the results.

3.3. Video Classification

Accuracy on Kinetics video dataset for different approaches using R(2+1)D models
  • The popular multi-class Kinetics video benchmark is used, which has 246k training videos and 400 human action labels. The models are evaluated on the 20k validation videos.
  • Similar to IG-1B-Targeted, an IG-Kinetics dataset of 65 million videos is constructed by leveraging 359 hashtags that are semantically relevant to the Kinetics label space.
  • The teacher is a weakly-supervised R(2+1)D-34 model with clip length 32. It is pre-trained on the IG-Kinetics dataset and fine-tuned on labeled Kinetics videos.
  • 10 clips are uniformly sampled from each video and the softmax predictions are averaged to produce video level predictions.
  • For the proposed approach, K = 4k and P = 4 are used, with IG-Kinetics as the unlabeled data U to train student models.
  • The proposed approach gives significant improvements over fully-supervised training. Further gains are observed over the competitive weakly-supervised pre-training approach, with models having lower FLOPs benefiting the most.
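
The clip-averaging step described in the list above is simple enough to sketch. The snippet below is only an illustration under stated assumptions: the function names and the toy logits are made up, there are 3 classes instead of 400, and the logits stand in for the outputs of a real video model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def video_prediction(clip_logits):
    """Average per-clip softmax outputs into one video-level distribution."""
    clip_probs = [softmax(logits) for logits in clip_logits]
    num_classes = len(clip_probs[0])
    return [
        sum(p[c] for p in clip_probs) / len(clip_probs)
        for c in range(num_classes)
    ]


# Toy logits for 10 clips of one video and 3 classes (made-up numbers):
# 6 clips favour class 0 and 4 clips favour class 1.
clip_logits = [[1.0, 0.0, -1.0]] * 6 + [[0.0, 1.0, -1.0]] * 4
video_probs = video_prediction(clip_logits)
predicted_class = max(range(len(video_probs)), key=video_probs.__getitem__)
```

Averaging probabilities rather than picking the single most confident clip lets every sampled clip contribute to the video-level score.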

3.4. Transfer Learning

CUB2011: Transfer learning accuracy (ResNet-50).
  • Two transfer learning settings are investigated: (1) full-ft, which fine-tunes the full network, and (2) fc-only, which extracts features from the final fc layer and trains a logistic regressor on them.
  • Results are particularly impressive for the fc-only setting, where the proposed semi-weakly supervised model outperforms the highly competitive weakly-supervised model by 6.7%.
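
To make the fc-only setting concrete, here is a minimal sketch of training a logistic regressor on frozen features. It is a simplification under stated assumptions, not the paper's setup: the problem is binary rather than the many-class CUB2011 task, the 2-D feature vectors are made-up stand-ins for features extracted from a pre-trained network, and `train_logistic_regressor` and `predict` are hypothetical helper names.

```python
import math


def train_logistic_regressor(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression by gradient descent on fixed features."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Only the regressor's weights are updated; the features never change.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Made-up frozen features for 4 images and their binary labels.
features = [[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]]
labels = [1, 1, 0, 0]
w, b = train_logistic_regressor(features, labels)
preds = [predict(w, b, x) for x in features]
```

Since the backbone is never updated in this setting, accuracy depends entirely on how linearly separable the pre-trained features already are, which is why fc-only is a direct test of feature quality.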


