Review — Billion-Scale Semi-Supervised Learning for Image Classification
Teacher Student Model for Semi-Supervised Learning Using 1 Billion Unlabeled Data
5 min read · Jan 22, 2022
Billion-Scale Semi-Supervised Learning for Image Classification
Billion-Scale, by Facebook AI
2019 arXiv, Over 200 Citations (Sik-Ho Tsang @ Medium)
Teacher Student Model, Semi-Supervised Learning, Image Classification, Video Classification
- A semi-supervised learning approach based on the teacher/student paradigm is proposed, which leverages a large collection of unlabeled images (up to 1 billion). With this semi-supervised learning, a vanilla ResNet-50 achieves 81.2% top-1 accuracy on the ImageNet benchmark.
- A list of recommendations is suggested for large-scale semi-supervised learning.
Outline
- A List of Recommendations for Semi-Supervised Learning
- Proposed Teacher/Student Paradigm for Semi-Supervised Learning
- Experimental Results
1. A List of Recommendations for Semi-Supervised Learning
- Train with a teacher/student paradigm: It produces a better model for a fixed complexity, even if the student and teacher have the same architecture.
- Fine-tune the model with true labels only.
- A large-scale unlabeled dataset is key to performance.
- Use a large number of pre-training iterations (as opposed to the vanilla supervised setting, where the number of epochs used in common practice is sufficient).
- Build a balanced distribution for inferred labels.
- Pre-training the (high-capacity) teacher model by weak supervision (tags) further improves the results.
- The above recommendations are made based on the findings in this paper.
2. Proposed Teacher/Student Paradigm for Semi-Supervised Learning
2.1. Procedures
- The semi-supervised strategy, depicted in the figure above, proceeds as follows.
- Train on the labeled data to get an initial teacher model.
- For each class/label, the predictions of this teacher model are used to rank the unlabeled images and pick the top-K images to construct a new training dataset.
- For each image, only the classes associated with the P highest scores are retained.
- This new training data is used to train a student model, which typically differs from the teacher model: the complexity at test time is smaller.
- Finally, the pre-trained student model is fine-tuned on the initial labeled data to circumvent potential labeling errors. (A minimal sketch of the data-construction step is given below.)
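Below is a minimal sketch of the data-construction step, assuming a PyTorch-style teacher model; the loader name `unlabeled_loader` and the function name `build_student_dataset` are hypothetical. It keeps only the P highest-scoring classes per image, then ranks the images per class and keeps the top-K, which yields a balanced training set for the student.

```python
import torch

@torch.no_grad()
def build_student_dataset(teacher, unlabeled_loader, num_classes, K=16000, P=10, device="cuda"):
    """Rank unlabeled images per class by teacher score and keep the top-K per class.

    `unlabeled_loader` (hypothetical) yields (image_ids, image_tensor) batches;
    the teacher is assumed to already live on `device`.
    Returns a dict mapping class -> list of (score, image_id), each of length <= K.
    """
    teacher.eval()
    per_class = {c: [] for c in range(num_classes)}
    for image_ids, images in unlabeled_loader:
        probs = torch.softmax(teacher(images.to(device)), dim=1)   # (B, num_classes)
        top_scores, top_classes = probs.topk(P, dim=1)             # keep P highest-scoring classes per image
        for img_id, scores, classes in zip(image_ids, top_scores, top_classes):
            for s, c in zip(scores.tolist(), classes.tolist()):
                per_class[c].append((s, img_id))
    # For each class, rank by score and keep the top-K images -> balanced label distribution.
    for c in per_class:
        per_class[c] = sorted(per_class[c], key=lambda t: t[0], reverse=True)[:K]
    return per_class
```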
2.2. Some Practical Details
- Unlabeled dataset U: 1) YFCC-100M [38]: a publicly available dataset of about 90 million images from Flickr. 2) IG-1B-Targeted: following [27], a dataset of 1B public images with associated hashtags from a social media website.
- Labeled dataset D: The standard ImageNet with 1000 classes.
- Models: ResNet and ResNeXt.
- Training: 64 GPUs across 8 machines. Each GPU processes 24 images at a time. Batch normalization is used for each conv layer on each GPU. Thus, the overall minibatch size is 64×24 = 1536 (a rough sketch of this distributed setup is given below).
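As a rough illustration of this setup (not the authors' actual training code), the following sketch assumes PyTorch DistributedDataParallel launched with one process per GPU; the helper name `setup` is hypothetical.

```python
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

PER_GPU_BATCH = 24                         # each GPU processes 24 images at a time
WORLD_SIZE = 64                            # 8 machines x 8 GPUs
GLOBAL_BATCH = PER_GPU_BATCH * WORLD_SIZE  # 64 x 24 = 1536 images per SGD step

def setup(local_rank: int) -> DDP:
    # One process per GPU, launched e.g. with torchrun; NCCL backend for multi-node training.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    # Plain per-GPU BatchNorm (no SyncBatchNorm conversion), matching the description
    # that batch-norm statistics are computed on each GPU independently.
    model = torchvision.models.resnet50(num_classes=1000).cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```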
3. Experimental Results
3.1. YFCC-100M as Unlabeled Dataset
- The teacher model brings a significant improvement over the supervised baseline for target models of various capacities (1.6-2.6%).
- Fine-tuning the model on clean labeled data is crucial to achieve good performance.
- The accuracy of the ResNet-50 student model improves as the strength of the teacher model increases, up to ResNeXt-101 32×16. Increasing the teacher model capacity further has no effect on the performance of the student model.
- Interestingly, even in the case where both the teacher and student models are ResNet-50, an improvement of around 1% over the supervised baseline is obtained.
- This corresponds to self-training, i.e. when the teacher and student models have the same architecture and capacity.
- Higher capacity models have relatively higher accuracy gains.
- Left: A fixed accuracy improvement is achieved every time the dataset size is doubled, until reaching a dataset size of 25M.
- Middle: The performance keeps on improving as the number of processed images is increased.
- Right: The performance first improves as the value of K is increased up to 8k, due to the increase in the diversity and hardness of the examples. It is stable in a broad 4k-32k regime. Increasing K further introduces a lot of labeling noise into the constructed dataset D̂, and the accuracy drops.
- The images at the top of the ranking are simple and clean, without much labeling noise.
- They become progressively less obvious positives as we go down in the ranking.
3.2. IG-1B-Targeted for Semi-Weakly-Supervised Learning
- With hashtags as labels, weakly supervised learning is performed.
- Leveraging weakly-supervised data to pre-train the teacher model significantly improves the results.
3.3. Video Classification
- The popular multi-class Kinetics video benchmark is used, which has 246k training videos and 400 human action labels. The models are evaluated on the 20k validation videos.
- Similar to IG-1B-Targeted, an IG-Kinetics of 65 million videos is constructed by leveraging 359 hashtags that are semantically relevant to Kinetics label space.
- The teacher is a weakly-supervised R(2+1)D-34 model with clip length 32. It is pre-trained with IG-Kinetics dataset and fine-tuned with labeled Kinetics videos.
- 10 clips are uniformly sampled from each video and the softmax predictions are averaged to produce video-level predictions (see the sketch after this list).
- For the proposed approach, K = 4k and P = 4 are used, with IG-Kinetics as the unlabeled data U, to train the student models.
- The proposed approach gives significant improvements over fully-supervised training. Further gains are observed over the competitive weakly-supervised pre-training approach, with models having lower FLOPS benefiting the most.
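The clip-to-video averaging described above can be sketched as follows; the model here stands in for a clip-level classifier such as R(2+1)D, and the helper name `video_prediction` is hypothetical.

```python
import torch

@torch.no_grad()
def video_prediction(model, video, num_clips=10, clip_len=32):
    """Average clip-level softmax scores to obtain a video-level prediction.

    `video` is a (C, T, H, W) tensor; the model takes (N, C, clip_len, H, W) clips.
    """
    model.eval()
    total_frames = video.shape[1]
    # Uniformly spaced clip start frames across the whole video.
    starts = torch.linspace(0, max(total_frames - clip_len, 0), num_clips).long().tolist()
    clips = torch.stack([video[:, s:s + clip_len] for s in starts])  # (num_clips, C, clip_len, H, W)
    probs = torch.softmax(model(clips), dim=1)                       # clip-level class probabilities
    return probs.mean(dim=0)                                         # video-level prediction
```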
3.4. Transfer Learning
- Two transfer learning settings are investigated: (1) full-ft involves fine-tuning the full network, and (2) fc-only involves extracting features from the final fc layer and training a logistic regressor (a sketch follows this list).
- Results are particularly impressive for the fc-only setting, where the proposed semi-weakly supervised model outperforms the highly competitive weakly-supervised model by 6.7%.
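A minimal sketch of the fc-only setting, using a standard torchvision ResNet-50 as a stand-in for the released semi-weakly supervised models (an assumption, not the authors' code); features are read out just before the final fc layer and a scikit-learn logistic regression is trained on them, with `train_loader`/`test_loader` as hypothetical target-task DataLoaders.

```python
import torch
import torchvision
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    # Pooled features feeding the final fc layer (the fc itself is dropped).
    backbone = torch.nn.Sequential(*list(model.children())[:-1]).eval()
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images.to(device)).flatten(1).cpu())  # (B, 2048)
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Stand-in backbone; the released semi-weakly supervised weights would be loaded here instead.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").cuda()

# fc-only: the backbone stays frozen, only a linear (logistic-regression) classifier is trained.
# X_train, y_train = extract_features(model, train_loader)
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# X_test, y_test = extract_features(model, test_loader)
# print("fc-only accuracy:", clf.score(X_test, y_test))
```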
Reference
[2019 arXiv] [Billion-Scale]
Billion-Scale Semi-Supervised Learning for Image Classification
Semi-Supervised Learning
2017 [Mean Teacher] 2019 [Billion-Scale]