Review — Billion-Scale Semi-Supervised Learning for Image Classification

Teacher/Student Model for Semi-Supervised Learning Using 1 Billion Unlabeled Images

Sik-Ho Tsang
5 min read · Jan 22, 2022
Semi-Supervised Learning Procedures

Billion-Scale Semi-Supervised Learning for Image Classification
Billion-Scale, by Facebook AI
2019 arXiv, Over 200 Citations (Sik-Ho Tsang @ Medium)
Teacher Student Model, Semi-Supervised Learning, Image Classification, Video Classification

  • A semi-supervised learning approach based on the teacher/student paradigm is proposed, which leverages a large collection of unlabeled images (up to 1 billion). With this approach, a vanilla ResNet-50 achieves 81.2% top-1 accuracy on the ImageNet benchmark.
  • A list of recommendations is suggested for large-scale semi-supervised learning.

Outline

  1. A List of Recommendations for Semi-Supervised Learning
  2. Proposed Teacher/Student Paradigm for Semi-Supervised Learning
  3. Experimental Results

1. A List of Recommendations for Semi-Supervised Learning

  1. Train with a teacher/student paradigm: It produces a better model for a fixed complexity, even if the student and teacher have the same architecture.
  2. Fine-tune the model with true labels only.
  3. A large-scale unlabeled dataset is key to performance.
  4. Use a large number of pre-training iterations (as opposed to a vanilla supervised setting, where the number of epochs commonly used in practice is sufficient).
  5. Build a balanced distribution for inferred labels.
  6. Pre-training the (high-capacity) teacher model by weak supervision (tags) further improves the results.
  • The above recommendations are made based on the findings in this paper.

2. Proposed Teacher/Student Paradigm for Semi-Supervised Learning

2.1. Procedures

Proposed Teacher/Student Paradigm for Semi-Supervised Learning
  • The semi-supervised strategy is depicted in the figure above, and also in the animated GIF at the top.
  1. Train on the labeled data to get an initial teacher model.
  2. For each class/label, the predictions of this teacher model are used to rank the unlabeled images and pick the top-K images to construct a new training set (see the sketch after this list).
  3. For each image, only the classes associated with the P highest scores are retained.
  4. This new training data is used to train a student model, which typically differs from the teacher model so that the complexity at test time is smaller.
  5. Finally, the pre-trained student model is fine-tuned on the initial labeled data to circumvent potential labeling errors.
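
As an illustration of steps 2-3, here is a minimal PyTorch sketch of the label-inference and selection stage. All names (`teacher`, `unlabeled_loader`, `infer_labels`) are hypothetical stand-ins, not the authors' code; it assumes the loader yields `(image_ids, images)` batches.

```python
import torch

@torch.no_grad()
def infer_labels(teacher, unlabeled_loader, num_classes, K, P):
    """Build a new training set of (image_id, class) pairs from unlabeled data."""
    teacher.eval()
    # For each class, collect (score, image_id) candidates.
    per_class = [[] for _ in range(num_classes)]
    for image_ids, images in unlabeled_loader:
        probs = torch.softmax(teacher(images), dim=1)   # (B, num_classes)
        # Step 3: per image, retain only the P highest-scoring classes.
        scores, classes = probs.topk(P, dim=1)
        for img_id, s_row, c_row in zip(image_ids, scores, classes):
            for s, c in zip(s_row.tolist(), c_row.tolist()):
                per_class[c].append((s, img_id))
    # Step 2: per class, rank candidates by score and keep the top-K.
    dataset = []
    for cls, scored in enumerate(per_class):
        scored.sort(key=lambda t: t[0], reverse=True)
        dataset += [(img_id, cls) for _, img_id in scored[:K]]
    return dataset
```

Taking the top-K images per class, rather than thresholding scores per image, is what yields the roughly class-balanced label distribution recommended above.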

2.2. Some Practical Details

  • Unlabeled dataset U: 1) YFCC-100M [38]: a publicly available dataset of about 90 million images from Flickr. 2) IG-1B-Targeted: following [27], a dataset of 1B public images with associated hashtags from a social media website.
  • Labeled dataset D: The standard ImageNet with 1000 classes.
  • Models: ResNet and ResNeXt.
  • Training: 64 GPUs across 8 machines, with each GPU processing 24 images at a time. Batch normalization is applied to each conv layer, with statistics computed per GPU. The overall minibatch size is thus 64×24 = 1536 (see the sketch below).
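
For concreteness, a minimal sketch of such a setup with PyTorch DistributedDataParallel follows; the paper does not specify these exact APIs, so this is an assumption-laden illustration, launched with one process per GPU (e.g. `torchrun --nproc_per_node=8` on each of the 8 machines).

```python
import os
import torch
import torch.distributed as dist
import torchvision

# One process per GPU; torchrun sets LOCAL_RANK and the rendezvous env vars.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torchvision.models.resnet50().cuda()
# Plain (unsynchronized) BatchNorm: statistics are computed per GPU over its
# 24 local images, while the effective global minibatch is 64 GPUs x 24 = 1536.
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# Each GPU's DataLoader would use batch_size=24 with a DistributedSampler.
```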

3. Experimental Results

3.1. YFCC-100M as Unlabeled Dataset

ImageNet1k-val top-1 accuracy for student models
  • The teacher model brings a significant improvement (1.6-2.6%) over the supervised baseline for target models of various capacities.
  • Fine-tuning the model on clean labeled data is crucial to achieve good performance.
Varying the teacher capacity for training a ResNet-50 student model with our approach.
  • The accuracy of the ResNet-50 student model improves as the teacher model gets stronger, up to ResNeXt-101 32×16. Increasing the teacher capacity further has no effect on the student's performance.
  • Interestingly, even when both the teacher and student models are ResNet-50, an improvement of around 1% over the supervised baseline is obtained.
Self-training: top-1 accuracy of ResNet and ResNeXt models self-trained on the YFCC dataset.
  • Self-training refers to the case where the teacher and student models have the same architecture and capacity.
  • Higher-capacity models show relatively larger accuracy gains.
Left: Size of the unlabeled dataset U. Middle: Effect of number of training iterations. Right: Sampling hyperparameter K.
  • Left: A fixed accuracy improvement is achieved each time the dataset size is doubled, up to a dataset size of 25M.
  • Middle: The performance keeps on improving as the number of processed images is increased.
  • Right: The performance first improves as K is increased up to 8k, due to the increased diversity and hardness of the examples. It is stable in a broad 4k-32k regime. Increasing K further introduces a lot of labeling noise into the inferred label set D̂, and the accuracy drops.
Examples of images from YFCC100M collected by our procedure for the classes “Tiger shark”, “Leaf beetle” and “American black bear” for a few ranks.
  • The images at the top of the ranking are simple and clean, without much labeling noise.
  • They become progressively less obvious positives as we go down in the ranking.

3.2. IG-1B-Targeted for Semi-Weakly-Supervised Learning

Semi-Weakly-Supervised Learning Procedures
State of the art on ImageNet with standard architectures (ResNet, ResNeXt).
  • With hashtags as labels, weakly supervised learning is performed.
  • Leveraging weakly-supervised data to pre-train the teacher model significantly improves the results.

3.3. Video Classification

Accuracy on Kinetics video dataset for different approaches using R(2+1)D models
  • Experiments are conducted on the popular multi-class Kinetics video benchmark, which has 246k training videos and 400 human action labels. The models are evaluated on the 20k validation videos.
  • Similar to IG-1B-Targeted, an IG-Kinetics of 65 million videos is constructed by leveraging 359 hashtags that are semantically relevant to Kinetics label space.
  • The teacher is a weakly-supervised R(2+1)D-34 model with clip length 32. It is pre-trained with IG-Kinetics dataset and fine-tuned with labeled Kinetics videos.
  • 10 clips are uniformly sampled from each video and their softmax predictions are averaged to produce video-level predictions (sketched below).
  • For the proposed approach, K = 4k and P = 4 are used, with IG-Kinetics as the unlabeled data U to train the student models.
  • The proposed approach gives significant improvements over fully-supervised training. Further gains are observed over the competitive weakly-supervised pre-training approach, with models having lower FLOPs benefiting the most.
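
A rough sketch of this clip-averaging inference, where `model` is a hypothetical stand-in for the trained R(2+1)D network and `clips` holds the 10 uniformly sampled clips of one video:

```python
import torch

@torch.no_grad()
def predict_video(model, clips):
    """clips: (10, 3, T, H, W) tensor of sampled clips from one video."""
    probs = torch.softmax(model(clips), dim=1)  # per-clip class probabilities
    return probs.mean(dim=0)                    # averaged video-level prediction
```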

3.4. Transfer Learning

CUB2011: Transfer learning accuracy (ResNet-50).
  • Two transfer learning settings are investigated: (1) full-ft, which fine-tunes the full network, and (2) fc-only, which extracts features from the final fc layer and trains a logistic regressor on them (see the sketch after this list).
  • Results are particularly impressive in the fc-only setting, where the proposed semi-weakly supervised model outperforms the highly competitive weakly-supervised model by 6.7%.
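
A minimal sketch of the fc-only setting, using a standard torchvision ResNet-50 (recent torchvision, weights API) as a stand-in for the paper's pre-trained backbone, and hypothetical `train_loader`/`val_loader` over the target dataset (e.g. CUB2011):

```python
import numpy as np
import torch
import torchvision
from sklearn.linear_model import LogisticRegression

# Freeze a pre-trained backbone and use it purely as a feature extractor.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()  # drop the classifier; outputs 2048-d features
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images).cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Fit a logistic regressor on frozen features, then evaluate.
X_train, y_train = extract_features(train_loader)
X_val, y_val = extract_features(val_loader)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("fc-only top-1:", clf.score(X_val, y_val))
```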
