Review — WSL: Exploring the Limits of Weakly Supervised Pretraining
Obtaining Images Using Instagram Hashtags for Weakly Supervised Pretraining
5 min read · Jan 23, 2022
Exploring the Limits of Weakly Supervised Pretraining
WSL, by Facebook
2018 ECCV, Over 700 Citations (Sik-Ho Tsang @ Medium)
Weakly-Supervised Learning, Image Classification, Object Detection
- Datasets are expensive to collect and annotate.
- A unique study of transfer learning is conducted, with large convolutional networks trained to predict hashtags on billions of social media images.
Outline
- Scaling Up Supervised Pretraining
- Experimental Results
1. Scaling Up Supervised Pretraining
1.1. Pipeline
- A set of hashtags is selected.
- Images are downloaded that are tagged with at least one of these hashtags.
- Then, because multiple hashtags may refer to the same underlying concept, a simple process utilizing WordNet [20] synsets is applied to merge some hashtags into a single canonical form (e.g., #brownbear and #ursusarctos are merged), as in the sketch after this list.
- Finally, for each downloaded image, each hashtag is replaced with its canonical form and any hashtags that were not in the selected set are discarded. The canonical hashtags are used as labels for training and evaluation.
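Concretely, the canonicalization step can be pictured with WordNet synsets. Below is a minimal sketch, assuming NLTK with the WordNet corpus installed; the separator-free lemma matching is my own simplification, not the paper's exact procedure.

```python
# Minimal sketch of hashtag canonicalization via WordNet synsets.
# Requires: pip install nltk, then nltk.download('wordnet').
# The separator-free lemma matching is an assumption, not the paper's code.
from nltk.corpus import wordnet as wn

def build_canonical_map(hashtags):
    """Map each hashtag to a canonical form shared by all its synonyms."""
    # Index every noun-synset lemma by its separator-free, lowercased form,
    # since hashtags contain no underscores ('brown_bear' -> 'brownbear').
    lemma_index = {}
    for synset in wn.all_synsets(pos=wn.NOUN):
        for lemma in synset.lemmas():
            key = lemma.name().replace("_", "").lower()
            # Note: ambiguous lemmas keep the first synset seen in this sketch.
            lemma_index.setdefault(key, synset.name())
    # A hashtag maps to its synset's name; unmatched tags stay as-is.
    return {tag: lemma_index.get(tag.lower(), tag) for tag in hashtags}

print(build_canonical_map(["brownbear", "ursusarctos"]))
# Both map to the same canonical form, e.g. 'brown_bear.n.01'
```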
1.2. Datasets
- Each dataset is named with a template, role-source-I-L, that indicates its role (training, validation, testing), source, number of images I, and number of labels L (see the parsing sketch after this list).
- Three hashtag sets for the Instagram data:
- A ∼1.5k set with hashtags from the standard 1,000 IN-1k synsets (each synset contains at least one synonym, hence there are more hashtags than synsets).
- A ∼17k set with hashtags that are synonyms in any of the noun synsets in WordNet.
- An ∼8.5k set with the most frequent hashtags from the 17k set. The hashtag set sizes are measured after merging the hashtags into their canonical forms.
- It is hypothesized that the first set has a visual distribution similar to IN-1k, while the other two represent more general visual distributions covering fine-grained visual categories.
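As a quick illustration of the naming template, here is a tiny parser; the regex and field handling are my own guesses at the convention, not code from the paper.

```python
# Tiny sketch parsing the role-source-I-L dataset-name template,
# e.g. 'train-IG-940M-1k' or 'val-CUB-6k-200'. The regex is an
# assumption about the convention, not code from the paper.
import re

NAME = re.compile(
    r"^(?P<role>train|val|test)-(?P<source>[A-Za-z]+)"
    r"-(?P<images>[\d.]+[kMB]?)-(?P<labels>[\d.]+[kMB]?)$"
)

def parse_dataset_name(name):
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"not a role-source-I-L name: {name!r}")
    return match.groupdict()

print(parse_dataset_name("train-IG-940M-1k"))
# {'role': 'train', 'source': 'IG', 'images': '940M', 'labels': '1k'}
```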
1.3. Deduplication
- Overlap between evaluation sets and the large training sets can inflate results: for instance, ∼5% of the images in the val-CUB-6k-200 set [21] also appear in train-IN-1M-1k, and 1.78% of the images in the val-IN-50k-1k set are in the JFT-300M training set [17].
- Manual deduplication is performed to remove such overlapping images (a common automated first pass is sketched after this list).
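The post notes the deduplication was done manually; purely as an illustration, a common automated first pass flags near-duplicate candidates with perceptual hashing. This sketch uses the third-party imagehash package and is not the paper's method.

```python
# Hedged sketch of a near-duplicate first pass using perceptual hashing
# (pip install imagehash pillow). An illustration, not the paper's
# (manual) procedure.
from PIL import Image
import imagehash

def near_duplicate_pairs(val_paths, train_paths, max_distance=4):
    """Return (val, train) path pairs whose pHashes are within max_distance."""
    train_hashes = [(imagehash.phash(Image.open(p)), p) for p in train_paths]
    pairs = []
    for vp in val_paths:
        vh = imagehash.phash(Image.open(vp))
        for th, tp in train_hashes:
            if vh - th <= max_distance:  # Hamming distance between 64-bit hashes
                pairs.append((vp, tp))
    return pairs
```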
1.4. Model
- The ResNeXt architecture is used. The experiments use ResNeXt-101 32×Cd, which has 101 layers, 32 groups, and group widths C of: 4 (8B multiply-add FLOPs, 43M parameters), 8 (16B, 88M), 16 (36B, 193M), 32 (87B, 466M), and 48 (153B, 829M). (The released pretrained weights can be loaded as sketched after this list.)
- The model computes probabilities over all hashtags in the vocabulary using a softmax activation and is trained to minimize the cross-entropy between the predicted softmax distribution and the target distribution of each image. The target is a vector with k non-zero entries, each set to 1/k, corresponding to the k≥1 hashtags of the image (see the loss sketch after this list).
- The models are trained by synchronous stochastic gradient descent (SGD) on 336 GPUs across 42 machines with minibatches of 8,064 images.
- Each GPU processes 24 images at a time, and batch normalization (BN) [27] statistics are computed over these 24 images, i.e., per GPU.
- ResNeXt-101 32×16d networks took ∼22 days to train on 3.5B images.
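The pretrained WSL ResNeXt models were later released by Facebook through torch.hub; a minimal sketch of loading one variant (downloading the weights requires internet access):

```python
# Loading the released WSL-pretrained ResNeXt-101 32x16d via torch.hub
# (entry point 'facebookresearch/WSL-Images'; downloads weights on first use).
# The released weights are Instagram-pretrained, then fine-tuned on IN-1k.
import torch

model = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x16d_wsl")
model.eval()
```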
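And a minimal PyTorch sketch of the training objective described above: cross-entropy against a uniform distribution over each image's k hashtags. Names and shapes here are illustrative.

```python
# Cross-entropy against a target that puts mass 1/k on each of an
# image's k hashtags, as described above. Names/shapes are illustrative.
import torch
import torch.nn.functional as F

def hashtag_cross_entropy(logits, tags_per_image, vocab_size):
    """logits: (batch, vocab_size); tags_per_image: list of hashtag-index lists."""
    targets = torch.zeros(logits.size(0), vocab_size, device=logits.device)
    for i, tags in enumerate(tags_per_image):
        targets[i, tags] = 1.0 / len(tags)      # k non-zero entries, each 1/k
    log_probs = F.log_softmax(logits, dim=1)    # softmax over the whole vocabulary
    return -(targets * log_probs).sum(dim=1).mean()

logits = torch.randn(2, 17000)                  # e.g. the ~17k hashtag vocabulary
loss = hashtag_cross_entropy(logits, [[3], [10, 42]], vocab_size=17000)
print(loss.item())
```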
2. Experimental Results
2.1. Image Classification
- A network pretrained on nearly 1B Instagram images with 1.5k hashtags achieves a state-of-the-art accuracy of 84.2% — an improvement of 4.6% over the same model architecture trained on IN-1k alone and a 1.5% boost over the prior state-of-the-art [31].
- On the CUB2011 and Places365 target tasks, source models trained with the largest hashtag sets perform the best, likely because the 17k hashtags span more objects, scenes, and fine-grained categories.
- Accuracy is plotted against the number of Instagram training images (x-axis, log scale), ranging from 3.5M to 3.5B images.
- Each time the amount of training data is multiplied by a factor of x, a fixed increase y in classification accuracy is observed (see the sketch after this list).
- The accuracy increase y is larger for higher-capacity networks.
- The highest accuracies on val-IN-1k are 83.3% (source: IG-940M-1k) and 83.6% (source: IG-3.5B-17k), both with ResNeXt-101 32×16d.
- By comparison, when training from scratch on IN-1k, top-1 accuracy saturates at around 79.6%.
- With large-scale Instagram hashtag training, transfer-learning performance appears bottlenecked by model capacity.
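The log-linear trend (a fixed accuracy gain per constant multiplicative increase in data) can be made concrete with a tiny fit. The accuracy numbers below are made-up placeholders purely to illustrate the relationship, not values from the paper.

```python
# Fit acc = a + b * log10(images): b is the fixed gain per 10x more data.
# The accuracies here are made-up placeholders, NOT numbers from the paper.
import numpy as np

images = np.array([3.5e6, 3.5e7, 3.5e8, 3.5e9])  # 3.5M ... 3.5B images
acc = np.array([70.0, 74.0, 78.0, 82.0])         # hypothetical top-1 (%)

b, a = np.polyfit(np.log10(images), acc, deg=1)
print(f"gain per 10x data: {b:.2f} accuracy points")  # -> 4.00
```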
2.2. Object Detection
- Mask R-CNN with a ResNeXt-101 FPN backbone is used (a construction sketch appears after this list).
- When using large amounts of pretraining data, detection is model-capacity-bound: with the lowest-capacity model (32×4d), the gains from larger datasets are small or even negative, but as model capacity increases, the larger pretraining datasets yield consistent improvements. Even larger models may be needed to take full advantage of the large-scale pretraining data.
- AP emphasizes precise localization (it averages over strict IoU thresholds), while AP@50 allows for looser localization.
- The improvement from IG-1B-1k pretraining over IN-{1k, 5k} pretraining is much larger in terms of AP@50 (see the IoU sketch after this list).
- Thus, the gains from Instagram pretraining may be primarily due to improved object classification rather than improved spatial localization.
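For reference, a Mask R-CNN with a ResNeXt-101 FPN backbone can be assembled with torchvision. This is a minimal sketch, not the paper's training code, and torchvision's argument names vary across versions.

```python
# Sketch: Mask R-CNN with a ResNeXt-101 32x8d FPN backbone via torchvision.
# Not the paper's code; newer torchvision releases use weights=...
# instead of pretrained=... for this helper.
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone("resnext101_32x8d", pretrained=False)
model = MaskRCNN(backbone, num_classes=91)  # 91 = COCO's category-id range
model.eval()
```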
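To see why AP@50 is more forgiving, here is a small IoU check: a detection with IoU ≈ 0.67 against the ground truth counts as a match at the 0.5 threshold but fails the stricter thresholds that full COCO-style AP averages over. This is a pure illustration, not the paper's evaluation code.

```python
# Illustration of AP@50 vs. stricter thresholds: a detection with IoU ~0.67
# counts at IoU >= 0.5 but not at IoU >= 0.75. Not the paper's eval code.
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2); returns intersection-over-union."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

gt, det = (0, 0, 100, 100), (20, 0, 120, 100)  # detection shifted right by 20
v = iou(gt, det)                               # intersection 80x100 = 8000
print(v)                                       # 8000 / 12000 ~= 0.667
print(v >= 0.50, v >= 0.75)                    # True at AP@50, False at 0.75
```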
Reference
[2018 ECCV] [WSL]
Exploring the Limits of Weakly Supervised Pretraining
Weakly/Semi-Supervised Learning
2017 [Mean Teacher] 2018 [WSL] 2019 [Billion-Scale]