Review — Rethinking ImageNet Pre-training (Object Detection, Instance Segmentation)
Training From Scratch Not Worse Than ImageNet Pre-Training

In this story, Rethinking ImageNet Pre-training, by Facebook AI Research (FAIR), is briefly reviewed.
Pre-training has been used instead of training from scratch in many papers. However, is the pre-trained knowledge really useful when transferred to other computer vision tasks?
In this paper, the following findings are reported:
- Training from random initialization is surprisingly robust; the results hold even when (i) using only 10% of the training data, (ii) using deeper and wider models, and (iii) considering multiple tasks and metrics.
- ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy.
This is a paper in 2019 ICCV with over 350 citations. (Sik-Ho Tsang @ Medium)
(There are many details on the experimental setup to make the comparison fair. I skip some of the details and results to keep the story short. If interested, please feel free to visit the paper.)
Outline
- Number of Training Images & Setup
- Training from Scratch to Match Accuracy
- Training from Scratch with Less Data
- Discussions
1. Number of Training Images & Setup
1.1. Number of Training Images Involved

- Typical ImageNet pre-training involves over one million images iterated for one hundred epochs. In addition to any semantic information learned from this large-scale data, the pre-trained model has also learned low-level features (e.g., edges and textures).
- On the other hand, when training from scratch, the model has to learn both low-level features and high-level semantics, so more iterations may be necessary for it to converge well.
- Counting image-level samples, the from-scratch case sees considerably fewer samples than its fine-tuning counterpart.
- The sample numbers only become comparable if we count pixel-level samples, since COCO images are much larger than ImageNet training crops; see the sketch below.
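To make the sample-counting argument concrete, here is a rough back-of-the-envelope Python sketch. The image counts, epoch count, crop size, batch size, and resized COCO image size are typical values assumed for illustration; they are not taken from the paper's figure.

```python
# Compare how many image-level and pixel-level samples are seen with and
# without ImageNet pre-training. All numbers below are approximate and
# assumed for illustration only.

IMAGENET_IMAGES = 1.28e6      # ~1.28M training images
IMAGENET_EPOCHS = 100         # typical pre-training length
IMAGENET_PIXELS = 224 * 224   # standard training crop size

COCO_BATCH = 16               # 2 images/GPU x 8 GPUs (common Mask R-CNN setup)
COCO_PIXELS = 800 * 1333      # typical resized COCO image

def samples_seen(iterations, with_pretraining):
    """Return (image-level, pixel-level) samples seen during training."""
    images = iterations * COCO_BATCH
    pixels = images * COCO_PIXELS
    if with_pretraining:
        images += IMAGENET_IMAGES * IMAGENET_EPOCHS
        pixels += IMAGENET_IMAGES * IMAGENET_EPOCHS * IMAGENET_PIXELS
    return images, pixels

# Fine-tuning with a 2x schedule (180k iters) vs. from scratch with 6x (540k iters)
for name, iters, pretrained in [("fine-tune 2x", 180_000, True),
                                ("from scratch 6x", 540_000, False)]:
    imgs, pix = samples_seen(iters, pretrained)
    print(f"{name:>16}: {imgs:.2e} images, {pix:.2e} pixels")
```

With these assumed numbers, the from-scratch run sees roughly an order of magnitude fewer image-level samples but a similar number of pixel-level samples as the pre-trained-then-fine-tuned run.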
1.2. Setup
- Mask R-CNN with ResNet or ResNeXt backbones plus Feature Pyramid Network (FPN) is used.
- Group Normalization (GN) or Synchronized Batch Normalization (SyncBN) is used to replace all ‘frozen BN’ layers; SyncBN computes BN statistics across multiple GPUs.
- The models are trained with schedules ranging from 90k iterations (the ‘1× schedule’) and 180k iterations (the ‘2× schedule’) up to a so-called ‘6× schedule’ of 540k iterations; a minimal sketch of this setup follows below.
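As a rough illustration of the normalization choice, here is a minimal PyTorch sketch (assuming a recent torchvision) that builds a randomly initialized ResNet-50 backbone with GroupNorm in place of BatchNorm. This is not the authors' Detectron configuration, just a sketch of the idea that small-batch detection training from scratch needs a batch-size-independent normalizer.

```python
import torch.nn as nn
from torchvision.models import resnet50

# GroupNorm with 32 groups as a drop-in replacement for BatchNorm,
# so training does not depend on (small) per-GPU batch statistics.
def gn(num_channels, num_groups=32):
    return nn.GroupNorm(num_groups, num_channels)

# weights=None -> random initialization, i.e., no ImageNet pre-training.
backbone = resnet50(weights=None, norm_layer=gn)

# Alternative: keep BN but synchronize its statistics across GPUs (SyncBN),
# e.g. nn.SyncBatchNorm.convert_sync_batchnorm(backbone) under DistributedDataParallel.

# Schedules referred to above: '1x' = 90k iterations, up to '6x' = 540k.
SCHEDULES = {f"{k}x": k * 90_000 for k in range(1, 7)}
print(SCHEDULES["6x"])  # 540000
```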
2. Training from Scratch to Match Accuracy

- Typical fine-tuning schedules (2×) are enough for models with pre-training to converge to near optimum. But these schedules are not enough for models trained from scratch, which appear inferior to their fine-tuned counterparts if trained only for such a short period.
Models trained from scratch can catch up with their fine-tuning counterparts if a 5× or 6× schedule is used. When they converge to an optimum, their detection AP is no worse than that of their fine-tuning counterparts (see the schedule sketch below).
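The schedules differ only in length. If I recall the setup correctly, the learning rate is dropped by 10× in the last 60k and again in the last 20k iterations regardless of the total length; the sketch below encodes that rule, but treat the exact drop points and the base learning rate of 0.02 (for 16-image batches) as assumptions rather than copied configuration.

```python
# A minimal sketch of a piecewise-constant learning-rate schedule for an 'Nx' run,
# assuming the LR is reduced by 10x in the last 60k and last 20k iterations.

BASE_LR = 0.02  # typical Mask R-CNN base LR for 16-image batches (assumed)

def lr_at(iteration, total_iters, base_lr=BASE_LR):
    """Return the learning rate at a given iteration of the schedule."""
    if iteration >= total_iters - 20_000:
        return base_lr / 100
    if iteration >= total_iters - 60_000:
        return base_lr / 10
    return base_lr

total = 6 * 90_000  # '6x' schedule = 540k iterations
print(lr_at(0, total), lr_at(500_000, total), lr_at(530_000, total))
# -> 0.02 0.002 0.0002
```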
3. Training from Scratch with Less Data

- A smaller training set of 10k COCO images (i.e., less than 1/10th of the full COCO set) is used.
- The model with pre-training reaches 26.0 AP with 60k iterations, but degrades slightly when trained for longer.
The counterpart model trained from scratch reaches 25.9 AP at 220k iterations, which is comparably accurate.
4. Discussions
- The above experiments lead the authors to the following discussions.
4.1. Is ImageNet pre-training necessary?
- No, if we have enough target data.
- This suggests that collecting annotations of target data (instead of pre-training data) can be more useful for improving the target task performance.
4.2. Is ImageNet Useful?
- Yes.
- ImageNet pre-training reduces research cycles, leading to easier access to encouraging results, and fine-tuning from pre-trained weights converges faster than training from scratch.
4.3. Is Big Data Helpful?
- Yes.
- But a generic large-scale, classification-level pre-training set is not ideal if we take into account the extra effort of collecting and cleaning data.
- If the gain of large-scale classification-level pre-training becomes exponentially diminishing, it would be more effective to collect data in the target domain.
4.4. Shall We Pursue Universal Representations?
- Yes.
- Authors believe learning universal representations is a laudable goal.
- The study suggests that the community should be more careful when evaluating pre-trained features.
Reference
[2019 ICCV] [Rethinking ImageNet Pre-training]
Rethinking ImageNet Pre-training
Object Detection
2014: [OverFeat] [R-CNN]
2015: [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net]
2016: [CRAFT] [R-FCN] [ION] [MultiPathNet] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [SSD] [YOLOv1]
2017: [NoC] [G-RMI] [TDM] [DSSD] [YOLOv2 / YOLO9000] [FPN] [RetinaNet] [DCN / DCNv1] [Light-Head R-CNN]
2018: [YOLOv3] [Cascade R-CNN] [MegDet] [StairNet]
2019: [DCNv2] [Rethinking ImageNet Pre-training]
Instance Segmentation
2014–2015: [SDS] [Hypercolumn] [DeepMask]
2016: [SharpMask] [MultiPathNet] [MNC] [InstanceFCN]
2017: [FCIS] [Mask R-CNN]
2018: [MaskLab] [PANet]
2019: [DCNv2] [Rethinking ImageNet Pre-training]