Review — Rethinking ImageNet Pre-training (Object Detection, Instance Segmentation)
Training From Scratch Not Worse Than ImageNet Pre-Training

In this story, Rethinking ImageNet Pre-training, by Facebook AI Research (FAIR), is briefly reviewed.
Pre-training has been used instead of training from scratch in many papers. However, is the pre-trained knowledge really useful when transferred to other computer vision tasks?
In this paper, the following findings are reported:
- Training from random initialization is surprisingly robust; the results hold even when (i) using only 10% of the training data, (ii) using deeper and wider models, and (iii) considering multiple tasks and metrics.
- ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy.
This is a paper in 2019 ICCV with over 350 citations. (Sik-Ho Tsang @ Medium)
(There are many details on the experimental setup to make the comparison fair. I skip some of the details and results to keep the story short. If interested, please feel free to visit the paper.)
Outline
- Number of Training Images & Setup
- Training from Scratch to Match Accuracy
- Training from Scratch with Less Data
- Discussions
1. Number of Training Images & Setup
1.1. Number of Training Images Involved

- Typical ImageNet pre-training involves over one million images iterated for one hundred epochs. In addition to any semantic information learned from this large-scale data, the pre-trained model has also learned low-level features (e.g., edges and textures).
- On the other hand, when training from scratch, the model has to learn both low-level features and high-level semantics, so more iterations may be necessary for it to converge well.
- Counting image-level samples, the from-scratch case sees considerably fewer samples than its fine-tuning counterpart.
- The sample numbers only become comparable if we count pixel-level samples, since COCO images are much larger than ImageNet training crops; see the sketch below.
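To make the sample-counting argument concrete, here is a rough back-of-the-envelope Python sketch. The image counts, epoch count, crop size, batch size, and resized COCO image size are typical values assumed for illustration; they are not taken from the paper's figure.

```python
# Compare how many image-level and pixel-level samples are seen with and
# without ImageNet pre-training. All numbers below are approximate and
# assumed for illustration only.

IMAGENET_IMAGES = 1.28e6      # ~1.28M training images
IMAGENET_EPOCHS = 100         # typical pre-training length
IMAGENET_PIXELS = 224 * 224   # standard training crop size

COCO_BATCH = 16               # 2 images/GPU x 8 GPUs (common Mask R-CNN setup)
COCO_PIXELS = 800 * 1333      # typical resized COCO image

def samples_seen(iterations, with_pretraining):
    """Return (image-level, pixel-level) samples seen during training."""
    images = iterations * COCO_BATCH
    pixels = images * COCO_PIXELS
    if with_pretraining:
        images += IMAGENET_IMAGES * IMAGENET_EPOCHS
        pixels += IMAGENET_IMAGES * IMAGENET_EPOCHS * IMAGENET_PIXELS
    return images, pixels

# Fine-tuning with a 2x schedule (180k iters) vs. from scratch with 6x (540k iters)
for name, iters, pretrained in [("fine-tune 2x", 180_000, True),
                                ("from scratch 6x", 540_000, False)]:
    imgs, pix = samples_seen(iters, pretrained)
    print(f"{name:>16}: {imgs:.2e} images, {pix:.2e} pixels")
```

With these assumed numbers, the from-scratch run sees roughly an order of magnitude fewer image-level samples but a similar number of pixel-level samples as the pre-trained-then-fine-tuned run.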
1.2. Setup
- Mask R-CNN with ResNet or ResNeXt backbones plus Feature Pyramid Network (FPN) is used.
- Group Normalization (GN) or Synchronized Batch Normalization (SyncBN) is used to replace all ‘frozen BN’ layers; SyncBN computes BN statistics across multiple GPUs.
- The models are trained with schedules ranging from 90k iterations (the ‘1× schedule’) and 180k iterations (the ‘2× schedule’) up to a so-called ‘6× schedule’ of 540k iterations; a minimal sketch of this setup follows below.
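As a rough illustration of the normalization choice, here is a minimal PyTorch sketch (assuming a recent torchvision) that builds a randomly initialized ResNet-50 backbone with GroupNorm in place of BatchNorm. This is not the authors' Detectron configuration, just a sketch of the idea that small-batch detection training from scratch needs a batch-size-independent normalizer.

```python
import torch.nn as nn
from torchvision.models import resnet50

# GroupNorm with 32 groups as a drop-in replacement for BatchNorm,
# so training does not depend on (small) per-GPU batch statistics.
def gn(num_channels, num_groups=32):
    return nn.GroupNorm(num_groups, num_channels)

# weights=None -> random initialization, i.e., no ImageNet pre-training.
backbone = resnet50(weights=None, norm_layer=gn)

# Alternative: keep BN but synchronize its statistics across GPUs (SyncBN),
# e.g. nn.SyncBatchNorm.convert_sync_batchnorm(backbone) under DistributedDataParallel.

# Schedules referred to above: '1x' = 90k iterations, up to '6x' = 540k.
SCHEDULES = {f"{k}x": k * 90_000 for k in range(1, 7)}
print(SCHEDULES["6x"])  # 540000
```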
2. Training from Scratch to Match Accuracy

- Typical fine-tuning schedules (2×) are enough for models with pre-training to converge to near optimum. But these schedules are not enough for models trained from scratch, which appear inferior to their fine-tuned counterparts if trained only for such a short period.
Models trained from scratch can catch up with their fine-tuning counterparts if a 5× or 6× schedule is used. When they converge to an optimum, their detection AP is no worse than that of their fine-tuning counterparts (see the schedule sketch below).
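The schedules differ only in length. If I recall the setup correctly, the learning rate is dropped by 10× in the last 60k and again in the last 20k iterations regardless of the total length; the sketch below encodes that rule, but treat the exact drop points and the base learning rate of 0.02 (for 16-image batches) as assumptions rather than copied configuration.

```python
# A minimal sketch of a piecewise-constant learning-rate schedule for an 'Nx' run,
# assuming the LR is reduced by 10x in the last 60k and last 20k iterations.

BASE_LR = 0.02  # typical Mask R-CNN base LR for 16-image batches (assumed)

def lr_at(iteration, total_iters, base_lr=BASE_LR):
    """Return the learning rate at a given iteration of the schedule."""
    if iteration >= total_iters - 20_000:
        return base_lr / 100
    if iteration >= total_iters - 60_000:
        return base_lr / 10
    return base_lr

total = 6 * 90_000  # '6x' schedule = 540k iterations
print(lr_at(0, total), lr_at(500_000, total), lr_at(530_000, total))
# -> 0.02 0.002 0.0002
```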
3. Training from Scratch with Less Data

- A smaller training set of 10k COCO images (i.e., less than 1/10th of the full COCO set) is used.
- The model with pre-training reaches 26.0 AP with 60k iterations, but degrades slightly when trained for longer.
The counterpart model trained from scratch reaches 25.9 AP at 220k iterations, which is comparably accurate.
4. Discussions
- The above experiments lead the authors to the following discussions.
4.1. Is ImageNet pre-training necessary?
- No, if we have enough target data.
- This suggests that collecting annotations of target data (instead of pre-training data) can be more useful for improving the target task performance.
4.2. Is ImageNet Useful?
- Yes.
- ImageNet pre-training reduces research cycles, leading to easier access to encouraging results, and fine-tuning from pre-trained weights converges faster than training from scratch.
4.3. Is Big Data Helpful?
- Yes.
- But a generic large-scale, classification-level pre-training set is not ideal if we take into account the extra effort of collecting and cleaning data.
- If the gain of large-scale classification-level pre-training becomes exponentially diminishing, it would be more effective to collect data in the target domain.
4.4. Shall We Pursue Universal Representations?
- Yes.
- Authors believe learning universal representations is a laudable goal.
- The study suggests that the community should be more careful when evaluating pre-trained features.
Reference
[2019 ICCV] [Rethinking ImageNet Pre-training]
Rethinking ImageNet Pre-training
Object Detection
2014: [OverFeat] [R-CNN]
2015: [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net]
2016: [CRAFT] [R-FCN] [ION] [MultiPathNet] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [SSD] [YOLOv1]
2017: [NoC] [G-RMI] [TDM] [DSSD] [YOLOv2 / YOLO9000] [FPN] [RetinaNet] [DCN / DCNv1] [Light-Head R-CNN]
2018: [YOLOv3] [Cascade R-CNN] [MegDet] [StairNet]
2019: [DCNv2] [Rethinking ImageNet Pre-training]
Instance Segmentation
2014–2015: [SDS] [Hypercolumn] [DeepMask]
2016: [SharpMask] [MultiPathNet] [MNC] [InstanceFCN]
2017: [FCIS] [Mask R-CNN]
2018: [MaskLab] [PANet]
2019: [DCNv2] [Rethinking ImageNet Pre-training]