Review: G-RMI — Winner in 2016 COCO Detection (Object Detection)
A Guide to Select a Detection Architecture: Faster R-CNN, R-FCN and SSD
This time, we review G-RMI (Google Research and Machine Intelligence), the team that won 1st place in the 2016 MS COCO detection challenge. G-RMI is the team name used for the challenge, not the name of a proposed approach: they did not introduce a new deep learning architecture to win. Instead, as the paper title "Speed/accuracy trade-offs for modern convolutional object detectors" suggests, they systematically investigated different combinations of object detectors and feature extractors. Specifically:
- 3 Object Detectors (meta-architecture): Faster R-CNN, R-FCN, and SSD
- 6 Feature Extractors: VGG-16, ResNet-101, Inception-v2, Inception-v3, Inception-ResNet-v2 and MobileNet
They also analysed the effects of other factors such as input image size and the number of region proposals. Finally, an ensemble of several models achieved state-of-the-art results and won the challenge. The paper was published in 2017 CVPR with more than 400 citations. (Sik-Ho Tsang @ Medium)
Outline
- Meta-architectures
- Feature Extractors
- Accuracy vs Time
- Effect of Feature Extractor
- Effect of Object Size
- Effect of Image Size
- Effect of the Number of Proposals
- FLOPs Analysis
- Memory Analysis
- Good localization at .75 IOU means good localization at all IOU thresholds
- State-of-the-art Detection Results on COCO
1. Meta-architectures
The object detectors are called meta-architectures here. Three meta-architectures are investigated: Faster R-CNN, R-FCN, and SSD.
1.1. SSD
- SSD uses a single feed-forward convolutional network to directly predict classes and anchor offsets, without requiring a second-stage per-proposal classification operation.
1.2. Faster R-CNN
- In the first stage, called the region proposal network (RPN), images are processed by a feature extractor (e.g., VGG-16), and features at some selected intermediate level (e.g., "conv5") are used to predict class-agnostic box proposals.
- In the second stage, these (typically 300) box proposals are used to crop features from the same intermediate feature map (ROI pooling), which are subsequently fed to the remainder of the feature extractor (e.g., "fc6" followed by "fc7") in order to predict a class and class-specific box refinement for each proposal.
1.3. R-FCN
- Similar to Faster R-CNN, there is an RPN in the first stage.
- In the second stage, position-sensitive score maps are used such that crops (ROI pooling) are taken from the last layer of features prior to prediction. This makes the per-ROI operation very cheap, as nearly all computation is shared before ROI pooling.
- Thus, R-FCN achieves accuracy comparable to Faster R-CNN, often at faster running time.
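The one-stage and two-stage data flows described above can be sketched as follows. This is a minimal illustration only: all networks are stubbed with random tensors, and the image size, channel counts, anchor count, and class count are assumed, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_extractor(image):
    # Stand-in for e.g. VGG-16 up to "conv5": a 16x-downsampled feature map.
    h, w = image.shape[0] // 16, image.shape[1] // 16
    return rng.standard_normal((h, w, 256))

def ssd_head(features, num_anchors=6, num_classes=90):
    # One-stage (SSD): a single conv head predicts class scores and box
    # offsets for every anchor at every feature-map cell -- no second stage.
    h, w, _ = features.shape
    class_scores = rng.standard_normal((h, w, num_anchors, num_classes))
    box_offsets = rng.standard_normal((h, w, num_anchors, 4))
    return class_scores, box_offsets

def rpn(features, num_proposals=300):
    # Two-stage, first stage: class-agnostic box proposals (x1, y1, x2, y2).
    return rng.uniform(0, 1, size=(num_proposals, 4))

def second_stage(features, proposals, num_classes=90):
    # Two-stage, second stage: crop per-proposal features (ROI pooling),
    # then classify and refine each proposal -- the expensive per-ROI part.
    scores = rng.standard_normal((len(proposals), num_classes))
    refined_boxes = rng.standard_normal((len(proposals), 4))
    return scores, refined_boxes

image = np.zeros((512, 512, 3))
feats = feature_extractor(image)
ssd_cls, ssd_box = ssd_head(feats)      # SSD path: one pass
proposals = rpn(feats)                  # Faster R-CNN / R-FCN path
cls, box = second_stage(feats, proposals)
print(ssd_cls.shape, proposals.shape, cls.shape)
```

The key structural difference is visible in the shapes: SSD predicts densely over the feature map in one pass, while the two-stage path runs a per-proposal computation over the 300 RPN outputs.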
2. Feature Extractors
Six feature extractors are tried: VGG-16, ResNet-101, Inception-v2, Inception-v3, Inception-ResNet-v2 and MobileNetV1.
- For different feature extractors, different layers are used to extract features for object detection.
- For some feature extractors, modifications are made, such as using dilated (atrous) convolutions or reducing the max-pooling stride, so that the feature map after extraction is not too coarse (i.e., the effective output stride is not too large).
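The effect of these modifications can be shown with two small helper functions (the stage strides below are assumed for illustration): multiplying per-stage strides gives the effective output stride, and dilating a kernel preserves its spatial span after a stride reduction.

```python
def effective_stride(stage_strides):
    # The effective output stride is the product of the per-stage strides.
    s = 1
    for stride in stage_strides:
        s *= stride
    return s

def dilated_kernel_span(k, dilation):
    # A k x k kernel with dilation d covers d*(k-1)+1 input positions,
    # which is how dilated convolution preserves the receptive field
    # after a pooling/conv stride is reduced.
    return dilation * (k - 1) + 1

print(effective_stride([2, 2, 2, 2, 2]))  # 32: a typical default
print(effective_stride([2, 2, 2, 1, 1]))  # 8: a "stride 8" variant
print(dilated_kernel_span(3, 2))          # 5: dilated 3x3 spans like 5x5
```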
3. Accuracy vs Time
- Colors: Feature Extractors
- Marker shapes: Meta-architectures
3.1. General Observations
- R-FCN and SSD are faster on average.
- Faster R-CNN is slower but more accurate, requiring at least 100 ms per image.
3.2. Critical Points on Optimality Frontier
- SSDs with Inception-v2 and MobileNet are the most accurate of the fastest models.
- Ignoring post-processing costs, MobileNet seems to be roughly twice as fast as Inception-v2 while being slightly worse in accuracy.
Sweet Spot: R-FCN w/ResNet or Faster R-CNN w/ResNet and only 50 proposals
- There is an “elbow” in the middle of the optimality frontier occupied by R-FCN models using ResNet feature extractors.
- This is the best balance between speed and accuracy among the model configurations.
Most Accurate: Faster R-CNN w/Inception-ResNet at stride 8
- Faster R-CNN with dense-output Inception-ResNet-v2 models attains the best accuracy on the optimality frontier.
- Yet, these models are slow, requiring nearly a second of processing time.
4. Effect of Feature Extractor
- Intuitively, stronger performance on classification should be positively correlated with stronger performance on COCO detection.
- This correlation appears to only be significant for Faster R-CNN and R-FCN while the performance of SSD appears to be less reliant on its feature extractor’s classification accuracy.
5. Effect of Object Size
- All methods do much better on large objects.
- SSDs typically have (very) poor performance on small objects, but still SSDs are competitive with Faster R-CNN and R-FCN on large objects.
- Later on, DSSD was proposed to address the small-object detection issue.
6. Effect of Image Size
- Decreasing resolution by a factor of two in both dimensions consistently lowers accuracy (by 15.88% on average) but also reduces inference time by a relative factor of 27.4% on average.
- High resolution inputs allow for small objects to be resolved.
- High resolution models lead to significantly better mAP results on small objects (by a factor of 2 in many cases) and somewhat better mAP results on large objects as well.
7. Effect of the Number of Proposals
A different number of proposals can be output by the RPN (the first stage): fewer proposals mean faster running time, and vice versa.
Faster R-CNN
- Inception-ResNet, which has 35.4% mAP with 300 proposals can still have surprisingly high accuracy (29% mAP) with only 10 proposals.
- The sweet spot is probably at 50 proposals, where we are able to obtain 96% of the accuracy of using 300 proposals while reducing running time by a factor of 3.
R-FCN
- The computational savings from using fewer proposals in the R-FCN setting are minimal.
- This is not surprising because, as mentioned, the per-ROI computation cost is low for R-FCN due to the computation shared by position-sensitive score maps.
Comparison between Faster R-CNN and R-FCN
- At 100 proposals, Faster R-CNN models with ResNet become roughly comparable to equivalent R-FCN models using 300 proposals, in both mAP and GPU speed.
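A toy cost model makes the asymmetry between the two meta-architectures concrete. The millisecond figures below are made up for illustration, not measurements from the paper: Faster R-CNN runs a heavy per-ROI head, while R-FCN's per-ROI cost is nearly zero, so cutting proposals helps one far more than the other.

```python
def inference_time(shared_ms, per_roi_ms, num_proposals):
    # Total time = computation shared across all ROIs + per-ROI work.
    return shared_ms + per_roi_ms * num_proposals

# Assumed costs: heavy per-ROI head (Faster R-CNN) vs. a cheap
# position-sensitive vote (R-FCN); shared trunk cost identical.
faster_rcnn_300 = inference_time(200, 2.0, 300)
faster_rcnn_50 = inference_time(200, 2.0, 50)
rfcn_300 = inference_time(200, 0.05, 300)
rfcn_50 = inference_time(200, 0.05, 50)

print(faster_rcnn_300, faster_rcnn_50)  # 800.0 300.0 -- big savings
print(rfcn_300, rfcn_50)                # 215.0 202.5 -- minimal savings
```

Under these assumed numbers, dropping from 300 to 50 proposals cuts Faster R-CNN's time by roughly a factor of 2.7 but changes R-FCN's time by only about 6%, mirroring the trend reported above.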
8. FLOPs Analysis
- For denser block models such as ResNet-101, FLOPs/GPU time is typically greater than 1.
- For Inception and MobileNet models, this ratio is typically less than 1.
- One possible explanation is that factorization reduces FLOPs but adds overhead in memory I/O, or that current GPU instructions (cuDNN) are better optimized for dense convolutions.
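To see how much factorization reduces FLOPs on paper, compare a dense 3x3 convolution with a depthwise-separable one (as used in MobileNet). The feature-map size and channel counts below are assumed for illustration.

```python
def dense_conv_flops(h, w, cin, cout, k=3):
    # Multiply-adds for a standard k x k convolution.
    return h * w * cout * cin * k * k

def separable_conv_flops(h, w, cin, cout, k=3):
    # Depthwise k x k per channel, then a 1x1 pointwise projection.
    depthwise = h * w * cin * k * k
    pointwise = h * w * cin * cout
    return depthwise + pointwise

h = w = 56          # assumed feature-map size
cin = cout = 128    # assumed channel counts
dense = dense_conv_flops(h, w, cin, cout)
sep = separable_conv_flops(h, w, cin, cout)
print(dense / sep)  # ~8.4x fewer FLOPs for the separable version
```

So the FLOP count drops by roughly 8x, yet (as observed above) the measured GPU speedup is smaller, since memory I/O overhead and cuDNN's dense-convolution optimizations eat into the theoretical gain.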
9. Memory Analysis
- Memory usage is highly correlated with running time: larger and more powerful feature extractors require much more memory.
- As with speed, MobileNet is the cheapest, requiring less than 1 GB of (total) memory in almost all settings.
10. Good localization at .75 IOU means good localization at all IOU thresholds
- Both mAP@.5 and mAP@.75 performances are almost perfectly linearly correlated with mAP@[.5:.95].
- mAP@.75 is slightly more tightly correlated with mAP@[.5:.95] (with R² > 0.99), so if we were to replace the standard COCO metric with mAP at a single IOU threshold, IOU=.75 is likely to be chosen.
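The correlation claim can be illustrated by fitting a line and computing R². The mAP values below are made up for illustration (they are not the paper's data points); the point is only the mechanics of the R² computation.

```python
import numpy as np

# Assumed (model, mAP@.75, mAP@[.5:.95]) pairs for six hypothetical models.
map_75 = np.array([0.22, 0.28, 0.31, 0.35, 0.38, 0.41])
map_coco = np.array([0.20, 0.25, 0.28, 0.31, 0.34, 0.37])

# Least-squares linear fit of mAP@[.5:.95] against mAP@.75.
slope, intercept = np.polyfit(map_75, map_coco, 1)
pred = slope * map_75 + intercept

# Coefficient of determination R^2.
ss_res = np.sum((map_coco - pred) ** 2)
ss_tot = np.sum((map_coco - map_coco.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

With a near-linear relationship like this, R² comes out close to 1, which is the shape of the evidence behind preferring IOU=.75 as a single-threshold stand-in for mAP@[.5:.95].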
11. State-of-the-art Detection Results on COCO
11.1. Ensembling and Multicrop
- Since mAP is the main objective in the COCO detection challenge, the most accurate (though time-consuming) meta-architecture, Faster R-CNN, is considered.
- The diverse results encourage ensembling.
- G-RMI: Ensembling the above 5 models with multicrop inference yielded the final model. It outperforms the 2015 winner and the 2016 2nd-place entry.
- The winner in 2015 uses ResNet + Faster R-CNN + NoCs. (Please read my review about the COCO challenge results in NoCs.)
- Trimps-Soushen, 2nd place in 2016, uses Faster R-CNN + an ensemble of multiple models + improvements from other papers. (There are no published details about Trimps-Soushen's COCO challenge entry.)
- Note: There is no multiscale training, horizontal flipping, box refinement, box voting, or global context.
- 2nd Row: 6 Faster R-CNN models with 3 ResNet-101 and 3 Inception-ResNet-v2.
- 3rd Row: Diverse ensemble results as in the first table in this section.
- Thus, encouraging diversity did help, compared with using a hand-selected ensemble.
- Ensembling and multicrop were responsible for almost 7 points of improvement over a single model.
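One way to encourage diversity when building such an ensemble is greedy selection: start from the best single model and repeatedly add whichever candidate most improves a validation score. The sketch below is hypothetical (the model names and per-class AP numbers are made up, and the "best member per class" scoring is a toy proxy for complementarity, not the paper's procedure).

```python
# Made-up per-class AP vectors for five candidate Faster R-CNN models.
models = {
    "frcnn_resnet_a": [0.50, 0.30, 0.40],
    "frcnn_resnet_b": [0.48, 0.38, 0.38],
    "frcnn_incres_a": [0.45, 0.28, 0.50],
    "frcnn_incres_b": [0.52, 0.26, 0.44],
    "frcnn_incres_c": [0.47, 0.33, 0.46],
}

def ensemble_score(names):
    # Toy proxy: assume the ensemble is as good as its best member per
    # class, so complementary (diverse) models score well together.
    aps = [models[n] for n in names]
    return sum(max(col) for col in zip(*aps)) / len(aps[0])

# Greedily grow the ensemble to three members.
selected = []
while len(selected) < 3:
    best = max((n for n in models if n not in selected),
               key=lambda n: ensemble_score(selected + [n]))
    selected.append(best)
print(selected)
```

Note how the second pick is a ResNet model rather than the second-best Inception-ResNet: under this scoring, a weaker but complementary model beats a stronger but redundant one, which is the intuition behind preferring a diverse ensemble over a hand-selected one.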
11.2. Detections from 5 Different Models
References
[2017 CVPR] [G-RMI]
Speed/accuracy trade-offs for modern convolutional object detectors
My Related Reviews
Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet]
Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000]
Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet]
Instance Segmentation
[DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN]