Review: G-RMI — Winner in 2016 COCO Detection (Object Detection)
A Guide to Select a Detection Architecture: Faster R-CNN, R-FCN and SSD
This time, we review G-RMI (Google Research and Machine Intelligence), the team that won 1st place in the 2016 MS COCO detection challenge. G-RMI is the team name used for the challenge, not the name of a proposed approach: they did not introduce a new deep learning architecture to win. Instead, as the paper title "Speed/accuracy trade-offs for modern convolutional object detectors" suggests, they systematically investigated different combinations of object detectors and feature extractors. Specifically:
- 3 Object Detectors (meta-architecture): Faster R-CNN, R-FCN, and SSD
- 6 Feature Extractors: VGG-16, ResNet-101, Inception-v2, Inception-v3, Inception-ResNet-v2 and MobileNet
They also analysed the effects of other factors such as input image size and the number of region proposals. Finally, an ensemble of several models achieved state-of-the-art results and won the challenge. The paper was published in 2017 CVPR with more than 400 citations. (Sik-Ho Tsang @ Medium)
Outline
- Meta-architectures
- Feature Extractors
- Accuracy vs Time
- Effect of Feature Extractor
- Effect of Object Size
- Effect of Image Size
- Effect of the Number of Proposals
- FLOPs Analysis
- Memory Analysis
- Good localization at .75 IOU means good localization at all IOU thresholds
- State-of-the-art Detection Results on COCO
1. Meta-architectures
The object detectors are called meta-architectures here. Three meta-architectures are investigated: Faster R-CNN, R-FCN, and SSD.
1.1. SSD
- SSD uses a single feed-forward convolutional network to directly predict classes and anchor offsets, without requiring a second-stage per-proposal classification operation.
1.2. Faster R-CNN
- In the first stage, called the region proposal network (RPN), images are processed by a feature extractor (e.g., VGG-16), and features at some selected intermediate level (e.g., "conv5") are used to predict class-agnostic box proposals.
- In the second stage, these (typically 300) box proposals are used to crop features from the same intermediate feature map (ROI pooling), which are subsequently fed to the remainder of the feature extractor (e.g., "fc6" followed by "fc7") in order to predict a class and class-specific box refinement for each proposal.
1.3. R-FCN
- Similar to Faster R-CNN, there is an RPN in the first stage.
- In the second stage, position-sensitive score maps are used such that crops (ROI pooling) are taken from the last layer of features prior to prediction. This makes the per-ROI operation very cheap, as nearly all computation is shared before ROI pooling.
- Thus, R-FCN achieves accuracy comparable to Faster R-CNN, often at faster running time.
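The one-stage and two-stage data flows described above can be sketched as follows. This is a minimal illustration only: all networks are stubbed with random tensors, and the image size, channel counts, anchor count, and class count are assumed, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_extractor(image):
    # Stand-in for e.g. VGG-16 up to "conv5": a 16x-downsampled feature map.
    h, w = image.shape[0] // 16, image.shape[1] // 16
    return rng.standard_normal((h, w, 256))

def ssd_head(features, num_anchors=6, num_classes=90):
    # One-stage (SSD): a single conv head predicts class scores and box
    # offsets for every anchor at every feature-map cell -- no second stage.
    h, w, _ = features.shape
    class_scores = rng.standard_normal((h, w, num_anchors, num_classes))
    box_offsets = rng.standard_normal((h, w, num_anchors, 4))
    return class_scores, box_offsets

def rpn(features, num_proposals=300):
    # Two-stage, first stage: class-agnostic box proposals (x1, y1, x2, y2).
    return rng.uniform(0, 1, size=(num_proposals, 4))

def second_stage(features, proposals, num_classes=90):
    # Two-stage, second stage: crop per-proposal features (ROI pooling),
    # then classify and refine each proposal -- the expensive per-ROI part.
    scores = rng.standard_normal((len(proposals), num_classes))
    refined_boxes = rng.standard_normal((len(proposals), 4))
    return scores, refined_boxes

image = np.zeros((512, 512, 3))
feats = feature_extractor(image)
ssd_cls, ssd_box = ssd_head(feats)      # SSD path: one pass
proposals = rpn(feats)                  # Faster R-CNN / R-FCN path
cls, box = second_stage(feats, proposals)
print(ssd_cls.shape, proposals.shape, cls.shape)
```

The key structural difference is visible in the shapes: SSD predicts densely over the feature map in one pass, while the two-stage path runs a per-proposal computation over the 300 RPN outputs.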
2. Feature Extractors
Six feature extractors are tried: VGG-16, ResNet-101, Inception-v2, Inception-v3, Inception-ResNet-v2 and MobileNetV1.
- For different feature extractors, different layers are used to extract features for object detection.
- For some feature extractors, modifications are made, such as using dilated (atrous) convolutions or reducing the max-pooling stride, so that the feature map after extraction is not too coarse (i.e., the effective output stride is not too large).
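The effect of these modifications can be shown with two small helper functions (the stage strides below are assumed for illustration): multiplying per-stage strides gives the effective output stride, and dilating a kernel preserves its spatial span after a stride reduction.

```python
def effective_stride(stage_strides):
    # The effective output stride is the product of the per-stage strides.
    s = 1
    for stride in stage_strides:
        s *= stride
    return s

def dilated_kernel_span(k, dilation):
    # A k x k kernel with dilation d covers d*(k-1)+1 input positions,
    # which is how dilated convolution preserves the receptive field
    # after a pooling/conv stride is reduced.
    return dilation * (k - 1) + 1

print(effective_stride([2, 2, 2, 2, 2]))  # 32: a typical default
print(effective_stride([2, 2, 2, 1, 1]))  # 8: a "stride 8" variant
print(dilated_kernel_span(3, 2))          # 5: dilated 3x3 spans like 5x5
```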
3. Accuracy vs Time
- Colors: Feature Extractors
- Marker shapes: Meta-architectures
3.1. General Observations
- R-FCN and SSD are faster on average.
- Faster R-CNN is slower but more accurate, requiring at least 100 ms per image.
3.2. Critical Points on Optimality Frontier
- SSDs with Inception-v2 and MobileNet are the most accurate of the fastest models.
- Ignoring post-processing costs, MobileNet seems to be roughly twice as fast as Inception-v2 while being slightly worse in accuracy.
Sweet Spot: R-FCN w/ResNet or Faster R-CNN w/ResNet and only 50 proposals
- There is an “elbow” in the middle of the optimality frontier occupied by R-FCN models using ResNet feature extractors.
- This is the best balance between speed and accuracy among the model configurations.
Most Accurate: Faster R-CNN w/Inception-ResNet at stride 8
- Faster R-CNN with dense-output Inception-ResNet-v2 models attains the best accuracy on the optimality frontier.
- Yet, these models are slow, requiring nearly a second of processing time.
4. Effect of Feature Extractor
- Intuitively, stronger performance on classification should be positively correlated with stronger performance on COCO detection.
- This correlation appears to only be significant for Faster R-CNN and R-FCN while the performance of SSD appears to be less reliant on its feature extractor’s classification accuracy.
5. Effect of Object Size
- All methods do much better on large objects.
- SSDs typically have (very) poor performance on small objects, but still SSDs are competitive with Faster R-CNN and R-FCN on large objects.
- Later on, DSSD was proposed to address the small-object detection issue.
6. Effect of Image Size
- Decreasing resolution by a factor of two in both dimensions consistently lowers accuracy (by 15.88% on average) but also reduces inference time by a relative factor of 27.4% on average.
- High resolution inputs allow for small objects to be resolved.
- High resolution models lead to significantly better mAP results on small objects (by a factor of 2 in many cases) and somewhat better mAP results on large objects as well.
7. Effect of the Number of Proposals
A different number of proposals can be output by the RPN (the first stage): fewer proposals mean faster running time, and vice versa.
Faster R-CNN
- Inception-ResNet, which has 35.4% mAP with 300 proposals can still have surprisingly high accuracy (29% mAP) with only 10 proposals.
- The sweet spot is probably at 50 proposals, where we are able to obtain 96% of the accuracy of using 300 proposals while reducing running time by a factor of 3.
R-FCN
- The computational savings from using fewer proposals in the R-FCN setting are minimal.
- This is not surprising because, as mentioned, the per-ROI computation cost is low for R-FCN due to the computation shared by position-sensitive score maps.
Comparison between Faster R-CNN and R-FCN
- At 100 proposals, Faster R-CNN models with ResNet become roughly comparable to equivalent R-FCN models using 300 proposals, in both mAP and GPU speed.
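A toy cost model makes the asymmetry between the two meta-architectures concrete. The millisecond figures below are made up for illustration, not measurements from the paper: Faster R-CNN runs a heavy per-ROI head, while R-FCN's per-ROI cost is nearly zero, so cutting proposals helps one far more than the other.

```python
def inference_time(shared_ms, per_roi_ms, num_proposals):
    # Total time = computation shared across all ROIs + per-ROI work.
    return shared_ms + per_roi_ms * num_proposals

# Assumed costs: heavy per-ROI head (Faster R-CNN) vs. a cheap
# position-sensitive vote (R-FCN); shared trunk cost identical.
faster_rcnn_300 = inference_time(200, 2.0, 300)
faster_rcnn_50 = inference_time(200, 2.0, 50)
rfcn_300 = inference_time(200, 0.05, 300)
rfcn_50 = inference_time(200, 0.05, 50)

print(faster_rcnn_300, faster_rcnn_50)  # 800.0 300.0 -- big savings
print(rfcn_300, rfcn_50)                # 215.0 202.5 -- minimal savings
```

Under these assumed numbers, dropping from 300 to 50 proposals cuts Faster R-CNN's time by roughly a factor of 2.7 but changes R-FCN's time by only about 6%, mirroring the trend reported above.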
8. FLOPs Analysis
- For denser block models such as ResNet-101, FLOPs/GPU time is typically greater than 1.
- For Inception and MobileNet models, this ratio is typically less than 1.
- One possible explanation is that factorization reduces FLOPs but adds overhead in memory I/O, or that current GPU instructions (cuDNN) are better optimized for dense convolutions.
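To see how much factorization reduces FLOPs on paper, compare a dense 3x3 convolution with a depthwise-separable one (as used in MobileNet). The feature-map size and channel counts below are assumed for illustration.

```python
def dense_conv_flops(h, w, cin, cout, k=3):
    # Multiply-adds for a standard k x k convolution.
    return h * w * cout * cin * k * k

def separable_conv_flops(h, w, cin, cout, k=3):
    # Depthwise k x k per channel, then a 1x1 pointwise projection.
    depthwise = h * w * cin * k * k
    pointwise = h * w * cin * cout
    return depthwise + pointwise

h = w = 56          # assumed feature-map size
cin = cout = 128    # assumed channel counts
dense = dense_conv_flops(h, w, cin, cout)
sep = separable_conv_flops(h, w, cin, cout)
print(dense / sep)  # ~8.4x fewer FLOPs for the separable version
```

So the FLOP count drops by roughly 8x, yet (as observed above) the measured GPU speedup is smaller, since memory I/O overhead and cuDNN's dense-convolution optimizations eat into the theoretical gain.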
9. Memory Analysis
- Memory usage is highly correlated with running time: larger and more powerful feature extractors require much more memory.
- As with speed, MobileNet is the cheapest, requiring less than 1 GB of (total) memory in almost all settings.
10. Good localization at .75 IOU means good localization at all IOU thresholds
- Both mAP@.5 and mAP@.75 performances are almost perfectly linearly correlated with mAP@[.5:.95].
- mAP@.75 is slightly more tightly correlated with mAP@[.5:.95] (with R² > 0.99), so if we were to replace the standard COCO metric with mAP at a single IOU threshold, IOU=.75 is likely to be chosen.
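The correlation claim can be illustrated by fitting a line and computing R². The mAP values below are made up for illustration (they are not the paper's data points); the point is only the mechanics of the R² computation.

```python
import numpy as np

# Assumed (model, mAP@.75, mAP@[.5:.95]) pairs for six hypothetical models.
map_75 = np.array([0.22, 0.28, 0.31, 0.35, 0.38, 0.41])
map_coco = np.array([0.20, 0.25, 0.28, 0.31, 0.34, 0.37])

# Least-squares linear fit of mAP@[.5:.95] against mAP@.75.
slope, intercept = np.polyfit(map_75, map_coco, 1)
pred = slope * map_75 + intercept

# Coefficient of determination R^2.
ss_res = np.sum((map_coco - pred) ** 2)
ss_tot = np.sum((map_coco - map_coco.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

With a near-linear relationship like this, R² comes out close to 1, which is the shape of the evidence behind preferring IOU=.75 as a single-threshold stand-in for mAP@[.5:.95].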
11. State-of-the-art Detection Results on COCO
11.1. Ensembling and Multicrop
- Since mAP is the main objective in the COCO detection challenge, the most accurate (though time-consuming) meta-architecture, Faster R-CNN, is considered.
- The diverse results encourage ensembling.
- G-RMI: Ensembling the above 5 models with multicrop inference yielded the final model. It outperforms the 2015 winner and the 2016 2nd-place entry.
- The winner in 2015 uses ResNet + Faster R-CNN + NoCs. (Please read my review about the COCO challenge results in NoCs.)
- Trimps-Soushen, 2nd place in 2016, uses Faster R-CNN + an ensemble of multiple models + improvements from other papers. (There are no published details about Trimps-Soushen's COCO challenge entry.)
- Note: There is no multiscale training, horizontal flipping, box refinement, box voting, or global context.
- 2nd Row: 6 Faster R-CNN models with 3 ResNet-101 and 3 Inception-ResNet-v2.
- 3rd Row: Diverse ensemble results as in the first table in this section.
- Thus, encouraging diversity did help, compared with using a hand-selected ensemble.
- Ensembling and multicrop were responsible for almost 7 points of improvement over a single model.
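One way to encourage diversity when building such an ensemble is greedy selection: start from the best single model and repeatedly add whichever candidate most improves a validation score. The sketch below is hypothetical (the model names and per-class AP numbers are made up, and the "best member per class" scoring is a toy proxy for complementarity, not the paper's procedure).

```python
# Made-up per-class AP vectors for five candidate Faster R-CNN models.
models = {
    "frcnn_resnet_a": [0.50, 0.30, 0.40],
    "frcnn_resnet_b": [0.48, 0.38, 0.38],
    "frcnn_incres_a": [0.45, 0.28, 0.50],
    "frcnn_incres_b": [0.52, 0.26, 0.44],
    "frcnn_incres_c": [0.47, 0.33, 0.46],
}

def ensemble_score(names):
    # Toy proxy: assume the ensemble is as good as its best member per
    # class, so complementary (diverse) models score well together.
    aps = [models[n] for n in names]
    return sum(max(col) for col in zip(*aps)) / len(aps[0])

# Greedily grow the ensemble to three members.
selected = []
while len(selected) < 3:
    best = max((n for n in models if n not in selected),
               key=lambda n: ensemble_score(selected + [n]))
    selected.append(best)
print(selected)
```

Note how the second pick is a ResNet model rather than the second-best Inception-ResNet: under this scoring, a weaker but complementary model beats a stronger but redundant one, which is the intuition behind preferring a diverse ensemble over a hand-selected one.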
11.2. Detections from 5 Different Models
References
[2017 CVPR] [G-RMI]
Speed/accuracy trade-offs for modern convolutional object detectors
My Related Reviews
Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet]
Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000]
Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet]
Instance Segmentation
[DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN]