Reading: Cascade R-CNN — Delving into High Quality Object Detection (Object Detection)

Outperforms YOLOv2, SSD, RetinaNet, Faster R-CNN, FPN, G-RMI, R-FCN, DCNv1 and Mask R-CNN

May 20, 2020

In this story, Cascade R-CNN, by UC San Diego, is briefly described. The performance of prior deep learning object detectors tends to degrade as the IoU (Intersection over Union) threshold increases. They usually suffer from two main problems:

Overfitting during training, due to exponentially vanishing positive samples, i.e. many positive samples disappear as the IoU threshold increases.
Inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses, e.g. training at a higher (lower) IoU threshold but testing at a lower (higher) IoU threshold.
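To make the first problem concrete, here is a minimal sketch of how the number of positive samples shrinks as the IoU threshold rises. The box layout, the proposals and the helper names (iou, count_positives) are illustrative assumptions, not taken from the paper.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_positives(proposals, gt_box, threshold):
    """Number of proposals whose IoU with the ground truth is >= threshold."""
    return sum(1 for p in proposals if iou(p, gt_box) >= threshold)


# Hypothetical data: one ground-truth box and proposals that are the
# same box shifted to the right by an increasing offset.
gt = (0.0, 0.0, 100.0, 100.0)
proposals = [(dx, 0.0, 100.0 + dx, 100.0) for dx in (5, 10, 20, 30, 40, 50)]

for u in (0.5, 0.6, 0.7):
    print(u, count_positives(proposals, gt, u))
# prints:
# 0.5 4
# 0.6 3
# 0.7 2
```

With the same set of proposals, every increase of the threshold leaves fewer positives to train on, which is the overfitting risk described above.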

In this paper, Cascade R-CNN, an extension of Faster R-CNN, is proposed to solve the above problems. It was published in 2018 CVPR and has over 450 citations. (Sik-Ho Tsang @ Medium)

Outline

Prior Art Network Architectures
Cascade R-CNN Network Architecture
Experimental Results

1. Prior Art Network Architectures

(a) Faster R-CNN: The first stage is a proposal sub-network (“H0”), applied to the entire image, to produce preliminary detection hypotheses, known as object proposals.
In the second stage, these hypotheses are processed by a region-of-interest detection sub-network (“H1”), referred to as the detection head. A final classification score (“C1”) and a bounding box (“B1”) are assigned to each hypothesis.
But B1 is not accurate enough.
(b) Iterative BBox at Inference: Some prior works consider the bounding box B1 obtained in (a) not accurate enough. Thus, B1 is fed into the same H1 again to regress a new bounding box B2, and so on. This iterative approach attempts to gradually refine the bounding box to obtain a more accurate one.

However, all heads (f in the equation, or H1 in the figure) are identical: the same H1 is applied again and again. The improvement is limited, usually with no gain beyond two iterations.
Also, only one H1 is used during training, while H1 is applied multiple times during inference, which causes a mismatch between training and testing.
It is more like a post-processing step.
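The iterative procedure in (b) can be sketched as repeated application of a single regressor. This is only an illustration: toy_head is a hypothetical stand-in for the trained H1 (a real head predicts box offsets from RoI features), and its fixed target box is made up.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iterative_bbox(head: Callable[[Box], Box], box: Box, num_iters: int) -> Box:
    """Apply the SAME head num_iters times, feeding each output back in."""
    for _ in range(num_iters):
        box = head(box)
    return box


def toy_head(box: Box) -> Box:
    """Hypothetical regressor: moves every coordinate halfway to a fixed target."""
    target = (10.0, 10.0, 50.0, 50.0)  # assumed value, for illustration only
    return tuple(c + 0.5 * (t - c) for c, t in zip(box, target))


b1 = (0.0, 0.0, 40.0, 40.0)
b3 = iterative_bbox(toy_head, b1, num_iters=2)
print(b3)  # (7.5, 7.5, 47.5, 47.5)
```

Note that nothing here changes between iterations except the input box, which is why the approach behaves like post-processing rather than a trained multi-stage detector.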
(c) Integral Loss: Different classification heads are used, each trained for a different IoU threshold. The classifiers are ensembled during inference.

However, high quality classifiers are prone to overfitting. Also, those high quality classifiers are required to process proposals of overwhelmingly low quality at inference, for which they are not optimized.
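A rough sketch of the idea in (c): the same proposal gets a different training label for each threshold-specific classifier, and the classifier scores are combined at inference. The thresholds, the example scores and the use of a plain average as the ensemble rule are assumptions made for illustration.

```python
def labels_per_threshold(proposal_iou, thresholds):
    """Training label (1 = positive, 0 = negative) of one proposal
    for each threshold-specific classifier."""
    return {u: int(proposal_iou >= u) for u in thresholds}


def ensemble_score(scores):
    """Assumed ensemble rule: plain average of the classifier scores."""
    return sum(scores) / len(scores)


thresholds = (0.5, 0.6, 0.7)

# A proposal with IoU 0.65 is positive for u=0.5 and u=0.6 but negative for u=0.7.
labels = labels_per_threshold(0.65, thresholds)
print(labels)  # {0.5: 1, 0.6: 1, 0.7: 0}

# Hypothetical scores from the three classifiers for one test proposal.
final_score = ensemble_score([0.9, 0.7, 0.5])
```

The higher the threshold, the fewer proposals receive a positive label, so the high quality classifiers see few positives during training.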

(If interested, please read the original paper. More details are mentioned.)

2. Cascade R-CNN Network Architecture

Unlike iterative BBox, which reuses the same H1, different heads are used at different stages, i.e. H1, H2, H3 as shown in the figure above, or f1, …, fT-1, fT as shown in the equation above.
Each of them is designed for one specific IoU threshold, from small to large.
The cascaded regression is a resampling procedure, not a post-processing step, providing good positive samples to the next stage.
There is no discrepancy between training and inference, since the architecture and IoU thresholds are the same in both.
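The resampling effect can be illustrated with a small sketch: each stage labels its input boxes with its own threshold, refines them with its own head, and passes the refined boxes on. The toy heads, the proposals and the thresholds 0.5/0.6/0.7 used here are assumptions for illustration; a real stage head is a learned network that never sees the ground truth at inference.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def make_toy_head(gt: Box, step: float) -> Callable[[Box], Box]:
    """Hypothetical stand-in for a learned stage head: moves every
    coordinate a fraction `step` of the way toward the ground truth."""
    def head(box: Box) -> Box:
        return tuple(c + step * (t - c) for c, t in zip(box, gt))
    return head


def run_cascade(heads: Sequence[Callable[[Box], Box]],
                thresholds: Sequence[float],
                proposals: Sequence[Box],
                gt: Box) -> Tuple[List[Box], List[int]]:
    """Each stage counts positives with ITS threshold, then refines the boxes."""
    boxes = list(proposals)
    positives_per_stage = []
    for head, u in zip(heads, thresholds):
        positives_per_stage.append(sum(iou(b, gt) >= u for b in boxes))
        boxes = [head(b) for b in boxes]  # resampled input for the next stage
    return boxes, positives_per_stage


gt = (0.0, 0.0, 100.0, 100.0)
proposals = [(dx, 0.0, 100.0 + dx, 100.0) for dx in (10.0, 30.0, 60.0)]
thresholds = (0.5, 0.6, 0.7)
# One separate head per stage (the toy heads happen to share a step size).
heads = [make_toy_head(gt, 0.5) for _ in thresholds]

_, positives = run_cascade(heads, thresholds, proposals, gt)
fixed = sum(iou(b, gt) >= 0.7 for b in proposals)  # no cascade, u=0.7 directly
print(positives, fixed)  # [2, 2, 3] 1
```

In this toy setting the number of positives stays roughly constant across stages even though the threshold rises, whereas applying u=0.7 directly to the original proposals leaves only one positive.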

3. Experimental Results

The MS COCO 2017 dataset is used, which contains ∼118k images for training, 5k for validation (val) and ∼20k for testing without provided annotations (test-dev).

3.1. Performance of Different Stages

For the first stage, u=0.5 achieves the highest AP at low IoU thresholds. This is expected because the first stage is trained using u=0.5.
For the second stage, u=0.6 achieves the highest AP at nearly all IoU thresholds.
For the third stage, u=0.7 obtains the highest AP at high IoU thresholds.

But with 4 stages, the overall performance stays the same or drops, although the highest AP80 and AP90 are obtained.
Thus, the best trade-off is the 3-stage Cascade R-CNN.

3.2. Comparison with Iterative BBox & Integral Loss

Left: Iterative BBox, Right: Integral Loss

Compared with iterative BBox, the 3rd stage of Cascade R-CNN has better localization performance over the whole curve, across different input IoUs.
For the 1st stage, both have similar AP since no bounding box refinement has been applied yet.
For integral loss, the ensemble does not help much over the single IoU threshold results.

For the overall AP, Cascade R-CNN outperforms both by a large margin.

3.3. Comparison with SOTA Approaches

Cascade R-CNN outperforms YOLOv2, SSD, RetinaNet, Faster R-CNN, FPN, G-RMI, DCNv1 and Mask R-CNN by a large margin.
Mask R-CNN uses segmentation masks to help object detection, while Cascade R-CNN does not need them.

3.4. Cascade R-CNN with Different Backbones and Detectors

The Cascade R-CNN architecture can be applied to R-FCN and FPN as well.
Across different backbones and detectors, using Cascade R-CNN improves the overall AP.

During the days of coronavirus, let me challenge myself to write 30 stories again this month. Is it good? This is the 28th story this month, with 2 stories to go. Thanks for visiting my story.

Reference

[2018 CVPR] [Cascade R-CNN]
Cascade R-CNN: Delving into High Quality Object Detection

Object Detection

[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN / DCNv1] [Cascade R-CNN] [DCNv2]