Reading: Light-Head R-CNN — In Defense of Two-Stage Object Detector (Object Detection)

Outperforms R-FCN, Faster R-CNN, DCNv1, G-RMI, FPN, Mask R-CNN, RetinaNet, YOLOv2 & SSD

5 min read · Oct 5, 2020

**Light-Head R-CNN Has Higher AP with Shorter Inference Time**

In this story, “Light-Head R-CNN: In Defense of Two-Stage Object Detector” (Light-Head R-CNN), by Tsinghua University and Megvii Inc. (Face++), is briefly presented.

Faster R-CNN and R-FCN both use a heavy head for object detection, which makes inference slow.
In this paper, a light-head R-CNN is introduced to speed up inference.

This is a 2017 arXiv paper with over 100 citations. (Sik-Ho Tsang @ Medium)

Outline

Faster R-CNN Heavy Head
R-FCN Heavy Head
Light-Head R-CNN
Experimental Results

1. Faster R-CNN Heavy Head

Faster R-CNN adopts a powerful R-CNN subnet, which uses two large fully connected layers or the whole ResNet stage 5 as a second-stage classifier.
However, the computation can be intensive, especially when the number of object proposals is large.

2. R-FCN Heavy Head

To speed up the RoI-wise subnet, R-FCN first produces a set of score maps for each region, with #classes×p×p channels (p is the size of the subsequent pooling), then pools along each RoI and average-votes the final prediction.
But this involves more computation in generating the score maps shared across RoIs.
Both Faster R-CNN and R-FCN have a heavy head, but at different positions.
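To make the channel count concrete, here is a minimal sketch of the #classes×p×p arithmetic and of the average vote over the p×p bins. The function names are hypothetical, and the values 81 classes (80 COCO classes plus background) and p = 7 are assumed from the COCO setting discussed later in this story.

```python
def score_map_channels(num_classes: int, pool_size: int) -> int:
    """Channels of the position-sensitive score maps: one map per class per bin."""
    return num_classes * pool_size * pool_size


def average_vote(bin_scores: list[list[float]]) -> float:
    """Average the pooled scores of the p x p bins for one class of one RoI."""
    flat = [score for row in bin_scores for score in row]
    return sum(flat) / len(flat)


# Assumed COCO setting: 81 classes (including background), 7x7 pooling.
rfcn_channels = score_map_channels(81, 7)
print(rfcn_channels)  # 3969

# Toy 2x2 example of pooled bin scores for a single class (made-up numbers).
print(average_vote([[1.0, 0.0], [0.5, 0.5]]))  # 0.5
```

Because the channel count scales linearly with the number of classes, the shared score maps become expensive on datasets with many categories.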

3. Light-Head R-CNN

3.1. Thin Feature Maps

In Light-Head R-CNN, the authors propose to generate feature maps with a small channel number (thin feature maps), followed by conventional RoI warping.
It is found that RoI warping on thin feature maps not only improves accuracy but also saves memory and computation during training and inference.
Two settings are used: 1) setting “L”, to validate the performance of the algorithm when integrated with a large backbone network; 2) setting “S”, to validate its effectiveness and efficiency when using a small backbone network.

3.2. Basic feature extractor

For setting L, ResNet-101 is adopted as the basic feature extractor.
For setting S, an Xception-like small base model is used instead.

3.3. Large separable convolution

**Large separable convolution performs a k×1 and 1×k convolution sequentially.**

Large separable convolution layers are applied on C5, as shown above.
k is set to 15, Cmid = 64 for setting S, and Cmid = 256 for setting L.
Cout is reduced to 10×p×p, which is extremely small compared with the #classes×p×p used in R-FCN.
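As a rough illustration of why the k×1 and 1×k factorization is cheap, the sketch below counts weights (biases ignored) for one sequential k×1 then 1×k pair and for a dense k×k convolution with the same input and output channels. It assumes setting L with p = 7, so Cout = 10×7×7 = 490, and an input of 2048 channels, which is the C5 width of ResNet-101. The function names are hypothetical, and this is only a weight count, not the authors' implementation.

```python
def separable_weights(k: int, c_in: int, c_mid: int, c_out: int) -> int:
    """Weights of a k x 1 conv (c_in -> c_mid) followed by a 1 x k conv (c_mid -> c_out)."""
    return k * c_in * c_mid + k * c_mid * c_out


def dense_weights(k: int, c_in: int, c_out: int) -> int:
    """Weights of a single dense k x k conv (c_in -> c_out)."""
    return k * k * c_in * c_out


K = 15                 # kernel size from the text
C_IN = 2048            # assumption: ResNet-101 C5 channels
C_MID = 256            # setting L
C_OUT = 10 * 7 * 7     # thin feature map, assuming p = 7

sep = separable_weights(K, C_IN, C_MID, C_OUT)
dense = dense_weights(K, C_IN, C_OUT)
print(sep)    # 9745920
print(dense)  # 225792000
print(f"dense / separable = {dense / sep:.1f}x")
```

Under these assumptions, the factorized pair keeps the large 15-wide receptive field while using a small fraction of the weights of a dense 15×15 kernel.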

3.4. R-CNN subnet

A single fully connected layer with 2048 channels is employed in the R-CNN subnet, followed by two sibling fully connected layers that predict RoI classification and regression.
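The size of this subnet can be estimated with a small weight count (biases ignored). The sketch assumes the pooled RoI feature is the flattened 10×7×7 = 490 thin feature, that classification has 81 outputs (COCO classes plus background), and that regression has 4 outputs shared across classes. These output sizes and all names are assumptions made for illustration.

```python
def fc_weights(n_in: int, n_out: int) -> int:
    """Weights of one fully connected layer (biases ignored)."""
    return n_in * n_out


POOLED = 10 * 7 * 7   # assumed flattened RoI feature from the thin feature map
HIDDEN = 2048         # single fully connected layer from the text
NUM_CLASSES = 81      # assumption: COCO classes + background
BOX_OUTPUTS = 4       # assumption: class-agnostic box regression

shared_fc = fc_weights(POOLED, HIDDEN)
cls_fc = fc_weights(HIDDEN, NUM_CLASSES)
reg_fc = fc_weights(HIDDEN, BOX_OUTPUTS)
total = shared_fc + cls_fc + reg_fc

print(shared_fc)  # 1003520
print(cls_fc)     # 165888
print(reg_fc)     # 8192
print(total)      # 1177600
```

Since this per-RoI cost is paid once for every proposal, keeping the subnet to roughly a million weights is what keeps the second stage light.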

4. Experimental Results

4.1. Thin Feature Maps

**The impact of reducing feature map channels for RoI warping.**

MS COCO detection is used for comparison.
B1 is R-FCN, B2 is R-FCN with enhanced settings. (If interested, please feel free to read the paper.)
The number of feature map channels is reduced to 490 (10×7×7) for PSRoI pooling. Note that this is quite different from the original R-FCN, which involves 3969 (81×7×7) channels.
With thin feature maps, the mmAP results are still comparable.
It is important to note that the Light-Head R-CNN design makes it possible to integrate a feature pyramid network (FPN) efficiently.

4.2. Large separable convolution

**The impact of enhanced thin feature map**

Compared with the results based on the reproduced R-FCN setting B2, the thin feature map produced by large kernel can improve the performance by 0.7 points.

4.3. R-CNN subnet

37.7% mmAP is achieved when the large-kernel feature maps are combined with Light-Head R-CNN.

4.4. SOTA Comparison

With multi-scale training as well, and using FPN as the backbone, Setting L achieves 41.5% mAP, outperforming R-FCN, Faster R-CNN, DCNv1, G-RMI, FPN, Mask R-CNN, RetinaNet and RetinaNet with multi-scale training.

**Representative results of our large “L” model**

**Comparisons of the fast detector results on COCO test-dev**

Setting S: A tiny Xception-like network is used. The atrous algorithm is dropped in the fast models because it involves a lot of computation relative to the small backbone. The RPN convolution is set to 256 channels, half the number originally used in Faster R-CNN and R-FCN.
Light-Head R-CNN reaches 30.7 mmAP at 102 FPS on MS COCO, significantly outperforming fast detectors such as YOLOv2 and SSD.