Reading: Light-Head R-CNN — In Defense of Two-Stage Object Detector (Object Detection)

Light-Head R-CNN Has Higher AP with Shorter Inference Time
  • In this paper, a light-head R-CNN is introduced to speed up the inference.


  1. Faster R-CNN Heavy Head
  2. R-FCN Heavy Head
  3. Light-Head R-CNN
  4. Experimental Results

1. Faster R-CNN Heavy Head

Faster R-CNN
  • In Faster R-CNN, the head computation can be intensive, especially when the number of object proposals is large.

2. R-FCN Heavy Head

  • But it involves more computation to generate the RoI-shared score maps.
  • Both Faster R-CNN and R-FCN have a heavy head, but at different positions.

3. Light-Head R-CNN

Light-Head R-CNN

3.1. Thin Feature Maps

  • In Light-Head R-CNN, the authors propose to generate feature maps with a small channel number (thin feature maps), followed by conventional RoI warping.
  • It is found that RoI warping on thin feature maps not only improves the accuracy but also saves memory and computation during training and inference.
  • Two settings: 1) setting “L” to validate the performance of our algorithm when integrated with a large backbone network; 2) setting “S” to validate the effectiveness and efficiency of our algorithm when using a small backbone network.
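
To make "RoI warping on thin feature maps" concrete, here is a minimal pure-Python sketch of position-sensitive RoI (PSRoI) average pooling, one possible form of RoI warping. The function name `psroi_pool`, the channel ordering and the end-exclusive RoI convention are assumptions made for illustration, not the authors' implementation. The point is only that a thin map with c_out×p×p channels pools down to a small c_out×p×p tensor per RoI.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def psroi_pool(features, roi, p, c_out):
    """Position-sensitive average pooling of a single RoI.

    features: nested list shaped [c_out * p * p][H][W]
    roi:      (x0, y0, x1, y1) in feature-map cells, end-exclusive
    returns:  nested list shaped [c_out][p][p]
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = [[[0.0] * p for _ in range(p)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(p):
            # Integer floor/ceil bin edges, so every bin covers >= 1 cell.
            ys = range(y0 + (i * h) // p, y0 + ceil_div((i + 1) * h, p))
            for j in range(p):
                xs = range(x0 + (j * w) // p, x0 + ceil_div((j + 1) * w, p))
                # Assumed ordering: each (c, i, j) bin reads its own channel.
                plane = features[(c * p + i) * p + j]
                vals = [plane[y][x] for y in ys for x in xs]
                out[c][i][j] = sum(vals) / len(vals)
    return out


# Toy example: p = 2, c_out = 1, so the "thin" map has 1 * 2 * 2 = 4 channels.
# Channel k is filled with the constant k, so each bin returns its channel id.
toy = [[[float(k)] * 4 for _ in range(4)] for k in range(4)]
pooled = psroi_pool(toy, roi=(0, 0, 4, 4), p=2, c_out=1)
print(pooled)
```

With p = 7 and c_out = 10, the same routine would read a 490-channel map and return a 10×7×7 tensor per RoI.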

3.2. Basic feature extractor

  • For the setting L, we adopt ResNet-101 as the basic feature extractor.
  • On the other hand, the Xception-like small base model is utilized for the setting S.

3.3. Large separable convolution

Large separable convolution performs a k×1 and 1×k convolution sequentially.
  • k is set to 15, Cmid = 64 for setting S, and Cmid = 256 for setting L.
  • Cout is reduced to 10×p×p, which is extremely small compared with the #classes×p×p used in R-FCN.
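
As a rough check on why the k×1 then 1×k factorization is cheap, the sketch below counts convolution weights (biases ignored) for one sequential branch under setting L and compares it with a dense k×k convolution producing the same output. The input width `c_in = 2048` is an assumption (the usual output width of ResNet-101's last stage) and the helper names are hypothetical. The input width for setting S is not given here, so it is left out.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of weights in a kh x kw convolution (bias ignored)."""
    return kh * kw * c_in * c_out


def separable_branch_weights(k, c_in, c_mid, c_out):
    """Weights of a k x 1 conv (c_in -> c_mid) followed by a 1 x k conv (c_mid -> c_out)."""
    return conv_weights(k, 1, c_in, c_mid) + conv_weights(1, k, c_mid, c_out)


k, p = 15, 7
c_in = 2048         # assumption: ResNet-101 final-stage width
c_mid = 256         # setting L
c_out = 10 * p * p  # 490 thin feature-map channels

separable = separable_branch_weights(k, c_in, c_mid, c_out)
dense = conv_weights(k, k, c_in, c_out)
print(f"separable branch: {separable:,} weights")
print(f"dense {k}x{k} conv:  {dense:,} weights")
print(f"ratio: {dense / separable:.1f}x")
```

Under these assumptions, a single branch needs about 9.7 million weights against roughly 226 million for the dense 15×15 convolution, while keeping the same 15×15 receptive field.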

3.4. R-CNN subnet

  • A single fully-connected layer with 2048 channels is employed in the R-CNN subnet, followed by two sibling fully-connected layers to predict RoI classification and regression.
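
To get a feel for how light this subnet is, the following sketch counts its parameters. It assumes the pooled RoI feature has 10×7×7 = 490 values (p = 7), 81 output classes (the COCO count implied by the 81×7×7 figure for R-FCN), and a class-agnostic regressor with 4 outputs. The last of these is an assumption for illustration and is not stated in this post.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


p = 7
roi_features = 10 * p * p  # 490 values per pooled RoI
hidden = 2048              # the single FC layer described above
num_classes = 81           # assumption: COCO classes incl. background
box_outputs = 4            # assumption: class-agnostic box regression

shared_fc = fc_params(roi_features, hidden)
cls_fc = fc_params(hidden, num_classes)
reg_fc = fc_params(hidden, box_outputs)
total = shared_fc + cls_fc + reg_fc

print(f"shared FC:      {shared_fc:,}")
print(f"classification: {cls_fc:,}")
print(f"regression:     {reg_fc:,}")
print(f"total:          {total:,}")
```

Under these assumptions the whole per-RoI head comes to roughly 1.2 million parameters.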

4. Experimental Results

4.1. Thin Feature Maps

The impact of reducing feature map channels for RoI warping.
  • B1 is R-FCN, B2 is R-FCN with enhanced settings. (If interested, please feel free to read the paper.)
  • The number of feature map channels is reduced to 490 (10×7×7) for PSRoI pooling. Note that this is quite different from the original R-FCN, which involves 3969 (81×7×7) channels.
  • With thin feature maps, the mmAP results are still comparable.
  • It is important to note that the Light-Head R-CNN design makes it possible to efficiently integrate a feature pyramid network (FPN).
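
The channel counts quoted above follow directly from groups×p×p. The short sketch below reproduces them and estimates the float32 memory of the resulting score maps. The 50×84 spatial size is a hypothetical example (roughly an 800×1344 image at stride 16), not a figure from the paper.

```python
def score_map_channels(groups, p):
    """Channels needed for position-sensitive pooling with p x p bins."""
    return groups * p * p


def map_bytes(channels, height, width, bytes_per_value=4):
    """Memory of one float32 feature map of shape (channels, height, width)."""
    return channels * height * width * bytes_per_value


p = 7
thin = score_map_channels(10, p)   # Light-Head R-CNN
rfcn = score_map_channels(81, p)   # original R-FCN on COCO (81 classes)

h, w = 50, 84  # hypothetical feature-map size
print(f"thin map:  {thin} channels, {map_bytes(thin, h, w) / 1e6:.1f} MB")
print(f"R-FCN map: {rfcn} channels, {map_bytes(rfcn, h, w) / 1e6:.1f} MB")
print(f"channel ratio: {rfcn / thin:.1f}x")
```

The ratio is independent of the spatial size: the thin map always has 8.1 times fewer channels than the R-FCN score maps.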

4.2. Large separable convolution

The impact of enhanced thin feature map

4.3. R-CNN subnet

The impact of light-head

4.4. SOTA Comparison

SOTA Comparison on COCO test-dev
Representative results of our large “L” model
Comparisons of the fast detector results on COCO test-dev
  • Light-Head R-CNN gets 30.7 mmAP at 102 FPS on MS COCO, significantly outperforming fast detectors such as YOLOv2 and SSD.
Representative results of our small “S” model


