Review — RefineDet: Single-Shot Refinement Neural Network for Object Detection (Object Detection)

Outperforms CoupleNet, DCN, RetinaNet, G-RMI, FPN, TDM, Faster R-CNN, Fast R-CNN, OHEM, YOLOv2, YOLOv1, SSD, DSSD

Sik-Ho Tsang
Nerd For Tech

--

In this story, Single-Shot Refinement Neural Network for Object Detection (RefineDet), by Chinese Academy of Sciences, University of Chinese Academy of Sciences, and GE Global Research, is reviewed. In this paper:

  • RefineDet is proposed, which consists of two inter-connected modules, namely, the anchor refinement module (ARM) and the object detection module (ODM).
  • The ARM filters out negative anchors to reduce the search space for the classifier, and coarsely adjusts the locations and sizes of anchors.
  • The ODM takes the refined anchors from the ARM as input to further improve the regression and predict multi-class labels.

This is a paper in 2018 CVPR with over 670 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. RefineDet: Network Architecture
  2. Anchor Refinement Module (ARM)
  3. Object Detection Module (ODM)
  4. Transfer Connection Block (TCB)
  5. Loss Function & Inference
  6. Experimental Results

1. RefineDet: Network Architecture

RefineDet: Network Architecture
  • Similar to SSD, RefineDet produces a fixed number of bounding boxes and scores indicating the presence of different classes of objects in those boxes, followed by non-maximum suppression (NMS) to produce the final result.
  • RefineDet is formed by two inter-connected modules, i.e., the anchor refinement module (ARM) and the object detection module (ODM).
  • ILSVRC CLS-LOC pretrained VGG-16 and ResNet-101 are used as backbones.
  • (There are some small modifications/refinement at the end of the backbones, please feel free to read the paper.)

1.1. Two-Step Cascaded Regression

  • As mentioned, the ARM filters out negative anchors to reduce the search space for the classifier, and coarsely adjusts the locations and sizes of anchors.
  • The ODM takes the refined anchors from the ARM as input to further improve the regression and predict multi-class labels.

1.2. Anchors Design and Matching

  • Four feature layers with total strides of 8, 16, 32, and 64 pixels are used to handle objects of different scales.
  • Each feature layer is associated with one specific scale of anchors (4 times the total stride of that layer) and three aspect ratios (0.5, 1.0, and 2.0); a tiling sketch follows this list.
  • (This paper is highly related to SSD, please read SSD if interested.)
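
To make the anchor design concrete, here is a minimal NumPy sketch of tiling anchors over the four feature layers, assuming (per the paper) one anchor scale per layer equal to 4 times its total stride and aspect ratios of 0.5, 1.0, and 2.0. The function name and exact center placement are illustrative assumptions.

```python
import numpy as np

def tile_anchors(image_size=320, strides=(8, 16, 32, 64),
                 aspect_ratios=(0.5, 1.0, 2.0)):
    """Tile anchors over the 4 feature layers: one scale per layer
    (4x its total stride) and 3 aspect ratios per cell."""
    all_anchors = []
    for stride in strides:
        scale = 4 * stride                         # one anchor scale per layer
        fm = image_size // stride                  # feature-map side length
        centers = (np.arange(fm) + 0.5) * stride   # cell centers in pixels
        cx, cy = np.meshgrid(centers, centers)
        for ar in aspect_ratios:
            w = scale * np.sqrt(ar)                # width/height from aspect ratio
            h = scale / np.sqrt(ar)
            boxes = np.stack([cx.ravel(), cy.ravel(),
                              np.full(fm * fm, w),
                              np.full(fm * fm, h)], axis=1)  # (cx, cy, w, h)
            all_anchors.append(boxes)
    return np.concatenate(all_anchors, axis=0)

anchors = tile_anchors()
print(anchors.shape)  # (6375, 4) for 320x320; 512x512 gives the paper's 16320 anchors
```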

2. Anchor Refinement Module (ARM)

Pre-defined Anchor Boxes in SSD
  • There are pre-defined anchor boxes in SSD with fixed locations, aspect ratios, and sizes. (Please feel free to read SSD if interested.)

ARM aims to remove negative anchors so as to reduce search space for the classifier and also coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor.

  • Specifically, n anchor boxes are associated with each regularly divided cell on the feature map.
  • At each feature map cell, four offsets of the refined anchor boxes relative to the original tiled anchors are predicted.
  • Two confidence scores indicating the presence of foreground objects in those boxes are also predicted (a minimal head sketch follows this list).
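
A minimal PyTorch sketch of such a per-layer ARM head is shown below; the class name, channel count, and layer arrangement are illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

class ARMHead(nn.Module):
    """ARM prediction head for one feature layer: for each of the n anchors
    at every feature-map cell, predict 4 box offsets and 2 confidence
    scores (foreground vs. background)."""
    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.loc = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.conf = nn.Conv2d(in_channels, num_anchors * 2, 3, padding=1)

    def forward(self, x):
        b = x.size(0)
        # (b, n*4, H, W) -> (b, H*W*n, 4), one row per anchor
        loc = self.loc(x).permute(0, 2, 3, 1).reshape(b, -1, 4)
        conf = self.conf(x).permute(0, 2, 3, 1).reshape(b, -1, 2)
        return loc, conf
```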

2.1. Negative Anchor Filtering

  • If an anchor box's negative confidence is larger than a preset threshold θ (i.e., θ = 0.99, set empirically), the anchor box is discarded when training the ODM, since it is almost certainly background.
  • Thus, only the refined hard negative anchor boxes and refined positive anchor boxes are passed to train the ODM (a sketch of this rule follows).
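
This filtering rule can be stated in a few lines; in the sketch below, the ordering of the two ARM logits (background first) is an assumption.

```python
import torch

def filter_negative_anchors(arm_conf, theta=0.99):
    """Return a mask of anchors kept for training the ODM: anchors whose
    ARM background confidence exceeds theta are discarded.
    arm_conf: (N, 2) logits, assumed ordered (background, foreground)."""
    neg_prob = torch.softmax(arm_conf, dim=1)[:, 0]
    return neg_prob <= theta   # True = anchor still passed to the ODM
```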

3. Object Detection Module (ODM)

  • After being refined by the ARM, the anchor boxes are passed to the corresponding feature maps in the ODM.

ODM aims to regress accurate object locations and predict multi-class labels based on the refined anchors.

  • Specifically, c class scores and the 4 accurate offsets of objects relative to the refined anchor boxes are calculated, yielding c + 4 outputs for each refined anchor box to complete the detection task (the two-step regression is sketched below).
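
The sketch below illustrates the two-step cascaded regression with an SSD-style box decoder. The decode() helper and the variance values (0.1, 0.2) are common SSD conventions assumed here rather than details taken from the paper.

```python
import torch

def decode(boxes, deltas, variances=(0.1, 0.2)):
    """Apply predicted offsets (dx, dy, dw, dh) to (cx, cy, w, h) boxes
    using the SSD-style parameterization (an assumption here)."""
    cx = boxes[:, 0] + deltas[:, 0] * variances[0] * boxes[:, 2]
    cy = boxes[:, 1] + deltas[:, 1] * variances[0] * boxes[:, 3]
    w = boxes[:, 2] * torch.exp(deltas[:, 2] * variances[1])
    h = boxes[:, 3] * torch.exp(deltas[:, 3] * variances[1])
    return torch.stack([cx, cy, w, h], dim=1)

# two-step cascaded regression with dummy predictions:
anchors = torch.rand(6375, 4)         # dummy tiled (cx, cy, w, h) anchors
arm_loc = torch.randn(6375, 4) * 0.1  # dummy ARM offsets
odm_loc = torch.randn(6375, 4) * 0.1  # dummy ODM offsets
refined = decode(anchors, arm_loc)    # step 1: ARM refines the anchors
final = decode(refined, odm_loc)      # step 2: ODM regresses from refined anchors
```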

3.1. Hard Negative Mining

  • Hard negative mining is used to mitigate the extreme foreground-background class imbalance.
  • Instead of using all negative anchors or randomly selecting negative anchors in training, the negative anchor boxes with the top loss values are selected so that the ratio between negatives and positives stays below 3:1 (see the sketch after this list).
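
A small PyTorch sketch of this selection rule; the function name and tensor layout are assumptions for illustration.

```python
import torch

def hard_negative_mining(cls_loss, is_pos, neg_pos_ratio=3):
    """Keep all positive anchors plus the highest-loss negatives, capped
    at neg_pos_ratio times the number of positives.
    cls_loss: (N,) per-anchor classification loss
    is_pos:   (N,) boolean mask of positive anchors"""
    num_neg = min(neg_pos_ratio * int(is_pos.sum()), int((~is_pos).sum()))
    neg_loss = cls_loss.clone()
    neg_loss[is_pos] = float('-inf')             # exclude positives from the ranking
    keep = is_pos.clone()
    keep[neg_loss.topk(num_neg).indices] = True  # add the hardest negatives
    return keep                                  # anchors that contribute to the loss
```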

4. Transfer Connection Block (TCB)

Transfer connection block (TCB)
  • The TCB converts features from the ARM to the ODM for detection.
  • The TCBs also integrate large-scale context by adding high-level features to the transferred features, which improves detection accuracy.
  • To match their dimensions, a deconvolution operation enlarges the high-level feature maps, and the feature maps are then summed element-wise (a module sketch follows this list).
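
The following PyTorch sketch follows the TCB diagram in the paper: convolutions on the ARM feature, an element-wise sum with the deconvolved higher-level feature, and a final convolution. The exact layer counts and channel sizes are assumptions and may differ from the released code.

```python
import torch.nn as nn

class TCB(nn.Module):
    """Transfer connection block: transform an ARM feature map and fuse it
    with the upsampled higher-level TCB output by element-wise summation."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.before_sum = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1))
        # deconvolution doubles the high-level map to match spatial size
        self.deconv = nn.ConvTranspose2d(out_channels, out_channels, 2, stride=2)
        self.after_sum = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, arm_feat, higher_feat=None):
        out = self.before_sum(arm_feat)
        if higher_feat is not None:   # the topmost layer has no higher feature
            out = out + self.deconv(higher_feat)
        return self.after_sum(out)
```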

5. Loss Function & Inference

5.1. Loss Function

  • The loss function for RefineDet consists of two parts, i.e., the loss in the ARM and the loss in the ODM.
  • For the ARM, a binary class label (of being an object or not) is assigned to each anchor, and its location and size are regressed simultaneously to obtain the refined anchor.
  • After that, the refined anchors with negative confidence less than the threshold are passed to the ODM to further predict object categories and accurate object locations and sizes.
  • The loss function is:

L({p_i}, {x_i}, {c_i}, {t_i}) = (1/N_arm) ( Σ_i L_b(p_i, [l_i* ≥ 1]) + Σ_i [l_i* ≥ 1] L_r(x_i, g_i*) )
                              + (1/N_odm) ( Σ_i L_m(c_i, l_i*) + Σ_i [l_i* ≥ 1] L_r(t_i, g_i*) )

  • where i is the index of an anchor in a mini-batch; l_i* and g_i* are the ground-truth class label and location/size of anchor i; p_i and x_i are the predicted confidence and refined coordinates of anchor i in the ARM; c_i and t_i are the predicted class scores and box coordinates in the ODM; and N_arm and N_odm are the numbers of positive anchors in the ARM and ODM, respectively.
  • The binary classification loss L_b is the cross-entropy/log loss over two classes (object vs. not object).
  • The multi-class classification loss L_m is the softmax loss over multiple class confidences.
  • Similar to Fast R-CNN, the smooth L1 loss is used as the regression loss L_r.
  • [l_i* ≥ 1] is an indicator that equals 1 when anchor i is positive and 0 otherwise, so the regression loss is ignored for negative anchors (a simplified loss sketch follows this list).
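
Below is a simplified PyTorch sketch of this two-part loss. Anchor filtering and hard negative mining are omitted for brevity, and both terms are normalized by the same positive count, so it is an approximation of the equation above rather than a faithful reproduction.

```python
import torch
import torch.nn.functional as F

def refinedet_loss(arm_conf, arm_loc, odm_conf, odm_loc,
                   labels, arm_targets, odm_targets):
    """labels: (N,) long tensor of ground-truth classes, 0 = background.
    arm_conf: (N, 2), odm_conf: (N, c), *_loc / *_targets: (N, 4)."""
    pos = labels >= 1                                   # the [l_i* >= 1] indicator
    # ARM: binary objectness loss + regression on positives only
    arm_l = F.cross_entropy(arm_conf, pos.long(), reduction='sum') \
          + F.smooth_l1_loss(arm_loc[pos], arm_targets[pos], reduction='sum')
    # ODM: multi-class softmax loss + regression on positives only
    odm_l = F.cross_entropy(odm_conf, labels, reduction='sum') \
          + F.smooth_l1_loss(odm_loc[pos], odm_targets[pos], reduction='sum')
    n = pos.sum().clamp(min=1).float()                  # stand-in for N_arm, N_odm
    return (arm_l + odm_l) / n
```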

5.2. Inference

  • The ARM first filters out the regularly tiled anchors whose negative confidence scores are larger than the threshold 0.99.
  • The ODM then takes over these refined anchors and outputs the top 400 high-confidence detections per image.
  • NMS with a Jaccard overlap threshold of 0.45 is applied per class.
  • The top 200 high-confidence detections per image are retained to produce the final detection results (a sketch of this pipeline follows this list).
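
Putting these steps together, here is a hedged sketch of the inference pipeline. It reuses the decode() helper from the Section 3 sketch, uses torchvision's nms and box_convert for post-processing, approximates "top 400 high confident detections" by ranking anchors on their best class score, and assumes background is class 0.

```python
import torch
from torchvision.ops import nms, box_convert

def refinedet_inference(anchors, arm_conf, arm_loc, odm_conf, odm_loc,
                        theta=0.99, top_k=400, iou_thresh=0.45, keep_top_k=200):
    # 1. drop anchors the ARM is confident are background
    keep = torch.softmax(arm_conf, dim=1)[:, 0] <= theta
    refined = decode(anchors[keep], arm_loc[keep])        # ARM refinement
    boxes = box_convert(decode(refined, odm_loc[keep]),   # ODM regression
                        'cxcywh', 'xyxy')
    scores = torch.softmax(odm_conf[keep], dim=1)
    # 2. keep the top 400 detections by best non-background score
    best, cls = scores[:, 1:].max(dim=1)
    top = best.topk(min(top_k, best.numel())).indices
    boxes, best, cls = boxes[top], best[top], cls[top]
    # 3. per-class NMS at IoU 0.45
    out = []
    for c in cls.unique():
        m = cls == c
        for i in nms(boxes[m], best[m], iou_thresh):
            out.append((int(c) + 1, float(best[m][i]), boxes[m][i]))
    # 4. retain the 200 highest-scoring detections
    out.sort(key=lambda d: d[1], reverse=True)
    return out[:keep_top_k]
```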

6. Experimental Results

6.1. Ablation Study

Effectiveness of various designs on VOC 2007 test set
  • All models are trained on VOC 2007 and VOC 2012 trainval sets, and tested on VOC 2007 test set.

With a low-dimensional input (i.e., 320×320), RefineDet produces 80.0% mAP with all the above techniques, making it the first method to achieve above 80% mAP with such small input images.

6.2. PASCAL VOC 2007 & 2012

Detection results on PASCAL VOC 2007 & 2012

6.2.1. VOC 2007

  • On VOC 2007, by using larger input size 512×512, RefineDet512 achieves 81.8% mAP, surpassing SSD and DSSD.
  • Compared to the two-stage methods, RefineDet512 performs better than most of them except CoupleNet.
  • With multi-scale testing strategy, RefineDet achieves 83.1% (RefineDet320+) and 83.8% (RefineDet512+) mAPs, which are much better than the state-of-the-art methods.
  • RefineDet processes an image in 24.8ms (40.3 FPS) and 41.5ms (24.1 FPS) with input sizes 320×320 and 512×512, respectively.

RefineDet is the first real-time method to achieve detection accuracy above 80% mAP on PASCAL VOC 2007.

  • RefineDet associates fewer anchor boxes with the feature maps (e.g., 24564 anchor boxes in SSD512 vs. 16320 anchor boxes in RefineDet512).
  • Only YOLOv1 and SSD300 are slightly faster than RefineDet320, but their accuracies are 16.6% and 2.5% lower, respectively.
Qualitative results of RefineDet512 on the PASCAL VOC 2007 test set

6.2.2. VOC 2012

  • For VOC 2012, all methods are trained on the VOC 2007 and VOC 2012 trainval sets plus the VOC 2007 test set, and tested on the VOC 2012 test set.
  • RefineDet320 obtains the top mAP of 78.1%, which is even better than most of the two-stage methods using about 1000×600 input size.
  • Using the input size 512×512, RefineDet improves the mAP to 80.1%, surpassing all one-stage methods and only slightly lower than CoupleNet.
  • With multi-scale testing, RefineDet obtains the state-of-the-art mAPs of 82.7% (RefineDet320+) and 83.5% (RefineDet512+).
Qualitative results of RefineDet512 on the PASCAL VOC 2012 test set

6.3. MS COCO

Detection results on MS COCO test-dev set
  • ResNet-101 based RefineDet is also used here.
  • The trainval35k set is used for training, and the results are evaluated on the test-dev set via the evaluation server.
  • RefineDet320 with VGG-16 produces 29.4% AP, which is better than all other VGG-16 based methods (e.g., SSD512 and OHEM++).
  • RefineDet320 with ResNet-101 achieves 32.0% AP and RefineDet512 achieves 36.4% AP, exceeding most detection methods except TDM, Deformable R-FCN (DCN), RetinaNet800, umd_det, and G-RMI, all of which use much bigger input images for both training and testing.

With multi-scale testing, the best performance of RefineDet is 41.8% AP, which is the state-of-the-art, surpassing all published two-stage and one-stage approaches.

Qualitative results of RefineDet512 on the MS COCO test-dev set

6.4. Fine-Tuning MS COCO Models for PASCAL VOC

Detection results on PASCAL VOC dataset
  • By fine-tuning the detection models pretrained on MS COCO, RefineDet achieves 84.0% mAP (RefineDet320) and 85.2% mAP (RefineDet512) on VOC 2007 test set, and 82.7% mAP (RefineDet320) and 85.0% mAP (RefineDet512) on VOC 2012 test set.
  • After using multi-scale testing, the detection accuracies improve to 85.6%, 85.8%, 86.0%, and 86.8% mAP, respectively.

The single model RefineDet512+ based on VGG-16 ranked in the top 5 on the VOC 2012 Leaderboard at the time.
