Review — GFR-DSOD & GFR-SSD: Improving Object Detection from Scratch via Gated Feature Reuse (Object Detection)

Outperforms DSOD, DSSD, SSD, R-FCN, Faster R-CNN, DCN / DCNv1, FPN With Fewer Parameters

Sik-Ho Tsang
Nerd For Tech


An overview of the proposed GFR-DSOD

In this paper, Improving Object Detection from Scratch via Gated Feature Reuse (GFR-DSOD & GFR-SSD), by Carnegie Mellon University, University of Illinois at Urbana-Champaign, IBM Research AI, MIT-IBM Watson AI Lab, Google AI & UMass Amherst, and Stevens Institute of Technology, is reviewed. In this paper:

  • A Gated Feature Reuse (GFR) module is proposed, which uses Squeeze-and-Excitation to adaptively enhance or attenuate supervision.
  • An iterative feature-pyramid structure squeezes rich spatial and semantic features into a single prediction layer, which strengthens the feature representation and reduces the number of parameters to learn.
  • It is noted that this network can be trained from scratch, without ImageNet pre-training.

This is a paper in 2019 BMVC. (Sik-Ho Tsang @ Medium)

Outline

  1. Iterative Feature Re-Utilization
  2. Gate-Controlled Adaptive Recalibration
  3. Feature Reuse for DSOD and SSD
  4. Experimental Results

1. Iterative Feature Re-Utilization

A building block illustrating the iterative feature pyramids
  • As in DSOD (the first figure), feature maps at different scales are generated: the large-scale feature maps are downsampled and concatenated with the current feature maps.
  • Here, in addition to the downsampled feature maps, the small-scale feature maps are upsampled and concatenated with the current feature maps as well.
  • The downsampling pathway consists mainly of a max pooling layer (kernel size=2×2, stride=2), followed by a conv-layer (kernel size=1×1, stride = 1) to reduce channel dimensions.
  • The up-sampling pathway conducts a deconvolutional operation via bilinear upsampling followed by a conv-layer (kernel size = 1×1, stride = 1).
  • With coarser-resolution and fine-resolution features, a bottleneck block with a 1×1 conv-layer plus a 3×3 conv-layer is introduced to learn new features.
  • The number of parameters is only one-third of that in DSOD.

With the upsampling and downsampling, feature maps at different scales can be concatenated together to detect different sizes of objects, as shown above.
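
To make the two pathways and the bottleneck concrete, below is a minimal PyTorch-style sketch of one such building block. Only the kernel sizes, strides, bilinear upsampling and concatenation follow the description above; the channel widths, the ReLU activations, and the name GFRPyramidBlock are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GFRPyramidBlock(nn.Module):
    """One building block of the iterative feature pyramid (sketch only)."""

    def __init__(self, c_large, c_current, c_small, c_lateral, c_out):
        super().__init__()
        # Downsampling pathway: 2x2 max pooling (stride 2) + 1x1 conv to reduce channels.
        # Assumes the large-scale map has exactly twice the spatial resolution of the current one.
        self.down = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(c_large, c_lateral, kernel_size=1, stride=1),
        )
        # Upsampling pathway: bilinear upsampling + 1x1 conv.
        self.up_conv = nn.Conv2d(c_small, c_lateral, kernel_size=1, stride=1)
        # Bottleneck: 1x1 conv followed by 3x3 conv to learn new features from the fused maps.
        c_cat = 2 * c_lateral + c_current
        self.bottleneck = nn.Sequential(
            nn.Conv2d(c_cat, c_out, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, feat_large, feat_current, feat_small):
        down = self.down(feat_large)                        # coarser-resolution features
        up = self.up_conv(F.interpolate(feat_small,
                                        size=feat_current.shape[-2:],
                                        mode="bilinear", align_corners=False))
        fused = torch.cat([down, feat_current, up], dim=1)  # concatenate all three scales
        return self.bottleneck(fused)
```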

2. Gate-Controlled Adaptive Recalibration

Illustration of the structure of a gate, including: (i) channel-level attention; (ii) global-level attention; and (iii) identity mapping.

2.1. Channel-Level Attention

  • The Squeeze-and-Excitation block in SENet is used.
  • The squeeze stage can be formulated as a global average pooling operation on U: s_c = (1/(H×W)) Σ_i Σ_j u_c(i, j).
  • The excitation stage is two fully-connected layers plus a sigmoid activation: e = σ(W_2 δ(W_1 s)), where δ denotes the ReLU function and σ is the sigmoid function.
  • Then, ~U is calculated by ~U = e ⨂ U, where ⨂ denotes channel-wise multiplication.
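
As a rough illustration of this channel-level attention, here is a minimal PyTorch sketch; the reduction ratio r = 16 and the class name ChannelAttention are assumptions for illustration, not values taken from the paper.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-Excitation channel-level attention (sketch only)."""

    def __init__(self, channels, r=16):
        super().__init__()
        # Excitation: two fully-connected layers plus a sigmoid.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        s = u.mean(dim=(2, 3))              # squeeze: global average pooling over H x W
        e = self.fc(s)                      # excitation: per-channel weights in (0, 1)
        u_tilde = e.view(b, c, 1, 1) * u    # ~U: channel-wise multiplication e ⨂ U
        return u_tilde, s                   # s is returned so the global attention can reuse it
```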

2.2. Global-Level Attention

  • The global attention takes s (the output of the squeeze stage) as input, and generates only one element ¯e, the global attention.
  • Finally, ~V is calculated by scaling ~U with this single value: ~V = ¯e · ~U.

2.3. Identity Mapping

  • An element-wise addition operation between ~V and the original input U is performed to obtain the final output: ~V + U.
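
Putting the three parts together, a hedged sketch of the complete gate might look like the following, reusing the ChannelAttention sketch from Section 2.1; the layer sizes of the global branch and the name GFRGate are assumptions based on the description above rather than the paper's exact design.

```python
import torch.nn as nn

class GFRGate(nn.Module):
    """Gate = channel-level attention + global-level attention + identity mapping (sketch only)."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.channel_att = ChannelAttention(channels, r)   # from the sketch in Section 2.1
        # Global branch: maps the squeezed vector s to a single value in (0, 1).
        self.global_att = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, 1), nn.Sigmoid(),
        )

    def forward(self, u):
        u_tilde, s = self.channel_att(u)              # (i) channel-level attention: ~U = e ⨂ U
        e_bar = self.global_att(s).view(-1, 1, 1, 1)  # (ii) global attention: one scalar per image
        v_tilde = e_bar * u_tilde                     # ~V = ¯e · ~U
        return v_tilde + u                            # (iii) identity mapping: ~V + U
```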

3. Feature Reuse for DSOD and SSD

  • The proposed method is a generic solution for building iterative feature pyramids and gates inside deep convolutional neural networks based detectors, thus it’s very easy to apply to existing frameworks.

3.1. GFR-DSOD

  • There are two steps to adapt Gated Feature Reuse for DSOD.
  • First, the iterative feature reuse structure replaces the dense connections in the DSOD prediction layers.
  • Following that, gates are added in each prediction layer.

3.2. GFR-SSD

  • For SSD, similar operations are conducted to obtain GFR-SSD. Specifically, the extra layers in SSD are replaced with the GFR structure, and gates are cascaded in the prediction layers.

4. Experimental Results

  • It is noted that this network can be trained from scratch, without ImageNet pre-training.
  • For VOC 2007, the network is trained using the union of VOC 2007 trainval and VOC 2012 trainval (“07+12”) and tested on the VOC 2007 test set.
  • For VOC 2012, the network is trained using VOC 2012 trainval plus VOC 2007 trainval and test, and tested on the VOC 2012 test set.
  • For COCO, there are 80k images in the training set, 40k in the validation set, and 20k in the testing set (test-dev).

4.1. Ablation Experiments on PASCAL VOC 2007

Ablation Experiments of gate structure design on PASCAL VOC 2007
  • After adopting channel attention, global attention and identity mapping, gains of 0.4%, 0.2% and 0.2% are obtained, respectively.
Ablation Experiments on PASCAL VOC 2007
  • The result of the feature pyramids without the gates (78.6%) is on par with GFR-DSOD320 (row 6) and achieves a 0.8% improvement compared with the baseline (77.8%).
  • This indicates that the feature reuse structure contributes a lot to boosting the final detection performance.
  • The result of adding gates without the iterative feature pyramids (78.6%) also outperforms the baseline by 0.8% mAP.
Ablation Experiments of SSD300 from scratch on PASCAL VOC 2007
  • Also, the GFR structure helps the original SSD improve its performance by a large margin.

4.2. Results on PASCAL VOC 2007 & 2012

  • In the second table, GFR-DSOD achieves 79.2%, which is better than the baseline method DSOD (77.8%).
Detection examples on VOC 2007 test set with DSOD / GFR-DSOD models
  • The proposed method achieves better results on both small objects and dense scenes.
Convergence Speed Comparison
  • Thus, GFR-DSOD converges about 38% faster (in relative terms) than DSOD.
  • For the inference time, with a 300×300 input, the full GFR-DSOD runs at 17.5 fps on a single Titan X GPU with batch size 1, which is similar to DSOD300 with the dense prediction structure.
  • When enlarging the input size to 320×320, the speed decreases to 16.7 fps, and to 16.3 fps with more default boxes.
  • For comparison, SSD321 runs at 11.2 fps and DSSD321 runs at 9.5 fps with a ResNet-101 backbone network; the proposed method is much faster than these two competitors.
  • On PASCAL VOC 2012 Comp3 Challenge, GFR-DSOD result (72.5%) outperforms the previous state-of-the-art DSOD (70.8%) by 1.7% mAP.
  • After adding VOC 2007 as training data, 77.5% mAP is obtained.

4.3. Results on MS COCO

Comparisons of two-stage detectors on MS COCO 2015 test-dev set.
  • GFR-DSOD can achieve higher performance than the baseline method DSOD (30.0% vs. 29.4%) with fewer parameters (21.2M vs. 21.9M).
  • The result is comparable with FPN320/540 [22] (30.0% vs. 29.7%), but the number of parameters is only 1/6 of FPN's.
  • So, finally, GFR-DSOD outperforms DSOD, DSSD, SSD, R-FCN, Faster R-CNN, DCN / DCNv1, and FPN for {0.5:0.95} mAP.
