Reading: StairNet — Top-Down Semantic Aggregation (Object Detection)
In this story, “StairNet: Top-Down Semantic Aggregation for Accurate One Shot Detection” (StairNet), by KAIST, is briefly presented.
- One-stage detectors are competitive with two-stage methods on large objects but have difficulty detecting small objects, because the lower layers that are responsible for small objects lack strong semantics.
- In this paper, a feature combining module spreads the strong semantics in a top-down manner, so that the final StairNet detector unifies the multi-scale representations and the semantic distribution effectively.
This is a paper in 2018 WACV with over 40 citations. (Sik-Ho Tsang @ Medium)
- Feature Combining Module
- Unified Prediction Layer
- Experimental Results
1. Feature Combining Module
- The SSD framework is used as the meta-architecture of StairNet.
- The authors augment the conventional SSD with a feature combining module in order to propagate the high-level abstract features of the top layers to the lower layers.
- It consists of three parts: 1×1 convolution layer, deconvolution layer and 3×3 convolution layer. (Black dotted box)
- To combine feature maps of two different sizes, deconvolution layers that upsample by a factor of 2 are added. (Green line) The features of the upper layer, which have stronger semantics than those of the lower layer, are delivered by this deconvolution layer.
- Before combining them, it is essential to normalize the features from the different layers, since they have different scale distributions. Batch normalization is used to handle this problem.
- An element-wise addition is then used to combine the features. The combined features are passed down directly to the next deconvolution layer.
- To effectively mix the information from the two streams (blue and green lines), a 3×3 convolution layer (Red line) is used to construct the final enhanced feature maps before the classifier.
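The top-down flow described in this section can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: nearest-neighbour upsampling stands in for the learned deconvolution, a per-map standardization stands in for batch normalization, the 1×1 and 3×3 convolutions are omitted, and the map sizes (8×8, 4×4, 2×2) and all function names are assumptions chosen only to keep the example small.

```python
# Toy sketch of top-down aggregation on single-channel 2D feature maps.
# Assumptions (not from the paper): nearest-neighbour upsampling replaces the
# learned deconvolution, per-map standardization replaces batch normalization,
# and the 1x1 / 3x3 convolutions are left out.

def upsample2x(fmap):
    """Upsample a 2D map by a factor of 2 (stand-in for the deconvolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def standardize(fmap, eps=1e-5):
    """Scale one map to zero mean and unit variance (stand-in for batch norm)."""
    vals = [v for row in fmap for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    std = (var + eps) ** 0.5
    return [[(v - mean) / std for v in row] for row in fmap]


def add(a, b):
    """Element-wise addition of two maps of identical size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_aggregate(pyramid):
    """Combine maps ordered from the lowest (largest) to the top (smallest).

    Each map is assumed to be exactly twice the size of the next one.
    """
    combined = [None] * len(pyramid)
    combined[-1] = pyramid[-1]  # the top layer has nothing above it
    for i in range(len(pyramid) - 2, -1, -1):
        # The already combined upper map is passed down, not the raw one.
        from_above = standardize(upsample2x(combined[i + 1]))
        lateral = standardize(pyramid[i])
        combined[i] = add(lateral, from_above)
    return combined


pyramid = [
    [[float(r * 8 + c) for c in range(8)] for r in range(8)],
    [[float(r + c) for c in range(4)] for r in range(4)],
    [[1.0, 0.0], [0.0, 1.0]],
]
enhanced = top_down_aggregate(pyramid)
print([f"{len(m)}x{len(m[0])}" for m in enhanced])
```

Because each combined map is passed down to the next level, a change in the top map reaches the lowest map as well, which is the "stair" behaviour the module is meant to provide.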
2. Unified Prediction Layer
- By adopting a unified classifier, the number of classifier parameters can be reduced.
- The module shares a single classifier over the various scales, which also helps when the data are imbalanced across scales. For example, there are many large cows but few small cows in PASCAL VOC 2007. As the final feature representations of a large cow and a small cow are similar, the classifier trained with large cows would work well for small cows as well.
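The parameter saving can be made concrete with a small back-of-the-envelope count. The numbers below are illustrative assumptions, not values taken from the paper: six prediction scales, 256 input channels and 6 default boxes per location at every scale, 3×3 classification convolutions, and 21 classes (the 20 PASCAL VOC classes plus background).

```python
# Rough parameter count: one classification head per scale vs. one shared head.
# All concrete numbers here are assumptions for illustration only.

def conv_params(in_ch, out_ch, k=3, bias=True):
    """Number of weights (and biases) in a k x k convolution layer."""
    return k * k * in_ch * out_ch + (out_ch if bias else 0)


def classifier_params(num_scales, in_ch, anchors, num_classes, shared):
    """Parameters of the classification heads over all prediction scales."""
    one_head = conv_params(in_ch, anchors * num_classes)
    return one_head if shared else num_scales * one_head


separate = classifier_params(6, 256, 6, 21, shared=False)
unified = classifier_params(6, 256, 6, 21, shared=True)
print(f"separate heads: {separate:,} parameters")
print(f"unified head:   {unified:,} parameters")
print(f"reduction:      {separate // unified}x")
```

Sharing is only possible when every scale feeds the head with the same number of channels and default boxes, which is why the sketch uses one configuration for all scales.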
3. Experimental Results
3.1. Ablation Study
- The aspect ratio of 1.6 is removed, as no significant improvement is observed with it.
- With the unified prediction layer, no big difference in performance is observed, which indicates that all feature maps share a similar degree of semantics. This supports the claim that the feature combining module spreads out the information effectively.
- The learned deconvolution weights perform better than naive upsampling kernels.
- Performance drops without the 3×3 convolution layer.
3.2. Impact on Different Object Sizes
- The proposed method shows significantly better performance than SSD on small objects (an 8.9 mAP increase).
- StairNet wins on 18 of the 20 classes.
3.3. PASCAL VOC 2007
- StairNet achieves an mAP of 78.8%, which outperforms SSD by 1.6 points.
- It even outperforms DSSD, which uses ResNet-101 as its base network.
3.4. PASCAL VOC 2012
- StairNet achieves 76.4% mAP, which outperforms SSD by 1.6 points.
- It outperforms DSSD, R-FCN, ION, Faster R-CNN and YOLOv2 as well.
3.5. Inference Time
- The inference time of StairNet is measured using an NVIDIA TITAN X (Pascal) GPU with CUDA 8.0 and cuDNN v5.1.
- StairNet outperforms all the current one-stage methods while running at 30 fps.
- The authors emphasize that the PyTorch implementation runs slower and shows lower performance than the original Caffe implementation of exactly the same algorithm.
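As a rough illustration of how a frames-per-second figure like this is usually obtained, the sketch below times repeated calls to a function after a few warm-up runs. It is a generic CPU-side harness, not the authors' benchmark: `measure_fps` and `fake_detector` are hypothetical names, and the dummy workload only stands in for a detector's forward pass. On a GPU, the device would also have to be synchronized before each clock reading, since kernels are launched asynchronously.

```python
import time


def measure_fps(run_once, warmup=5, iters=50):
    """Average throughput of run_once() in calls per second."""
    for _ in range(warmup):
        run_once()  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def fake_detector():
    """Hypothetical stand-in for one forward pass of a detector."""
    sum(i * i for i in range(10_000))


fps = measure_fps(fake_detector)
print(f"{fps:.1f} images/s, {1000.0 / fps:.2f} ms per image")
```

By the same conversion, 30 fps corresponds to roughly 33 ms per image.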
3.6. Qualitative results on PASCAL VOC 2007
- More small objects are detected using StairNet, which is consistent with the motivation of StairNet.
[2018 WACV] [StairNet]
StairNet: Top-Down Semantic Aggregation for Accurate One Shot Detection