Review — ASFF: Learning Spatial Fusion for Single-Shot Object Detection
In this story, Learning Spatial Fusion for Single-Shot Object Detection (ASFF), by Beihang University, is briefly reviewed. In this paper:
- Adaptively Spatial Feature Fusion (ASFF) is proposed to learn to spatially filter conflicting information and suppress the inconsistency across feature levels, thus improving the scale-invariance of features with nearly free inference overhead.
This is a paper in 2019 arXiv. (Sik-Ho Tsang @ Medium)
- Strong Baseline
- Adaptively Spatial Feature Fusion (ASFF)
- Experimental Results
1. Strong Baseline
- In YOLOv3, there are two main components: an efficient backbone (DarkNet-53) and a feature pyramid network of three levels.
- In this paper, a bag of freebies (BoF) (1–3) is applied to build a stronger baseline on YOLOv3.
- A bag of tricks is used in the training process, such as:
- the mixup algorithm,
- the cosine learning rate schedule, and
- the synchronized batch normalization technique.
- Besides those tricks, an anchor-free branch is added to run jointly with the anchor-based ones, and the anchor guiding mechanism proposed in Guided Anchoring (GA) is exploited to refine the results.
- Moreover, an extra Intersection over Union (IoU) loss function is employed on top of the original smooth L1 loss for better bounding box regression.
With these advanced techniques, 38.8% mAP is achieved on the COCO 2017 val set at a speed of 50 FPS (on a Tesla V100), improving the original YOLOv3-608 baseline (33.0% mAP at 52 FPS) by a large margin without heavy computational cost at inference.
- (Yet, in this story, I would like to focus on ASFF, which is also the main contribution of this paper.)
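The extra IoU term mentioned above can be sketched as follows (a minimal 1 − IoU loss in PyTorch; the function name and the (x1, y1, x2, y2) box format are assumptions, and the paper may use a different IoU-loss variant):

```python
import torch

def iou_loss(pred, target, eps=1e-7):
    """Sketch of an IoU-based regression loss (hypothetical helper).

    Boxes are assumed to be (x1, y1, x2, y2); the exact IoU-loss variant
    used in the paper may differ. Returns 1 - IoU per box, so perfect
    overlap gives zero loss; in training it would be added on top of the
    smooth L1 term.
    """
    # Intersection rectangle corners
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)          # zero width/height if no overlap
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    return 1.0 - inter / (union + eps)
```

Identical boxes give a loss near 0, disjoint boxes a loss of 1, so the term directly rewards tighter box overlap than smooth L1 alone.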
2. Adaptively Spatial Feature Fusion (ASFF)
- After building the stronger baseline, ASFF is proposed on top of it.
- It consists of two steps: identically rescaling and adaptively fusing.
2.1. Identically Rescaling (Feature Resizing)
- Because the features at three levels in YOLOv3 have different resolutions as well as different numbers of channels, the up-sampling and down-sampling strategies are modified for each scale accordingly.
- For up-sampling, a 1×1 convolution layer first compresses the number of channels of the features to that of level l, and then the resolution is raised with interpolation.
- For down-sampling with 1/2 ratio, a 3×3 convolution layer with a stride of 2 is used to modify the number of channels and the resolution simultaneously.
- For the scale ratio of 1/4, a 2-stride max pooling layer is added before the 2-stride convolution.
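The rescaling rules above can be sketched in PyTorch as a single module (the `Resize` name and the channel counts in the usage example are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Resize(nn.Module):
    """Sketch of ASFF's rescaling step for one source->target level pair.

    Brings a feature map from level n to the resolution and channel count
    of target level l before fusion.
    """
    def __init__(self, in_ch, out_ch, scale):
        super().__init__()
        self.scale = scale  # >1: up-sample; 0.5 or 0.25: down-sample
        if scale > 1:
            # 1x1 conv compresses channels; interpolation upscales afterwards
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        elif scale == 0.5:
            # 3x3 stride-2 conv changes channels and halves resolution at once
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        else:  # scale == 0.25
            # 2-stride max pooling before the 2-stride conv (1/4 ratio)
            self.conv = nn.Sequential(
                nn.MaxPool2d(kernel_size=2, stride=2),
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            )

    def forward(self, x):
        x = self.conv(x)
        if self.scale > 1:
            x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        return x
```

For example, `Resize(512, 256, 2)` would map a 13×13, 512-channel map to a 26×26, 256-channel one, while `Resize(128, 256, 0.25)` shrinks a 52×52 map to 13×13.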
2.2. Adaptive Fusing
- Let x^{n→l}_ij denote the feature vector at the position (i, j) on the feature maps resized from level n to level l. The features at the corresponding level l are fused as:
- y^l_ij = α^l_ij · x^{1→l}_ij + β^l_ij · x^{2→l}_ij + γ^l_ij · x^{3→l}_ij
- Here, α^l_ij, β^l_ij and γ^l_ij refer to the spatial importance weights for the feature maps from the three levels to level l, which are adaptively learned by the network. They are constrained so that α^l_ij + β^l_ij + γ^l_ij = 1, with each weight in [0, 1].
- They are computed using a Softmax with λ^l_αij, λ^l_βij and λ^l_γij as control parameters, e.g. α^l_ij = e^{λ^l_αij} / (e^{λ^l_αij} + e^{λ^l_βij} + e^{λ^l_γij}).
- 1×1 convolution layers are used to compute the weight scalar maps λ^l_α, λ^l_β and λ^l_γ from x^{1→l}, x^{2→l} and x^{3→l} respectively, such that they can be learned through standard back-propagation.
Thus, the features at all the levels are adaptively aggregated at each scale.
- (And there are paragraphs to show why these are important. If interested, please feel free to read the paper.)
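The fusion step for one target level can be sketched as follows (a minimal PyTorch version, assuming the three inputs are already resized to level l's shape; the `ASFFFuse` name and single-conv weight heads are illustrative choices, not the paper's exact layer configuration):

```python
import torch
import torch.nn as nn

class ASFFFuse(nn.Module):
    """Minimal sketch of the adaptive fusion step for one target level l."""

    def __init__(self, ch):
        super().__init__()
        # One 1x1 conv per source level produces the scalar control maps
        # lambda_alpha, lambda_beta and lambda_gamma.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(ch, 1, kernel_size=1) for _ in range(3)]
        )

    def forward(self, x1, x2, x3):
        # Stack the lambda maps to (B, 3, H, W) and apply Softmax over the
        # level dimension, so alpha + beta + gamma = 1 at every (i, j).
        lam = torch.cat(
            [conv(x) for conv, x in zip(self.weight_convs, (x1, x2, x3))], dim=1
        )
        w = torch.softmax(lam, dim=1)
        alpha, beta, gamma = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return alpha * x1 + beta * x2 + gamma * x3
```

Because the weights sum to 1 at every position, fusing three identical maps returns the same map; in general, each position picks its own convex combination of the three levels.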
3. Experimental Results
3.1. Visualization
- In the paper's visualization, for the image in the first row, all three zebras are predicted from the fused feature maps of level 1, which indicates that their center areas are dominated by the original features of level 1.
- The resized features within those areas from levels 2 and 3 are filtered out (rightmost column). This filtering guarantees that the features of the three zebras at levels 2 and 3 are treated as background.
3.2. Comparison of ASFF and other fusion operations
- Simple concatenation or element-wise sum both sharply downgrade the performance on APL.
- The inconsistency across different levels in feature pyramids brings a negative influence on the training process and thus prevents the potential of the pyramidal feature representation from being fully exploited.
3.3. SOTA Comparison
- The final model is YOLOv3 with ASFF*, an enhanced ASFF version that integrates other lightweight modules (i.e., DropBlock and RFB) with a longer training time.
- While keeping the high efficiency of YOLOv3, its performance is lifted to the same level as state-of-the-art single-shot detectors (e.g., FCOS, CenterNet, and NAS-FPN).
- Note that YOLOv3 can be evaluated at different input resolutions with the same weights; when the input resolution is lowered to pursue a much faster detector, ASFF improves the performance even more significantly.
[2019 arXiv] [ASFF]
Learning Spatial Fusion for Single-Shot Object Detection
2014: [OverFeat] [R-CNN]
2015: [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net]
2016: [OHEM] [CRAFT] [R-FCN] [ION] [MultiPathNet] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [SSD] [YOLOv1]
2017: [NoC] [G-RMI] [TDM] [DSSD] [YOLOv2 / YOLO9000] [FPN] [RetinaNet] [DCN / DCNv1] [Light-Head R-CNN] [DSOD] [CoupleNet]
2018: [YOLOv3] [Cascade R-CNN] [MegDet] [StairNet] [RefineDet] [CornerNet]
2019: [DCNv2] [Rethinking ImageNet Pre-training] [GRF-DSOD & GRF-SSD] [CenterNet] [Grid R-CNN] [NAS-FPN] [ASFF]