Review — ASFF: Learning Spatial Fusion for Single-Shot Object Detection

Using Adaptively Spatial Feature Fusion (ASFF), Outperforms YOLOv3, NAS-FPN, CenterNet, RetinaNet

Sik-Ho Tsang
Jul 31, 2021
ASFF helps YOLOv3 outperform a range of state-of-the-art algorithms.

In this story, Learning Spatial Fusion for Single-Shot Object Detection (ASFF), by Beihang University, is briefly reviewed. In this paper:

  • Adaptively Spatial Feature Fusion (ASFF) is proposed. It learns how to spatially filter conflicting information across pyramid levels to suppress inconsistency, which improves the scale-invariance of the features while adding almost no inference overhead.

This is a 2019 arXiv paper. (Sik-Ho Tsang @ Medium)


  1. Strong Baseline
  2. Adaptively Spatial Feature Fusion (ASFF)
  3. Experimental Results

1. Strong Baseline

  • In YOLOv3, there are two main components: an efficient backbone (DarkNet-53) and a feature pyramid network of three levels.
  • In this paper, a bag of freebies (BoF), i.e. the training tricks (1–3) from [43], is applied in the training process to build a stronger baseline on YOLOv3:
  1. mixup algorithm [12]
  2. cosine learning rate schedule [26], and
  3. the synchronized batch normalization technique [30].
  4. Besides those tricks, an anchor-free branch is added to run jointly with the anchor-based ones, as [45] does, and the anchor guiding mechanism proposed by [38] (Guided Anchoring, GA) is exploited to refine the results.
  5. Moreover, an extra Intersection over Union (IoU) loss function [41] is employed on top of the original smooth L1 loss for better bounding box regression.
Effect of each component on the baseline

With the techniques mentioned above, 38.8% mAP is achieved on the COCO 2017 val set at a speed of 50 FPS (on Tesla V100), improving on the original YOLOv3-608 baseline (33.0% mAP at 52 FPS [31]) by 5.8 points of mAP without heavy computational cost in inference.
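Of the training tricks listed above, the cosine learning rate schedule is the easiest to illustrate. The following is a minimal sketch, not the paper's training code: the function name `cosine_lr`, the `min_lr` floor, and the example values are assumptions made for illustration, and warm-up is left out.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given training step.

    The rate starts at base_lr and decays along half a cosine period
    until it reaches min_lr at total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the schedule is well defined outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 100 steps, starting from a learning rate of 0.1.
schedule = [cosine_lr(s, 100, 0.1) for s in range(101)]
```

The rate falls slowly at first, fastest in the middle of training, and slowly again near the end, which is the usual motivation for preferring it over step decay.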

  • (Yet, in this story, I would like to focus on ASFF, which is also the main contribution of this paper.)

2. Adaptively Spatial Feature Fusion (ASFF)

  • After building the stronger baseline, ASFF is proposed on top of it.
  • It consists of two steps: identically rescaling and adaptively fusing.

2.1. Identically Rescaling (Feature Resizing)

  • Because the features at three levels in YOLOv3 have different resolutions as well as different numbers of channels, the up-sampling and down-sampling strategies are modified for each scale accordingly.
  • For up-sampling, a 1×1 convolution layer is first applied to compress the number of channels to that of level l, and the resolution is then upscaled by interpolation.
  • For down-sampling with 1/2 ratio, a 3×3 convolution layer with a stride of 2 is used to modify the number of channels and the resolution simultaneously.
  • For the scale ratio of 1/4, a 2-stride max pooling layer is added before the 2-stride convolution.
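The resizing rules above can be sketched as a small shape planner. This is an illustrative sketch only: the pyramid shapes (512/256/128 channels at 13/26/52 resolution for a 416×416 input, with level 1 as the coarsest level), the kernel size and padding of the max pooling layer, and the names `FeatureShape`, `conv_out` and `resize_plan` are all assumptions, not details taken from the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureShape:
    channels: int
    height: int
    width: int


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def resize_plan(src: FeatureShape, dst: FeatureShape):
    """Return (ops, shape): the ops that bring src to dst's shape, and the result."""
    if src.height == dst.height:
        return ["identity"], src
    if src.height < dst.height:
        # Up-sampling: compress channels with a 1x1 conv, then interpolate.
        factor = dst.height // src.height
        ops = [
            f"1x1 conv ({src.channels} -> {dst.channels} channels)",
            f"interpolate x{factor}",
        ]
        return ops, FeatureShape(dst.channels, src.height * factor, src.width * factor)
    ratio = src.height // dst.height
    if ratio not in (2, 4):
        raise ValueError(f"unsupported down-sampling ratio 1/{ratio}")
    ops = []
    h, w = src.height, src.width
    if ratio == 4:
        # Assumed pooling hyperparameters: kernel 3, stride 2, padding 1.
        ops.append("max pool, stride 2")
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    # A 3x3 stride-2 conv changes channels and resolution at the same time.
    ops.append(f"3x3 conv, stride 2 ({src.channels} -> {dst.channels} channels)")
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    return ops, FeatureShape(dst.channels, h, w)


# Hypothetical pyramid shapes for a 416x416 input (level 1 = coarsest).
LEVEL1 = FeatureShape(512, 13, 13)
LEVEL2 = FeatureShape(256, 26, 26)
LEVEL3 = FeatureShape(128, 52, 52)

ops, shape = resize_plan(LEVEL3, LEVEL1)  # 1/4 ratio: max pool, then stride-2 conv
```

Under these assumptions, every level can be brought to the shape of any other level, which is what allows the three maps to be fused position by position afterwards.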

2.2. Adaptive Fusing

Adaptively Spatial Feature Fusion (ASFF)
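  • Let x^{n→l}_{ij} denote the feature vector at the position (i, j) on the feature maps resized from level n to level l. The features at the corresponding level l are fused as follows:
    y^l_{ij} = α^l_{ij} · x^{1→l}_{ij} + β^l_{ij} · x^{2→l}_{ij} + γ^l_{ij} · x^{3→l}_{ij}
  • where y^l_{ij} is the (i, j)-th vector of the output feature maps y^l across channels.
  • α^l_{ij}, β^l_{ij} and γ^l_{ij} are the spatial importance weights for the feature maps resized from the three different levels to level l, which are adaptively learned by the network. They satisfy α^l_{ij} + β^l_{ij} + γ^l_{ij} = 1, with each weight in [0, 1].
  • They are computed using softmax with λ^l_{α,ij}, λ^l_{β,ij} and λ^l_{γ,ij} as control parameters respectively, for example:
    α^l_{ij} = exp(λ^l_{α,ij}) / (exp(λ^l_{α,ij}) + exp(λ^l_{β,ij}) + exp(λ^l_{γ,ij}))
  • 1×1 convolution layers are used to compute the weight scalar maps λ^l_α, λ^l_β and λ^l_γ respectively, such that they can be learned through standard back-propagation.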

Thus, the features at all the levels are adaptively aggregated at each scale.
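The fusion step can be sketched in plain Python as a per-position softmax over three control values followed by a weighted sum of the three resized feature vectors. This is only a sketch under stated assumptions: the names `softmax3` and `asff_fuse` are hypothetical, feature maps are nested lists indexed as [row][column][channel], and the control (logit) maps are passed in directly rather than produced by 1×1 convolutions.

```python
import math


def softmax3(logits):
    """Softmax over three control values; the weights are in [0, 1] and sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def asff_fuse(resized, logit_maps):
    """Fuse three feature maps that were already resized to the same level.

    resized:    three maps indexed as [i][j][channel]
    logit_maps: three maps indexed as [i][j], one control value per position
    """
    height, width = len(resized[0]), len(resized[0][0])
    fused = []
    for i in range(height):
        row = []
        for j in range(width):
            # One weight per source level, shared across all channels.
            weights = softmax3([lm[i][j] for lm in logit_maps])
            channels = len(resized[0][i][j])
            row.append([
                sum(w * fm[i][j][c] for w, fm in zip(weights, resized))
                for c in range(channels)
            ])
        fused.append(row)
    return fused


# Toy 1x1 maps with two channels each (made-up values).
x1, x2, x3 = [[[3.0, 0.0]]], [[[0.0, 3.0]]], [[[3.0, 3.0]]]
equal = asff_fuse([x1, x2, x3], [[[0.0]], [[0.0]], [[0.0]]])
dominated = asff_fuse([x1, x2, x3], [[[50.0]], [[0.0]], [[0.0]]])
```

With equal control values the three levels are averaged; when one control value is much larger than the others, the output at that position is essentially the feature from that level, and the other two levels are filtered out there.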

  • (The paper contains further paragraphs explaining why these design choices are important. If interested, please feel free to read the paper.)

3. Experimental Results

3.1. Visualization

Visualization of detection results on COCO val-2017 as well as the learned weight scalar maps at each level.
  • For the image in the first row, all three zebras are predicted from the fused feature maps of level 1. This indicates that their center areas are dominated by the original features of level 1.
  • The resized features within those areas from levels 2 and 3 are filtered out (rightmost). This filtering guarantees that the features of these three zebras at levels 2 and 3 are treated as background.

3.2. Comparison of ASFF and other fusion operations

APs (%) are reported on COCO val-2017.
  • Simple concatenation and element-wise sum (add) both sharply degrade the performance on APL.
  • The inconsistency across different levels in feature pyramids has a negative influence on the training process and thus keeps the potential of pyramidal feature representation from being fully exploited.

3.3. SOTA Comparison

Detection performance in terms of AP (%) and FPS on COCO test-dev
  • The final model is YOLOv3 with ASFF*, an enhanced ASFF version that integrates other lightweight modules (i.e. DropBlock [7] and RFB [23]) and is trained for a longer time.
  • Keeping the high efficiency of YOLOv3, its performance is uplifted to the same level as the state-of-the-art single-shot detectors (e.g., FCOS [36], CenterNet [44], and NAS-FPN [8]).
  • Note that YOLOv3 can be evaluated at different input resolutions with the same weights. When the resolution of input images is lowered to pursue a much faster detector, ASFF improves the performance even more significantly.


