Using Adaptively Spatial Feature Fusion (ASFF), Outperforms YOLOv3, NAS-FPN, CenterNet, RetinaNet

ASFF helps YOLOv3 outperform a range of state-of-the-art algorithms.
  • Adaptively Spatial Feature Fusion (ASFF) is proposed to learn how to spatially filter conflicting information and suppress inconsistency, thus improving the scale-invariance of features while introducing nearly free inference overhead.
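The fusion rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the three pyramid levels have already been resized to a common resolution, and the per-pixel fusion logits would come from 1×1 convolutions in the real network.

```python
import numpy as np

def asff_fuse(feats, logits):
    """Spatially adaptive fusion of three same-shape feature maps.

    feats:  list of 3 arrays, each (H, W, C), already resized to one level
    logits: array (H, W, 3), learned per-pixel fusion logits (alpha, beta, gamma)
    """
    # Softmax over the level axis so weights are positive and sum to 1 per pixel.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)              # (H, W, 3)
    # Weighted sum: y = alpha*x1 + beta*x2 + gamma*x3, per spatial position.
    return sum(w[..., i:i + 1] * feats[i] for i in range(3))
```

Because the weights sum to one at every pixel, fusing three identical maps returns the map unchanged; the interesting behaviour comes from the learned logits down-weighting conflicting levels at each location.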


  1. Strong Baseline
  2. Adaptively Spatial Feature Fusion (ASFF)
  3. Experimental Results

1. Strong Baseline

  • In YOLOv3, there are two main components: an efficient backbone (DarkNet-53) and a feature pyramid network of three levels.
  • In this paper, bag of freebies (BoF) (1–3) [43]

Using Autoencoder at Discriminator, Using Repelling Regularizer at Generator

EBGAN assigns low energies to the regions near the data manifold and higher energies to other regions.
  • EBGAN views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions.
  • Similar to the probabilistic GANs, the generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples.
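The two losses and the repelling regularizer (the "pulling-away term" computed on the autoencoder's latent codes) can be sketched as follows. This is a hedged NumPy illustration of the loss shapes only; the margin value is a placeholder hyperparameter, and in EBGAN the energy is the autoencoder's reconstruction error.

```python
import numpy as np

MARGIN = 10.0  # margin m: a hyperparameter, value here is illustrative

def d_loss(energy_real, energy_fake):
    # Discriminator: assign low energy to real data, push fake energy above the margin.
    return energy_real.mean() + np.maximum(0.0, MARGIN - energy_fake).mean()

def g_loss(energy_fake):
    # Generator: produce samples the discriminator assigns low energy to.
    return energy_fake.mean()

def pulling_away_term(s):
    """Repelling regularizer: mean squared cosine similarity over all pairs
    of latent codes s (N, D), encouraging diverse generated samples."""
    n = s / np.linalg.norm(s, axis=1, keepdims=True)
    cos2 = (n @ n.T) ** 2                   # squared cosine similarities
    m = s.shape[0]
    return (cos2.sum() - m) / (m * (m - 1))  # average over off-diagonal pairs
```

The pulling-away term is 0 for mutually orthogonal codes and 1 when all codes point the same way, so minimizing it pushes generated samples apart in latent space.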


  1. Energy-Based Model
  2. Loss Functions
  3. Autoencoder…

Model FLOPs vs. COCO accuracy
  • First, a weighted bi-directional feature pyramid network (BiFPN) is proposed, which allows easy and fast multi-scale feature fusion.
  • Then, a compound scaling method is also proposed which can uniformly scale the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time.
  • Finally, with EfficientNet as the backbone, a family of object detectors, EfficientDet, is formed, which consistently achieves much better efficiency than prior art, as shown above.

Searching Architectures for FPN, Combining with RetinaNet, Outperforms MobileNetV2 and Mask R-CNN with Less Complexity

NAS-FPN (Left); And Average Precision (AP) vs. Inference Time Per Image (ms) Across Accurate Models (Middle) and Fast Models (Right) on Mobile Device
  • FPN is widely used in object detection tasks as it efficiently detects both small and large objects via multi-scale feature maps in a pyramid style.
  • While prior work applies neural architecture search (NAS) to the feature extraction part or the backbone, here NAS is used to search the FPN architecture itself.
  • Finally, the discovered architecture, named NAS-FPN, consists of a combination of top-down and bottom-up connections to fuse features across scales.
  • This NAS-FPN can be stacked N times for better accuracy…
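The search space is built from "merging cells" that pick two pyramid levels, resize one to match the other, and combine them with a binary op. Below is a toy NumPy reading of such a cell; the nearest-neighbour resize and the exact form of the global-pooling attention op are my simplifications, not the paper's implementation.

```python
import numpy as np

def resize_to(x, h, w):
    # Nearest-neighbour resize so two pyramid levels match spatially.
    ih, iw = x.shape[:2]
    rows = np.arange(h) * ih // h
    cols = np.arange(w) * iw // w
    return x[rows][:, cols]

def merging_cell(a, b, op="sum"):
    """One merging cell: resize b to a's resolution, then combine with a
    binary op (the search space uses sum and global-pooling attention)."""
    b = resize_to(b, *a.shape[:2])
    if op == "sum":
        return a + b
    # Sketch of the attention op: gate a by the globally pooled activation of b.
    gate = 1.0 / (1.0 + np.exp(-b.mean(axis=(0, 1))))   # sigmoid of global pool
    return a * gate + b
```

The NAS controller then decides, for each cell, which two levels to feed in, which op to use, and which output resolution to produce, yielding the mixed top-down/bottom-up wiring described above.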

Desoiling Dataset is Proposed; CycleGAN for Desoiling

Left: soiled camera lens mounted on the car body; Middle: an image captured by the soiled camera shown on the left; Right: an example of an image soiled by heavy rain.
  • Surround-view cameras can get soiled easily. When cameras get soiled, the degradation of performance is usually more dramatic compared to other sensors.
  • First, a Desoiling Dataset is constructed, containing more than 40 video sequences, each approximately one minute long, with paired images of both clean and soiled conditions.
  • Then, CycleGAN is used for desoiling.
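CycleGAN's defining ingredient is the cycle-consistency loss between the two mapping directions (here, soiled↔clean). A minimal sketch, with the generators passed in as plain functions and the weight λ = 10 used only as the common default, not necessarily this paper's setting:

```python
import numpy as np

def cycle_consistency_loss(x_soiled, x_clean, G, F, lam=10.0):
    """CycleGAN cycle loss: F(G(x)) should reconstruct x, and vice versa.

    G: soiled -> clean generator, F: clean -> soiled generator.
    lam weights the L1 cycle term against the adversarial losses (not shown).
    """
    forward = np.abs(F(G(x_soiled)) - x_soiled).mean()   # soiled -> clean -> soiled
    backward = np.abs(G(F(x_clean)) - x_clean).mean()    # clean -> soiled -> clean
    return lam * (forward + backward)
```

This term is what lets CycleGAN learn the desoiling mapping without requiring pixel-aligned soiled/clean pairs during training.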


  1. Motivation
  2. Desoiling Dataset
  3. CycleGAN for Desoiling
  4. Experimental Results

1. Motivation

  • Surround view cameras are becoming de facto…

Training GAN Using Least Square Loss, Higher Image Quality, Improved Training Stability

  • Regular GANs formulate the discriminator as a classifier with the sigmoid cross-entropy loss function, which may lead to the vanishing gradient problem during training.
  • LSGAN is proposed, in which the least squares loss function is adopted for the discriminator.
  • First, LSGANs are able to generate higher-quality images than regular GANs. Second, LSGANs perform more stably during the learning process.
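The least-squares objectives are simple enough to write out directly. A sketch with the common label choice a = 0 (fake), b = c = 1 (real / generator target) from the LSGAN paper:

```python
import numpy as np

# Label choices: a = 0 for fake, b = 1 for real, c = 1 as the generator's target.
A_FAKE, B_REAL, C_TARGET = 0.0, 1.0, 1.0

def d_loss(d_real, d_fake):
    # Least squares instead of sigmoid cross-entropy: samples are penalized by
    # their squared distance to the target value, so fake samples that are far
    # on the "correct" side of the boundary still produce gradients.
    return 0.5 * ((d_real - B_REAL) ** 2).mean() + 0.5 * ((d_fake - A_FAKE) ** 2).mean()

def g_loss(d_fake):
    return 0.5 * ((d_fake - C_TARGET) ** 2).mean()
```

Unlike the sigmoid loss, the gradient here does not saturate once the discriminator is confident, which is the mechanism behind both claimed benefits.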

Using MUNIT, Multi-Style Images Generated From Single Image

Animal image translation using MUNIT
  • In MUNIT, it is assumed that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties.
  • To translate an image to another domain, its content code is recombined with a random style code sampled from the style space of the target domain.
  • Finally, MUNIT allows users to control the style of translation outputs by providing an example style image.
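The decomposition-and-recombination idea can be shown with a deliberately toy stand-in for MUNIT's encoders and decoders: here a flat vector is simply split into a content part and a style part, and translation swaps the style part. The dimensions and functions are illustrative only.

```python
import numpy as np

C_DIM, S_DIM = 6, 2  # illustrative content/style code sizes

def encode(x):
    # Toy stand-in for MUNIT's encoders: split a representation into a
    # domain-invariant content code and a domain-specific style code.
    return x[:C_DIM], x[C_DIM:]

def decode(content, style):
    return np.concatenate([content, style])

def translate(x, target_style):
    # Keep the image's content, swap in a style code from the target domain.
    content, _ = encode(x)
    return decode(content, target_style)
```

Sampling `target_style` at random gives the multimodal outputs, while taking it from an encoded example image gives the user-controlled, example-guided translation described above.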

ADE20K dataset is constructed as the benchmarks for scene parsing and instance segmentation

The list of the objects and their associated parts in the image
  • ADE20K dataset is constructed as the benchmarks for scene parsing and instance segmentation.
  • The effect of synchronized batch normalization is evaluated and it is found that a reasonably large batch size is crucial for the semantic segmentation performance.
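The point of synchronized batch normalization is that the mean and variance are computed over the global batch gathered from all devices, instead of each device's small local slice. A minimal sketch (inference-style, no learnable scale/shift, "devices" simulated as a list of arrays):

```python
import numpy as np

def sync_batch_norm(per_device_batches, eps=1e-5):
    """Normalize each device's batch using statistics of the *global* batch."""
    full = np.concatenate(per_device_batches, axis=0)   # gather across devices
    mean, var = full.mean(axis=0), full.var(axis=0)     # shared statistics
    return [(b - mean) / np.sqrt(var + eps) for b in per_device_batches]
```

With unsynchronized BN, each device would normalize with its own tiny-batch statistics, which is exactly the effective-batch-size problem the paper finds harmful for semantic segmentation.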


  1. ADE20K Dataset
  2. Scene Parsing Benchmark

1. ADE20K Dataset

Cascade-SegNet & Cascade-DilatedNet is Formed Using Cascade Segmentation Module, Outperforms DilatedNet, SegNet & FCN

ADE20K Dataset (The first row shows the sample images, the second row shows the annotation of objects and stuff, and the third row shows the annotation of object parts.)
  • Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade and improve over the baselines.
  • This module is integrated with SegNet and DilatedNet to form the Cascade-SegNet and Cascade-DilatedNet respectively.
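One plausible toy reading of the cascade, on per-pixel label maps rather than real network streams: stuff is predicted first, the object stream fills in the non-stuff pixels, and parts are kept only inside object regions. All names and the masking scheme here are my simplification for illustration.

```python
import numpy as np

def cascade_parse(stuff_pred, object_pred, part_pred):
    """Cascade parsing sketch: stuff first, then objects on non-stuff pixels,
    then parts on object pixels. Inputs are per-pixel label maps; 0 = 'none'."""
    labels = stuff_pred.copy()
    obj_mask = stuff_pred == 0                     # object stream fills non-stuff
    labels[obj_mask] = object_pred[obj_mask]
    part_mask = obj_mask & (part_pred > 0)         # parts only inside objects
    return labels, np.where(part_mask, part_pred, 0)
```

The cascade reflects the dataset's hierarchy above: stuff regions are easy and cheap to label first, and the harder object and part predictions are then restricted to the remaining regions.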


  1. Cascade Segmentation Module
  2. Experimental Results

1. Cascade Segmentation Module

AutoAugment Helps to Find the Best Data Augmentation Policy

One of the data augmentation policies found on SVHN
  • With AutoAugment, the augmentation policy is searched for using a small dataset.
  • After the optimized augmentation policy is found, it is applied when training on the entire dataset, achieving higher accuracy.
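In AutoAugment, a policy is a set of sub-policies, each consisting of two image operations with an associated probability and magnitude. The sketch below shows that structure on NumPy arrays; the operations and the sub-policy values are illustrative placeholders, not an actual policy found by the search.

```python
import random
import numpy as np

# Each operation maps (image, magnitude) -> image; these two are toy examples.
OPS = {
    "invert": lambda img, m: 255 - img,
    "rotate": lambda img, m: np.rot90(img, k=int(m)),
}

# A sub-policy: two (operation, probability, magnitude) triples.
SUB_POLICY = [("invert", 0.9, 0), ("rotate", 0.6, 1)]

def apply_sub_policy(img, sub_policy, rng):
    # Each operation fires independently with its own probability,
    # at its fixed magnitude.
    for name, prob, mag in sub_policy:
        if rng.random() < prob:
            img = OPS[name](img, mag)
    return img
```

During training, one sub-policy is sampled per mini-batch image, so the controller RNN effectively searches over which operations to use and how strongly and how often to apply them.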


  1. Search Space & Search Algorithm
  2. The Controller RNN
  3. Experimental Results

1. Search Space & Search Algorithm

Sik-Ho Tsang

PhD, Researcher. I share what I've learnt and done. :)
