Reading: FishNet / FishNeXt — A Versatile Backbone for Image, Region, and Pixel Level Prediction (Image Classification)

Fish-Like Network, Outperforms ResNet & DenseNet With Fewer Parameters, Outperforms ResNeXt With a Similar Number of Parameters

Sik-Ho Tsang
4 min read · Jun 16, 2020

In this story, FishNet, by The University of Sydney, SenseTime Research, and Zhejiang University, is briefly presented. In this paper:

  • A fish-like network is designed, unifying the advantages of networks designed for pixel-level or region-level prediction tasks.
  • The information of all resolutions is preserved and refined for the final task.
  • With the use of group convolution, FishNeXt is designed.
  • This work is the first to extract high-resolution deep features with high-level semantic meaning and improve image classification accuracy at the same time.

This is a 2018 NeurIPS paper with 30 citations at the time of writing. (Sik-Ho Tsang @ Medium)

Outline

  1. FishNet: Network Architecture
  2. FishNet: Detailed Interaction Between Tail, Body & Head
  3. Experimental Results

1. FishNet: Network Architecture

FishNet: Network Architecture
  • FishNet has three parts: tail, body, and head.
  • The tail uses an existing network design to obtain deep, low-resolution features from the input image.
  • The body obtains high-resolution features that carry high-level semantic information.
  • The head preserves and refines the features from all three parts.
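To make the fish shape concrete, here is a minimal sketch that traces how the feature resolution changes through the three parts. The stage resolutions (56, 28, 14, 7) and the helper name `fish_resolution_trace` are illustrative assumptions, not values or names taken from the paper.

```python
# Hypothetical stage resolutions; illustrative assumptions only.
stage_resolutions = [56, 28, 14, 7]


def fish_resolution_trace(resolutions):
    """Return (part, resolution) pairs in the order features flow through the network."""
    # Tail: downsample from high to low resolution.
    tail = [("tail", r) for r in resolutions]
    # Body: upsample back from low to high resolution.
    body = [("body", r) for r in reversed(resolutions)][1:]
    # Head: downsample again for the final prediction.
    head = [("head", r) for r in resolutions][1:]
    return tail + body + head


trace = fish_resolution_trace(stage_resolutions)
for part, res in trace:
    print(f"{part:>4}: {res}x{res}")
```

The resolution goes down, up, and down again, which is where the fish-like outline comes from.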

1.1. Design for Combining Features from Different Layers

  • Addition, as used in ResNet, only mixes features of different abstraction levels; it cannot preserve or refine both of them. Shallow features serve only to refine the deep features and are discarded after the residual blocks.
  • Concatenation keeps shallow and deep features side by side, so both are preserved and can refine each other.
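The difference between the two ways of combining features can be sketched with NumPy. This is only an illustration with random arrays in a (channels, height, width) layout; the shapes and variable names are assumptions, and no convolution is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
shallow = rng.standard_normal((16, 8, 8))  # (channels, height, width)
deep = rng.standard_normal((16, 8, 8))

# Addition mixes the two feature maps; the inputs cannot be separated afterwards.
added = shallow + deep

# Concatenation stacks them along the channel axis; both inputs stay intact.
concatenated = np.concatenate([shallow, deep], axis=0)

print(added.shape)         # (16, 8, 8)
print(concatenated.shape)  # (32, 8, 8)

# Each input is still recoverable from the concatenated tensor by slicing.
print(np.array_equal(concatenated[:16], shallow))  # True
```

The price of concatenation is a growing channel count, which is why a channel reduction step is needed elsewhere in the network.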

1.2. Gradient Propagation Problem from Isolated Convolution (I-conv)

Up/Down-sampling in (a) ResNet and (b) FishNet
  • In ResNet, a 1×1 convolution is used on the skip path when downsampling, and it becomes an isolated convolution (I-conv).
  • This I-conv hinders direct gradient propagation to shallow layers, since the path is no longer a pure skip connection.
  • In FishNet, by contrast, local and global concatenation is used together with up/down-sampling. No I-conv is needed, so the skip connections are maintained.
  • Nearest-neighbor interpolation is used for up-sampling.
  • Max-pooling is used for down-sampling.
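Both resampling operations are parameter-free, which is what lets the skip path stay free of learned convolutions. Below is a minimal NumPy sketch of the two operations on a (C, H, W) array; the function names and the factor of 2 are assumptions for illustration, not the authors' code.

```python
import numpy as np


def upsample_nearest(x, factor=2):
    """Nearest-neighbor upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def max_pool_non_overlapping(x, size=2):
    """Max-pooling with window == stride == size on a (C, H, W) array."""
    c, h, w = x.shape
    if h % size != 0 or w % size != 0:
        raise ValueError("H and W must be divisible by the pooling size")
    # Split each spatial axis into (blocks, size) and take the max within each block.
    return x.reshape(c, h // size, size, w // size, size).max(axis=(2, 4))


x = np.arange(8, dtype=np.float32).reshape(2, 2, 2)
up = upsample_nearest(x)              # shape (2, 4, 4)
down = max_pool_non_overlapping(up)   # shape (2, 2, 2)

print(up.shape, down.shape)
print(np.array_equal(down, x))  # True: pooling undoes the nearest-neighbor upsampling
```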

2. FishNet: Detailed Interaction Between Tail, Body & Head

FishNet: Detailed Interaction Between Tail, Body & Head
  • Tail: Convolution and downsampling are performed.
  • Body: Convolution and upsampling are performed. At the same time, feature maps from the fish tail are concatenated. Together, the tail and body resemble U-Net.
  • Head: Since this is an image classification task, convolution and downsampling are performed again, together with concatenation of feature maps from shallower layers.
  • k is the channel reduction rate of the feature maps: each group of k adjacent channels is summed element-wise into 1 channel.
  • Since the tail downsamples the features to a resolution of 1×1, these 1×1 features need to be upsampled to 7×7. The SE-block from SENet is used here to map the features from 1×1 to 7×7 using a channel-wise attention operation.
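The channel reduction with rate k described above can be sketched in a few lines of NumPy. This is a minimal illustration assuming a (C, H, W) layout with C divisible by k; the function name `reduce_channels` is hypothetical.

```python
import numpy as np


def reduce_channels(x, k):
    """Sum every k adjacent channels of a (C, H, W) array into one channel."""
    c, h, w = x.shape
    if c % k != 0:
        raise ValueError("channel count must be divisible by k")
    # Group channels as (C // k, k) and sum within each group.
    return x.reshape(c // k, k, h, w).sum(axis=1)


features = np.arange(4 * 2 * 2, dtype=np.float32).reshape(4, 2, 2)
reduced = reduce_channels(features, k=2)

print(reduced.shape)  # (2, 2, 2)
print(np.array_equal(reduced[0], features[0] + features[1]))  # True
```

Like the resampling operations, this reduction has no learned weights, so it does not introduce an I-conv.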

3. Experimental Results

3.1. Ablation Study for Downsampling on ImageNet

ImageNet classification top-1 error rates
  • Using non-overlapping max-pooling obtains the best performance among the above variants.

3.2. Comparison with DenseNet & ResNet on ImageNet

Comparison of the ImageNet classification top-1 error rates
  • In the plots, the vertical axis is the top-1 error rate (lower is better), and the horizontal axis is the number of parameters or FLOPs (further left means fewer).
  • FishNet reaches the same error rates with fewer parameters and fewer FLOPs, outperforming DenseNet and ResNet.

3.3. Comparison with ResNeXt on ImageNet

Comparison with ResNeXt: ImageNet-1k val Top-1 error
  • With the use of group convolution, FishNet becomes FishNeXt.
  • FishNeXt obtains a lower top-1 error rate than ResNeXt with a similar number of parameters.

3.4. MS COCO Val-2017 Detection & Instance Segmentation

AP on MS COCO val-2017 detection and instance segmentation
  • FishNet-150 obtains higher APs for both instance segmentation and object detection using Mask R-CNN and FPN, outperforming ResNet-50 and ResNeXt-50.

