Reading: FishNet / FishNeXt — A Versatile Backbone for Image, Region, and Pixel Level Prediction (Image Classification)

Fish-Like Network, Outperforms ResNet & DenseNet With Fewer Parameters, Outperforms ResNeXt With a Similar Number of Parameters

4 min read · Jun 16, 2020

In this story, FishNet, by The University of Sydney, SenseTime Research, and Zhejiang University, is briefly presented. In this paper:

A fish-like network is designed, unifying the advantages of networks designed for pixel-level and region-level prediction tasks.
The information of all resolutions is preserved and refined for the final task.
With the use of group convolution, FishNeXt is designed.
This work is the first to extract high-resolution deep features with high-level semantic meaning while improving image classification accuracy at the same time.

This is a 2018 NeurIPS paper with 30 citations. (Sik-Ho Tsang @ Medium)

FishNet has three parts.
Tail uses existing network designs to obtain deep low-resolution features from the input image.
Body obtains high-resolution features with high-level semantic information.
Head preserves and refines the features from the three parts.

Addition, as used in ResNet (shown above), only mixes features of different abstraction levels; it cannot preserve or refine both of them. Shallow features serve only to refine the deep features and are discarded after the residual blocks.
Concatenation is used instead to combine shallow and deep features, so that both are preserved and can refine each other.
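The difference between the two ways of combining features can be seen in a minimal NumPy sketch. The array shapes and values below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

# Hypothetical feature maps in (channels, height, width) layout.
shallow = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
deep = np.ones((2, 4, 4), dtype=np.float32)

# ResNet-style addition: the two inputs are mixed into one tensor,
# so the original shallow values cannot be read back from the result alone.
added = shallow + deep

# Concatenation along the channel axis: both inputs stay intact,
# so later layers can still access and refine either of them.
concatenated = np.concatenate([shallow, deep], axis=0)

print(added.shape)         # (2, 4, 4)
print(concatenated.shape)  # (4, 4, 4)
```

After concatenation, the first two channels are still exactly the shallow features and the last two are the deep ones, at the cost of a larger channel count.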

In ResNet, when downsampling, a 1×1 convolution is used on the skip branch, which becomes an isolated convolution (I-conv).
This I-conv hinders efficient direct gradient propagation to shallow layers, since the path is no longer an identity skip connection.
In FishNet, by contrast, local and global concatenation is applied before up-sampling or down-sampling. No I-conv is used, so the skip connection is maintained.
Nearest-neighbor interpolation is used for up-sampling.
Max-pooling is used for down-sampling.
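Both resampling operations are free of learnable weights, which is one way to see why they do not introduce an I-conv on the skip path. The following is a minimal NumPy sketch, not the authors' implementation; the function names `upsample_nearest` and `max_pool` are hypothetical, and a factor of 2 is assumed.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbor up-sampling of a (C, H, W) array: repeat each pixel."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def max_pool(x, size=2):
    """Non-overlapping max-pooling (stride == size) of a (C, H, W) array."""
    c, h, w = x.shape
    # Assumes H and W are divisible by `size`.
    blocks = x.reshape(c, h // size, size, w // size, size)
    return blocks.max(axis=(2, 4))

x = np.array([[[1.0, 2.0],
               [3.0, 4.0]]])      # shape (1, 2, 2)
up = upsample_nearest(x)          # shape (1, 4, 4)
down = max_pool(up)               # shape (1, 2, 2)

print(up[0])
print(down[0])
```

In this toy case, pooling the up-sampled map returns the original values, since each 2×2 block holds four copies of one pixel.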

Tail: Convolution and downsampling are performed.
Body: Convolution and upsampling are performed. At the same time, feature maps from the tail are concatenated. Taken together, the tail and body resemble U-Net.
Head: Since this is an image classification task, convolution and downsampling are performed again, together with concatenation of feature maps from shallow layers.
k: The reduction rate of feature maps. Every k adjacent channels are reduced to 1 channel by element-wise summation.
Because the tail down-samples the features to a resolution of 1×1, these 1×1 features need to be up-sampled to 7×7. An SE block, as in SENet, is used here to map the features from 1×1 to 7×7 using a channel-wise attention operation.
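The channel reduction controlled by k, described above, can be sketched in a few lines of NumPy. This is an illustrative reading of the description rather than the authors' code; the helper name `reduce_channels` is hypothetical, and the channel count is assumed to be divisible by k.

```python
import numpy as np

def reduce_channels(x, k):
    """Sum every k adjacent channels of a (C, H, W) array into one channel."""
    c, h, w = x.shape
    # Assumes C is divisible by k.
    return x.reshape(c // k, k, h, w).sum(axis=1)

# Hypothetical input with 4 channels of size 2x2.
x = np.arange(4 * 2 * 2, dtype=np.float32).reshape(4, 2, 2)
y = reduce_channels(x, k=2)

print(y.shape)  # (2, 2, 2)
```

With k = 2, output channel 0 is the sum of input channels 0 and 1, and output channel 1 is the sum of input channels 2 and 3. No weights are involved, so the reduction adds no parameters.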

Using non-overlapping max-pooling obtains the best performance among the above variants.

In the comparison plots, lower error rates and positions further to the left (fewer parameters/FLOPs) are better.
FishNet reaches the same error rates with fewer parameters and fewer FLOPs, outperforming DenseNet and ResNet.

FishNet-150 obtains higher APs for both instance segmentation and object detection using Mask R-CNN and FPN, outperforming ResNet-50 and ResNeXt-50.

This is the 25th story this month!