Reading: ESPNet — Efficient Spatial Pyramid of Dilated Convolutions (Semantic Segmentation)

ESPNet Outperforms MobileNet, ShuffleNet, ENet, Faster Than PSPNet

Oct 10, 2020

In this story, “ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation” (ESPNet), by the Allen Institute for AI and XNOR.AI, is briefly presented. In this paper:

A new convolutional module, efficient spatial pyramid (ESP), is introduced.
ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% less.
ESPNet outperforms all the current efficient CNN networks such as MobileNet, ShuffleNet, and ENet.
ESPNet can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.

This is a 2018 ECCV paper with about 200 citations. (Sik-Ho Tsang @ Medium)

Outline

ESP Module
ESPNet: Network Architecture
Experimental Results

1. ESP Module

1.1. Point-Wise Convolution & Spatial Pyramid of Dilated Convolutions

The ESP module is a factorized form of convolution that decomposes a standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions.
The point-wise convolution applies a 1×1 convolution to project high-dimensional feature maps onto a low-dimensional space.
The spatial pyramid of dilated convolutions then re-samples these low-dimensional feature maps using K n×n dilated convolutional kernels in parallel, the k-th kernel having a dilation rate of 2^(k−1), k ∈ {1, …, K}. This factorization drastically reduces the number of parameters and the memory required by the ESP module, while preserving a large effective receptive field of [(n−1)2^(K−1)+1]^2.
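To make the saving concrete, here is a small sketch that counts weights for a standard convolution and for the factorization described above. It assumes the point-wise convolution projects M input channels down to d = N/K channels and that each of the K dilated branches maps d channels to d channels, so that concatenating the branches gives N output channels. The function names are made up for illustration, and biases are ignored.

```python
def effective_kernel_side(n: int, K: int) -> int:
    """Side length of the largest dilated kernel: (n - 1) * 2^(K - 1) + 1."""
    return (n - 1) * 2 ** (K - 1) + 1


def standard_conv_params(n: int, M: int, N: int) -> int:
    """Weights of a standard n x n convolution mapping M to N channels (no bias)."""
    return n * n * M * N


def esp_params(n: int, M: int, N: int, K: int) -> int:
    """Weights of the ESP-style factorization (no bias).

    Assumption: the 1x1 convolution reduces M channels to d = N / K,
    and each of the K dilated n x n branches maps d channels to d channels.
    """
    if N % K != 0:
        raise ValueError("this sketch assumes K divides N")
    d = N // K
    pointwise = M * d              # 1x1 projection to the low-dimensional space
    pyramid = K * n * n * d * d    # K parallel dilated n x n convolutions
    return pointwise + pyramid


if __name__ == "__main__":
    n, M, N, K = 3, 128, 128, 4
    side = effective_kernel_side(n, K)
    print("standard conv weights:", standard_conv_params(n, M, N))
    print("ESP module weights:   ", esp_params(n, M, N, K))
    print("effective receptive field: %d x %d" % (side, side))
```

With n = 3, M = N = 128 and K = 4, this gives 147,456 weights for the standard convolution against 40,960 for the factorized form (about 3.6× fewer), with a 17×17 effective receptive field.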

1.2. Hierarchical feature fusion (HFF)

While concatenating the outputs of dilated convolutions gives the ESP module a large effective receptive field, it introduces unwanted checkerboard or gridding artifacts.
Thus, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them. This can reduce or remove the gridding artifacts.
To improve gradient flow inside the network, the input and output feature maps are combined using an element-wise sum.
The whole ESP module thus follows a Reduce-Split-Transform-Merge strategy.
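The merge step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the authors' implementation: each feature map is a flat list of numbers, the branches are assumed to be ordered from the smallest to the largest dilation rate, and the running sum is assumed to start at the first branch. The names `hierarchical_feature_fusion` and `merge` are hypothetical.

```python
from typing import List

FeatureMap = List[float]  # one flattened feature map, for illustration only


def hierarchical_feature_fusion(branches: List[FeatureMap]) -> List[FeatureMap]:
    """Add each branch output to the running sum of the branches before it."""
    fused: List[FeatureMap] = []
    running = None
    for branch in branches:
        if running is None:
            running = list(branch)  # first branch is passed through unchanged
        else:
            running = [a + b for a, b in zip(running, branch)]
        fused.append(running)
    return fused


def merge(branches: List[FeatureMap], module_input: FeatureMap) -> FeatureMap:
    """HFF, then concatenation, then element-wise sum with the module input."""
    fused = hierarchical_feature_fusion(branches)
    concatenated = [value for fmap in fused for value in fmap]
    if len(concatenated) != len(module_input):
        raise ValueError("residual sum needs input and output of equal size")
    return [x + y for x, y in zip(module_input, concatenated)]
```

For example, three branches `[1, 1]`, `[2, 2]`, `[3, 3]` are fused into `[1, 1]`, `[3, 3]`, `[6, 6]` before concatenation, so every output group after the first also carries the responses of the smaller dilation rates.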

2. ESPNet: Network Architecture

ESPNet uses ESP modules for learning convolutional kernels as well as down-sampling operations, except for the first layer: a standard strided convolution. All layers are followed by a batch normalization and a PReLU except the last layer.
The last layer feeds into a softmax for pixel-wise classification.
(a) ESPNet-A: It is a standard network that takes an RGB image as an input and learns representations at different spatial levels using the ESP module to produce a segmentation mask.
(b) ESPNet-B: It improves the flow of information inside ESPNet-A by sharing the feature maps between the previous strided ESP module and the previous ESP module.
(c) ESPNet-C: It reinforces the input image inside ESPNet-B to further improve the flow of information.
These three variants produce outputs whose spatial dimensions are 1/8 of the input image.
(d) ESPNet: The fourth variant, ESPNet, adds a lightweight decoder (built using the principle of reduce-upsample-merge) to ESPNet-C that outputs a segmentation mask of the same spatial resolution as the input image.
A hyper-parameter α controls the depth of the network; the ESP module is repeated α_l times at spatial level l.
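As a rough sketch of the bookkeeping, the snippet below computes the encoder output size and the number of repeated ESP modules for a given choice of α_l. It assumes the 1/8 output resolution comes from three stride-2 down-sampling steps and that the input size is divisible by 8. The function names, the input size and the α values in the usage note are hypothetical, not taken from the paper.

```python
from typing import Dict, Tuple


def encoder_output_size(height: int, width: int,
                        num_downsamplings: int = 3) -> Tuple[int, int]:
    """Spatial size after the encoder, assuming each down-sampling halves H and W."""
    scale = 2 ** num_downsamplings  # three halvings give the 1/8 resolution
    return height // scale, width // scale


def total_repeated_esp_modules(alphas: Dict[int, int]) -> int:
    """Number of repeated ESP modules, given alpha_l for each spatial level l."""
    return sum(alphas.values())
```

For instance, `encoder_output_size(1024, 512)` returns `(128, 64)`, and an illustrative setting such as `{2: 2, 3: 8}` would stack ten repeated ESP modules in total.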

3. Experimental Results

3.1. Comparison with Efficient Convolutional Modules

**Different Modules in ESPNet-C on Cityscapes dataset**

The ESP modules in ESPNet-C are replaced with state-of-the-art efficient convolutional modules, as shown above.
The ESP module outperformed MobileNet and ShuffleNet modules by 7% and 12%, respectively, while learning a similar number of parameters and having comparable network size and inference speed.
Furthermore, the ESP module delivered comparable accuracy to ResNeXt and Inception more efficiently.
A basic ResNet module (stack of two 3×3 convolutions with a skip-connection) delivered the best performance, but had to learn 6.5× more parameters.

3.2. SOTA Comparison

**Comparison between segmentation methods on the Cityscapes test set**

ESPNet is 2% more accurate than ENet, while running 1.27× and 1.16× faster on a desktop and a laptop, respectively.
ESPNet had 8% lower category-wise mIOU than PSPNet, while learning 180× fewer parameters.
ESPNet had lower power consumption, had lower battery discharge rate, and was significantly faster than state-of-the-art methods.
ERFNet, another efficient segmentation network, delivered good segmentation accuracy, but has 5.5× more parameters, is 5.44× larger, consumes more power, and has a higher battery discharge rate than ESPNet.
(There are also results for other datasets and a detailed ablation study. If interested, please feel free to read the paper.)
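For readers unfamiliar with the metric, the mIOU numbers quoted in this section are means of per-class intersection-over-union scores. Below is a minimal sketch of that computation from a confusion matrix, using the usual definition IoU = TP / (TP + FP + FN). It is not the official Cityscapes evaluation script; the function names and the handling of classes with no pixels are assumptions made for this example.

```python
from typing import List


def per_class_iou(confusion: List[List[int]]) -> List[float]:
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    ious = []
    for c in range(len(confusion)):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                 # class-c pixels that were missed
        fp = sum(row[c] for row in confusion) - tp  # pixels wrongly labelled as c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


def mean_iou(confusion: List[List[int]]) -> float:
    """Mean IoU over classes that appear in the ground truth or the prediction."""
    valid = [v for v in per_class_iou(confusion) if v == v]  # NaN != NaN
    if not valid:
        raise ValueError("confusion matrix contains no pixels")
    return sum(valid) / len(valid)
```

With the two-class matrix `[[3, 1], [1, 5]]`, the per-class scores are 3/5 and 5/7, and `mean_iou` averages the two.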

Reference

[2018 ECCV] [ESPNet]
ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation

Semantic Segmentation

[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [DPN] [ENet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [ERFNet] [GCN] [PSPNet] [DeepLabv3] [ESPNet] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+] [DRRN Zhang JNCA’20]