Reading: ESPNet — Efficient Spatial Pyramid of Dilated Convolutions (Semantic Segmentation)
ESPNet Outperforms MobileNet, ShuffleNet, ENet, Faster Than PSPNet
In this story, “ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation” (ESPNet), by the Allen Institute for AI and XNOR.AI, is briefly presented. In this paper:
- A new convolutional module, efficient spatial pyramid (ESP), is introduced.
- ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% lower.
- ESPNet outperforms current efficient CNN architectures such as MobileNet, ShuffleNet, and ENet.
- ESPNet can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.
This is a paper in 2018 ECCV with about 200 citations. (Sik-Ho Tsang @ Medium)
Outline
- ESP Module
- ESPNet: Network Architecture
- Experimental Results
1. ESP Module
1.1. Point-Wise Convolution & Spatial Pyramid of Dilated Convolutions
- ESP module is a factorized form of convolutions that decompose a standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions.
- The point-wise convolution applies a 1×1 convolution to project high-dimensional feature maps onto a low-dimensional space.
- The spatial pyramid of dilated convolutions then re-samples these low-dimensional feature maps using K n×n dilated convolutional kernels in parallel, each with a dilation rate of 2^(k−1), k = {1, …, K}. This factorization drastically reduces the number of parameters and the memory required by the ESP module, while preserving a large effective receptive field of [(n−1)2^(K−1)+1]^2.
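To make the savings concrete, here is a quick back-of-the-envelope calculation comparing a standard convolution with the ESP factorization. The channel counts M = N = 128 with K = 4 and n = 3 are illustrative values chosen for this sketch, not numbers from the paper:

```python
def standard_conv_params(M, N, n):
    # An n x n convolution mapping M input channels to N output channels.
    return n * n * M * N

def esp_params(M, N, K, n):
    # Point-wise 1x1 conv reduces M channels to d = N / K, then K parallel
    # n x n dilated convs each map d -> d channels before concatenation.
    d = N // K
    return M * d + K * n * n * d * d

def esp_receptive_field(K, n):
    # Effective receptive field of the spatial pyramid:
    # [(n - 1) * 2^(K - 1) + 1]^2
    side = (n - 1) * 2 ** (K - 1) + 1
    return side, side

M, N, K, n = 128, 128, 4, 3           # illustrative values
print(standard_conv_params(M, N, n))  # 147456 parameters
print(esp_params(M, N, K, n))         # 40960 parameters (~3.6x fewer)
print(esp_receptive_field(K, n))      # (17, 17)
```

With these numbers, the factorized module needs roughly 3.6× fewer parameters than the standard convolution while its receptive field grows to 17×17 instead of 3×3.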
1.2. Hierarchical feature fusion (HFF)
- While concatenating the outputs of the dilated convolutions gives the ESP module a large effective receptive field, it introduces unwanted checkerboard or gridding artifacts.
- Thus, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them. This can reduce or remove the gridding artifacts.
- To improve gradient flow inside the network, the input and output feature maps are combined using an element-wise sum.
- The whole process of the ESP module thus follows a Reduce-Split-Transform-Merge strategy.
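The hierarchical addition can be sketched with plain Python lists standing in for feature maps. This is a toy illustration of HFF, not the authors' implementation (real feature maps are 3-D tensors and the sums are element-wise over all spatial positions):

```python
def hff_merge(branches):
    # branches: outputs of the K dilated convolutions, ordered by
    # increasing dilation rate; here each "feature map" is a flat list.
    fused = [branches[0]]   # the dilation-1 output passes through unchanged
    for fk in branches[1:]:
        prev = fused[-1]    # add the previously fused map before moving on
        fused.append([a + b for a, b in zip(prev, fk)])
    # finally, concatenate the hierarchically fused maps along channels
    return [v for fm in fused for v in fm]

# toy example: three branches of two "channels" each
print(hff_merge([[1, 1], [2, 2], [3, 3]]))  # [1, 1, 3, 3, 6, 6]
```

Because each branch is summed with the already-fused output of the smaller dilation rates before concatenation, the abrupt transitions that cause gridding artifacts are smoothed out at no extra parameter cost.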
2. ESPNet: Network Architecture
- ESPNet uses ESP modules both for learning convolutional kernels and for down-sampling operations, except for the first layer, which is a standard strided convolution. All layers are followed by batch normalization and a PReLU activation, except the last layer.
- The last layer feeds into a softmax for pixel-wise classification.
- (a) ESPNet-A: It is a standard network that takes an RGB image as an input and learns representations at different spatial levels using the ESP module to produce a segmentation mask.
- (b) ESPNet-B: It improves the flow of information inside ESPNet-A by sharing the feature maps between the previous strided ESP module and the previous ESP module.
- (c) ESPNet-C: It reinforces the input image inside ESPNet-B to further improve the flow of information.
- These three variants produce outputs whose spatial dimensions are 1/8 of the input image.
- (d) ESPNet: The fourth variant, ESPNet, adds a lightweight decoder (built using a reduce-upsample-merge principle) to ESPNet-C that outputs a segmentation mask with the same spatial resolution as the input image.
- A hyper-parameter α controls the depth of the network; the ESP module is repeated α_l times at spatial level l.
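One practical detail the summary glosses over is how an ESP module's N output channels are divided among the K parallel branches when N is not a multiple of K. A common convention, assumed here for illustration rather than stated above, is to give each branch d = N // K channels and assign the remainder to the first (dilation-1) branch:

```python
def esp_branch_channels(n_out, K):
    # Split n_out output channels across K parallel dilated branches.
    # Each branch gets d = n_out // K channels; any remainder goes to
    # the first (dilation-1) branch, so the concatenated output still
    # has exactly n_out channels. (Assumed convention, for illustration.)
    d = n_out // K
    return [n_out - d * (K - 1)] + [d] * (K - 1)

print(esp_branch_channels(128, 4))  # [32, 32, 32, 32]
print(esp_branch_channels(131, 4))  # [35, 32, 32, 32]
```

This keeps the concatenated width exact at every spatial level, so the α_l repetitions can be stacked without any channel-count bookkeeping between modules.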
3. Experimental Results
3.1. Comparison with Efficient Convolutional Modules
- The ESP modules in ESPNet-C are replaced with state-of-the-art efficient convolutional modules, as shown above.
- The ESP module outperformed the MobileNet and ShuffleNet modules by 7% and 12%, respectively, while learning a similar number of parameters and having a comparable network size and inference speed.
- Furthermore, the ESP module delivered comparable accuracy to ResNeXt and Inception more efficiently.
- A basic ResNet module (stack of two 3×3 convolutions with a skip-connection) delivered the best performance, but had to learn 6.5× more parameters.
3.2. SOTA Comparison
- ESPNet is 2% more accurate than ENet, while running 1.27× and 1.16× faster on a desktop and a laptop, respectively.
- ESPNet had 8% lower category-wise mIOU than PSPNet, while learning 180× fewer parameters.
- ESPNet had lower power consumption, had lower battery discharge rate, and was significantly faster than state-of-the-art methods.
- ERFNet, another efficient segmentation network, delivered good segmentation accuracy, but had 5.5× more parameters, was 5.44× larger, consumed more power, and had a higher battery discharge rate than ESPNet.
- (There are also results for other datasets and detailed ablation study. If interested, please feel free to read the paper.)
Reference
[2018 ECCV] [ESPNet]
ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [DPN] [ENet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [ERFNet] [GCN] [PSPNet] [DeepLabv3] [ESPNet] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+] [DRRN Zhang JNCA’20]