Reading: ESPNet — Efficient Spatial Pyramid of Dilated Convolutions (Semantic Segmentation)

ESPNet Outperforms MobileNet, ShuffleNet, ENet, Faster Than PSPNet

Sik-Ho Tsang
4 min read · Oct 10, 2020

In this story, “ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation” (ESPNet), by the University of Washington, the Allen Institute for AI, and XNOR.AI, is briefly presented. In this paper:

  • A new convolutional module, efficient spatial pyramid (ESP), is introduced.
  • ESPNet is 22 times faster (on a standard GPU) and 180 times smaller than the state-of-the-art semantic segmentation network PSPNet, while its category-wise accuracy is only 8% lower.
  • ESPNet outperforms current efficient CNNs such as MobileNet, ShuffleNet, and ENet.
  • ESPNet can process high resolution images at a rate of 112 and 9 frames per second on a standard GPU and edge device, respectively.

This paper was published at ECCV 2018 and has about 200 citations. (Sik-Ho Tsang @ Medium)


  1. ESP Module
  2. ESPNet: Network Architecture
  3. Experimental Results

1. ESP Module

ESP Module

1.1. Point-Wise Convolution & Spatial Pyramid of Dilated Convolutions

  • The ESP module is a factorized form of convolution that decomposes a standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions.
  • The point-wise convolution applies a 1×1 convolution to project high-dimensional feature maps onto a low-dimensional space.
  • The spatial pyramid of dilated convolutions then re-samples these low-dimensional feature maps using K dilated convolutional kernels of size n×n in parallel, where the k-th kernel has a dilation rate of 2^(k−1), k ∈ {1, …, K}. This factorization drastically reduces the number of parameters and the memory required by the ESP module, while preserving a large effective receptive field of [(n−1)2^(K−1)+1]^2.
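
To make the savings concrete, here is a minimal sketch that counts weights for a standard convolution and for the factorization described above, and evaluates the receptive-field formula. The symbols M (input channels), N (output channels) and the reduced width d = N/K are assumptions introduced for illustration; they are not defined in this story, and biases are ignored.

```python
def standard_conv_params(m, n_out, n):
    """Weights in a standard n x n convolution mapping m to n_out channels (bias ignored)."""
    return n * n * m * n_out


def esp_params(m, n_out, n, k):
    """Weights in an ESP-style factorization (bias ignored).

    Assumption: the reduced width is d = n_out // k, so that concatenating
    the k branch outputs restores n_out channels.
    """
    d = n_out // k
    pointwise = m * d             # 1x1 convolution: m -> d channels
    pyramid = k * n * n * d * d   # k dilated n x n convolutions, each d -> d channels
    return pointwise + pyramid


def effective_receptive_field(n, k):
    """Side length of the receptive field of the most dilated branch: (n - 1) * 2^(k - 1) + 1."""
    return (n - 1) * 2 ** (k - 1) + 1


m = n_out = 128  # illustrative channel counts
print(standard_conv_params(m, n_out, n=3))   # 147456
print(esp_params(m, n_out, n=3, k=4))        # 40960
print(effective_receptive_field(n=3, k=4))   # 17, i.e. a 17x17 field
```

With these toy values the factorization needs 3.6× fewer weights than the standard 3×3 convolution while covering a 17×17 effective receptive field.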

1.2. Hierarchical feature fusion (HFF)

ESP Strategy
  • While concatenating the outputs of dilated convolutions gives the ESP module a large effective receptive field, it introduces unwanted checkerboard or gridding artifacts.
  • Thus, the feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them. This can reduce or remove the gridding artifacts.
  • To improve gradient flow inside the network, the input and output feature maps are combined using an element-wise sum.
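  • Overall, the ESP module follows a Reduce-Split-Transform-Merge strategy: reduce with the point-wise convolution, split into K branches, transform each branch with a dilated convolution, and merge with HFF and concatenation.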
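
The merge step can be illustrated with a toy sketch in plain Python, where each branch output is a flat list of numbers standing in for a feature map. The function names are hypothetical, and the choice to run the hierarchical sum over all branches in order of increasing dilation rate is an assumption of this sketch; the paper's implementation may differ in detail.

```python
from itertools import accumulate


def hierarchical_feature_fusion(branch_outputs):
    """Running element-wise sum over branch outputs, ordered by increasing dilation rate."""
    def add(a, b):
        return [x + y for x, y in zip(a, b)]

    return list(accumulate(branch_outputs, add))


def esp_merge(module_input, branch_outputs):
    """Fuse the branches hierarchically, concatenate them, then add the module input."""
    fused = hierarchical_feature_fusion(branch_outputs)
    concatenated = [value for branch in fused for value in branch]
    if len(concatenated) != len(module_input):
        raise ValueError("the residual sum needs input and output of the same size")
    return [x + y for x, y in zip(module_input, concatenated)]


# Three toy branches (dilation rates 1, 2, 4), two values each.
branches = [[1, 1], [2, 2], [3, 3]]
print(hierarchical_feature_fusion(branches))  # [[1, 1], [3, 3], [6, 6]]
print(esp_merge([0] * 6, branches))           # [1, 1, 3, 3, 6, 6]
```

Each fused branch already contains the responses of all less-dilated branches, which is how the hierarchical sum is meant to smooth out the gaps left by large dilation rates.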

2. ESPNet: Network Architecture

ESPNet Variants
  • ESPNet uses ESP modules for learning convolutional kernels as well as for down-sampling operations, except for the first layer, which is a standard strided convolution. Every layer except the last is followed by batch normalization and a PReLU.
  • The last layer feeds into a softmax for pixel-wise classification.
  • (a) ESPNet-A: It is a standard network that takes an RGB image as an input and learns representations at different spatial levels using the ESP module to produce a segmentation mask.
  • (b) ESPNet-B: It improves the flow of information inside ESPNet-A by sharing the feature maps between the previous strided ESP module and the previous ESP module.
  • (c) ESPNet-C: It reinforces the input image inside ESPNet-B to further improve the flow of information.
  • These three variants produce outputs whose spatial dimensions are 1/8 of the input image.
  • (d) ESPNet: The fourth variant, ESPNet, adds a lightweight decoder (built on a reduce-upsample-merge principle) to ESPNet-C and outputs a segmentation mask with the same spatial resolution as the input image.
  • A hyper-parameter α controls the depth of the network; the ESP module is repeated α_l times at spatial level l.
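
As a rough sketch of how α_l sets the depth and why the encoder output is 1/8 of the input, the snippet below builds a hypothetical layer plan and tracks the spatial size. The layer names, the exact ordering, and the α values used in the example are illustrative assumptions, not the paper's configuration.

```python
def encoder_plan(alpha_2, alpha_3):
    """Return (layer_name, stride) pairs for a hypothetical ESPNet-style encoder."""
    plan = [("conv", 2)]                                   # level 1: standard strided convolution
    plan += [("esp_strided", 2)] + [("esp", 1)] * alpha_2  # level 2: down-sample, then alpha_2 ESP modules
    plan += [("esp_strided", 2)] + [("esp", 1)] * alpha_3  # level 3: down-sample, then alpha_3 ESP modules
    return plan


def output_size(height, width, plan):
    """Spatial size after applying every stride in the plan."""
    for _, stride in plan:
        height, width = height // stride, width // stride
    return height, width


plan = encoder_plan(alpha_2=2, alpha_3=3)   # illustrative depths
print(len(plan))                            # 8 layers
print(output_size(1024, 2048, plan))        # (128, 256), i.e. 1/8 of the input
```

Raising α_2 or α_3 adds stride-1 ESP modules, so the network gets deeper without changing the output resolution.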

3. Experimental Results

3.1. Comparison with Efficient Convolutional Modules

Different Modules in ESPNet-C on the Cityscapes dataset
  • The ESP modules in ESPNet-C are replaced with state-of-the-art efficient convolutional modules, as shown above.
  • The ESP module outperformed the MobileNet and ShuffleNet modules by 7% and 12%, respectively, while learning a similar number of parameters and having comparable network size and inference speed.
  • Furthermore, the ESP module delivered comparable accuracy to ResNeXt and Inception more efficiently.
  • A basic ResNet module (stack of two 3×3 convolutions with a skip-connection) delivered the best performance, but had to learn 6.5× more parameters.

3.2. SOTA Comparison

Comparison between segmentation methods on the Cityscapes test set
  • ESPNet is 2% more accurate than ENet, while running 1.27× and 1.16× faster on a desktop and a laptop, respectively.
  • ESPNet had 8% lower category-wise mIOU than PSPNet, while learning 180× fewer parameters.
  • ESPNet had lower power consumption and a lower battery discharge rate, and was significantly faster than state-of-the-art methods.
  • ERFNet, another efficient segmentation network, delivered good segmentation accuracy, but has 5.5× more parameters, is 5.44× larger, consumes more power, and has a higher battery discharge rate than ESPNet.
  • (There are also results for other datasets and a detailed ablation study. If interested, please feel free to read the paper.)


