Reading: ESPNetv2 — A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network (Image Classification, Semantic Segmentation, etc)

More Efficient Than ShuffleNet V2, Outperforms CondenseNet, DenseNet, Xception, ShuffleNet V1, IGCV3, MobileNetV2, MobileNetV1 in IC, SegNet, ESPNet & ENet in SS, YOLOv2 & SSD in OD

In this story, “ESPNetv2: A Light-weight, Power Efficient, and General Purpose Convolutional Neural Network” (ESPNetv2), by University of Washington, Allen Institute for AI (AI2), and XNOR.AI, is presented. In this paper:

  • EESP (Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions) is proposed to learn representations from a large effective receptive field with fewer FLOPs and parameters.
  • It can be used in four tasks: (1) image/multi-object classification, (2) semantic segmentation, (3) object detection, and (4) language modeling.
  • In particular, ESPNetv2 outperforms ESPNet by 4–5% and has 2−4× fewer FLOPs on the PASCAL VOC and the Cityscapes dataset.

This is a 2019 CVPR paper with over 70 citations. (Sik-Ho Tsang @ Medium)


  1. EESP: Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions
  2. Strided EESP
  3. ESPNetv2: Network Architecture
  4. Ablation Study
  5. Experimental Results

1. EESP: Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions

EESP: Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions
  • (a) ESP: This is the ESP module in ESPNet.
  • Different dilation rates in each branch allow the ESP unit to learn the representations from a large effective receptive field.
  • (If interested, please feel free to read ESPNet.)
  • (b) EESP-A: To make the ESP module more efficient, EESP is proposed.
  • First, the point-wise convolutions are replaced with group point-wise convolutions.
  • Then, the computationally expensive 3×3 dilated convolutions are replaced with their economical counterparts, i.e., depth-wise dilated separable convolutions.
  • To remove gridding artifacts, the hierarchical feature fusion (HFF) method is still adopted, but an additional 1×1 convolution is added in each branch.
  • For example, the EESP unit learns 7× fewer parameters than the ESP unit when M=240, g=K=4, and d=M/K=60, where M is the number of input feature maps, d is the number of feature maps in each branch (the output of the point-wise reduction), K is the number of parallel branches, and g is the number of groups in the group point-wise convolution.
  • (c) EESP: Computing K point-wise (or 1 × 1) convolutions in (b) independently is equivalent to a single group point-wise convolution with K groups in terms of complexity; however, group point-wise convolution is more efficient in terms of implementation.
  • Therefore, replace these K point-wise convolutions with a group point-wise convolution.
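The 7× figure quoted above can be checked with a short parameter count. This is a minimal sketch that tallies only convolution weights (no biases or batch-norm parameters) for the ESP unit in (a) and the EESP-A unit in (b), assuming n×n kernels with n=3; the function names `esp_params` and `eesp_params` are made up for illustration.

```python
def esp_params(M, d, K, n=3):
    """ESP: point-wise reduction (M -> d) plus K standard n x n dilated convs (d -> d)."""
    return M * d + (n * n) * d * d * K


def eesp_params(M, d, K, g, n=3):
    """EESP-A: group point-wise reduction (g groups) plus, in each of K branches,
    an n x n depth-wise dilated conv (n*n*d weights) and a 1 x 1 conv (d*d weights)."""
    return (M * d) // g + (n * n + d) * d * K


M, K, g = 240, 4, 4
d = M // K  # 60 feature maps per branch

esp = esp_params(M, d, K)
eesp = eesp_params(M, d, K, g)
print(esp, eesp, round(esp / eesp, 2))  # 144000 20160 7.14
```

The dilation rate does not appear in either count, because dilation changes where a kernel samples the input, not how many weights it has.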

2. Strided EESP

Strided EESP

2.1. The Right Path (Black Color)

  • Strided EESP is used when the feature maps need to be down-sampled.
  • First, the depth-wise dilated convolutions are replaced with their strided counterparts.
  • Then, an average pooling operation replaces the identity connection.
  • Finally, the element-wise addition operation is replaced with a concatenation operation, which helps in expanding the dimensions of the feature maps efficiently.
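The shape bookkeeping behind these three changes can be sketched in plain Python. The kernel size of 3, the stride of 2, the padding choices, and the split of output channels between the two branches are assumptions made for illustration; `conv_out_size` and `strided_eesp_shape` are hypothetical helper names.

```python
def conv_out_size(size, kernel, stride, padding, dilation=1):
    """Standard output-size formula for a convolution or pooling window."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def strided_eesp_shape(in_channels, out_channels, height, width, dilation=1):
    """Return (channels, height, width) after a strided EESP unit.

    Assumption: the convolutional branch emits out_channels - in_channels maps,
    so concatenating it with the pooled input yields exactly out_channels.
    """
    # Convolutional branch: strided 3x3 depth-wise dilated conv, padding = dilation.
    conv_h = conv_out_size(height, 3, 2, dilation, dilation)
    conv_w = conv_out_size(width, 3, 2, dilation, dilation)
    # Shortcut branch: 3x3 average pooling with stride 2 instead of an identity.
    pool_h = conv_out_size(height, 3, 2, 1)
    pool_w = conv_out_size(width, 3, 2, 1)
    assert (conv_h, conv_w) == (pool_h, pool_w), "branches must align spatially"
    branch_channels = out_channels - in_channels
    # Concatenation adds channel counts; addition would require them to be equal.
    return (branch_channels + in_channels, conv_h, conv_w)


print(strided_eesp_shape(32, 64, 112, 112))              # (64, 56, 56)
print(strided_eesp_shape(32, 64, 112, 112, dilation=4))  # (64, 56, 56)
```

Because the pooled input keeps its original channel count, concatenation widens the feature maps without the convolutional branch having to produce every output channel itself.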

2.2. The Left Path (Red Color)

  • Spatial information is lost during down-sampling and convolution (filtering) operations.
  • To compensate, an efficient long-range shortcut connection with the input image is added.
  • This connection first down-samples the image to the same size as the feature map, then learns representations using a stack of two convolutions.
  • The first convolution is a standard 3×3 convolution that learns the spatial representations while the second convolution is a point-wise 1×1 convolution.

3. ESPNetv2: Network Architecture

ESPNetv2: Network Architecture
  • The ESPNetv2 networks are built using EESP units, as shown above.
  • At each spatial level, ESPNetv2 repeats the EESP units several times to increase the depth of the network.
  • Batch normalization and PReLU are used after every convolutional layer, except for the last group-wise convolutional layer, where PReLU is applied after the element-wise sum operation.
  • K is set to 4 and g = K.
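With K = 4 branches, the spatial pyramid inside one EESP unit can be illustrated numerically. The sketch below assumes that branch k (counting from 1) uses dilation rate 2^(k−1), which is not stated in this summary, and uses scalars as stand-ins for feature maps to show the order in which HFF sums branch outputs before concatenation.

```python
from itertools import accumulate

K = 4  # number of parallel branches
n = 3  # kernel size of each depth-wise dilated convolution

# Assumption: branch k (1-indexed) uses dilation rate 2**(k - 1).
dilations = [2 ** k for k in range(K)]
# An n x n kernel with dilation r covers (n - 1) * r + 1 pixels per side.
effective_kernels = [(n - 1) * r + 1 for r in dilations]
print(dilations)          # [1, 2, 4, 8]
print(effective_kernels)  # [3, 5, 9, 17]

# HFF: each branch output is added to the fused output of the previous branch,
# i.e. a running sum across branches. Scalars stand in for feature maps here.
branch_outputs = [1.0, 10.0, 100.0, 1000.0]
fused = list(accumulate(branch_outputs))
print(fused)  # [1.0, 11.0, 111.0, 1111.0]
```

The running sum means that every fused branch also contains the densely sampled output of the smaller dilation rates, which is how HFF counters the gridding artifacts of large dilations.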
Cyclic learning rate policy
  • Cyclic learning rate policy is used for better convergence. This learning rate scheme can be seen as a variant of the cosine learning policy.
  • The cycle length varies with the epoch number.
  • (If interested, please read Ref[28] in this paper.)
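A cyclic schedule with warm restarts can be sketched as follows. This is a simplified illustration, not the paper's exact policy: it assumes a linear decay within each cycle and a fixed cycle length, and the values `eta_max=0.5`, `eta_min=0.1`, and `cycle_len=5` are placeholders chosen for the example.

```python
def cyclic_lr(epoch, eta_max=0.5, eta_min=0.1, cycle_len=5):
    """Decay linearly from eta_max within a cycle, then restart at eta_max.

    All default values are illustrative assumptions.
    """
    return eta_max - (epoch % cycle_len) * eta_min


schedule = [round(cyclic_lr(t), 2) for t in range(10)]
print(schedule)  # [0.5, 0.4, 0.3, 0.2, 0.1, 0.5, 0.4, 0.3, 0.2, 0.1]
```

A cycle length that changes with the epoch number could be modeled by making `cycle_len` a function of `epoch`.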

4. Ablation Study

4.1. Impact of different convolutions

Ablation Study on ImageNet dataset
  • Clearly, depth-wise dilated separable convolutions are more effective than dilated and depth-wise convolutions.

4.2. Impact of hierarchical feature fusion (HFF)

Ablation Studies on the ImageNet Dataset
  • (R1 and R2): HFF improves classification performance by about 1.5% while having no impact on the network’s complexity.

4.3. Impact of long-range shortcut connections with the input

  • (R2 and R3): These connections are effective and efficient, improving the performance by about 1% with little (or negligible) impact on the network’s complexity.

4.4. Fixed vs cyclic learning schedule

  • (R3 and R4): With the cyclic learning schedule, the ESPNetv2 network achieves about 1% higher top-1 validation accuracy on the ImageNet dataset, suggesting that the cyclic schedule finds a better local minimum than the fixed schedule.
  • (R4 and R5): Further, when the ESPNetv2 network is trained for longer (300 epochs), performance improves by about 4%.

5. Experimental Results

5.1. Image Classification

Performance comparison of different efficient networks on the ImageNet validation set
  • Compared to MobileNets, ESPNetv2 delivers better performance especially under small computational budgets. With 28 million FLOPs, ESPNetv2 outperforms MobileNetV1 (34 million FLOPs) and MobileNetV2 (30 million FLOPs) by 10% and 2% respectively.
  • ESPNetv2 delivers comparable accuracy to ShuffleNet V2 without any channel split.
  • ESPNetv2 is 1.1% more accurate than CondenseNet.
Performance analysis of different efficient networks (computational budget ≈ 300 million FLOPs)
  • The inference speed of ESPNetv2 is slightly lower than that of the fastest network (ShuffleNet V2) on both devices; however, it is much more power efficient while delivering similar accuracy on the ImageNet dataset.
  • This suggests that the ESPNetv2 network offers a good trade-off between accuracy, power consumption, and latency, a highly desirable property for any network running on edge devices.

5.2. Multi-label Classification

Performance improvement in F1-score of ESPNetv2 over ShuffleNetv2 on MS-COCO multi-object classification task
  • ESPNetv2 outperforms ShuffleNet V2 by a large margin, especially when tested at an image resolution of 896×896, suggesting that the large effective receptive fields of the EESP unit help ESPNetv2 learn better representations.

5.3. Semantic Segmentation

Semantic segmentation results on (a) the Cityscapes dataset and (b) the PASCAL VOC 2012 dataset
  • Under similar computational constraints, ESPNetv2 outperforms existing methods like ENet and ESPNet by a large margin.
  • Notably, ESPNetv2 is 2–3% less accurate than other efficient networks such as ICNet, ERFNet, and ContextNet, but has 9−12× fewer FLOPs.

5.4. Object Detection

Object detection results on the PASCAL VOC 2007 and the MS-COCO dataset
  • ESPNetv2 delivers the same performance as YOLOv2, but has 25× fewer FLOPs.
  • Compared to SSD, ESPNetv2 delivers a very competitive performance while being very efficient.

5.5. Language Modeling

Single model word-level perplexity on test set of the Penn Treebank dataset. (Lower perplexity value represents better performance)
  • The EESP unit is put inside the LSTM cell for processing the input vector.
  • The 2D convolutions are replaced with 1D convolutions in the EESP unit.
  • This model is called ERU (Efficient Recurrent Unit).
  • Three layers of ERUs with an embedding size of 400 are used.
  • ERUs deliver similar (only 1 point less than PRU) or better performance than state-of-the-art recurrent networks while learning fewer parameters.
  • The smallest ERU language model, with 7 million parameters, outperforms most state-of-the-art language models.


