Reading: ShuffleNet V2 — Practical Guidelines for Efficient CNN Architecture Design (Image Classification)

Outperforms MobileNetV1, MobileNetV2, ShuffleNet V1, DenseNet, CondenseNet, Xception, IGCV2, IGCV3, NASNet-A, PNASNet-5, SENet & ResNet

Sik-Ho Tsang
9 min read · Oct 4, 2020

In this story, “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design” (ShuffleNet V2), by Megvii Inc. (Face++) and Tsinghua University, is presented.

  • Indirect Metric: Currently, neural network architecture design is mostly guided by the indirect metric of computational complexity, i.e., FLOPs.
  • Direct Metric: The direct metric, e.g., speed, also depends on other factors such as memory access cost (MAC) and platform (GPU or ARM) characteristics.

In this paper:

  • Practical guidelines are suggested for efficient network design using MAC and experimental observations using GPU/ARM.
  • ShuffleNet V2 is proposed according to the practical guidelines, which obtains high accuracy with also high speed as shown above.

This is a paper in 2018 ECCV with over 700 citations. (Sik-Ho Tsang @ Medium)


  1. FLOPs vs Time and Latency
  2. Four Practical Guidelines for Efficient Architecture
  3. ShuffleNet V2: an Efficient Architecture
  4. Experimental Results

1. FLOPs vs Time and Latency

Batches/sec versus MFLOPS using GPU/ARM
  • FLOPs is an indirect metric. It is an approximation of, but usually not equivalent to the direct metric that we really care about, such as speed or latency.
  • For example, MobileNetV2 is much faster than NASNet-A but they have comparable FLOPs.
  • And as shown above, networks with similar FLOPs have different speeds. Using FLOPs as the only metric could lead to sub-optimal designs.

1.1. Other Factors About Network Design

  • Besides FLOPs, one such factor is memory access cost (MAC). Such cost constitutes a large portion of runtime in certain operations like group convolution. It could be the bottleneck on certain platforms.
  • Another one is degree of parallelism. A model with high degree of parallelism could be much faster than another one with low degree of parallelism, under the same FLOPs.

1.2. Platform Dependent

  • Same FLOPs could have different running time, depending on the platform.
  • Tensor decomposition is even slower on GPU although it reduces FLOPs by 75%.
  • The latest CuDNN library is specially optimized for 3×3 conv. We cannot assume that 3×3 conv is 9 times slower than 1×1 conv.

Four guidelines are derived according to the above principles.

2. Four Practical Guidelines for Efficient Architecture

2.1. Guideline 1 (G1): Equal channel width minimizes memory access cost (MAC)

  • The modern networks usually adopt depthwise separable convolutions where the pointwise convolution (i.e., 1×1 convolution) accounts for most of the complexity.
  • Let c1 and c2 be the numbers of input and output channels, and h and w the spatial size of the feature map. The FLOPs of the 1×1 convolution is B = hwc1c2.
  • Thus, the memory access cost (MAC), or the number of memory access operations, is MAC = hw(c1+c2)+c1c2.
  • From the mean value inequality: MAC = hw(c1+c2) + c1c2 ≥ 2√(hwB) + B/hw.
  • Therefore, MAC has a lower bound given by FLOPs. It reaches the lower bound when the numbers of input and output channels are equal.
Validation experiment for Guideline 1.
  • A benchmark network is built by stacking 10 building blocks repeatedly. Each block contains two convolution layers. The first contains c1 input channels and c2 output channels, and the second the reverse (c2 input and c1 output channels).
  • The above reports the running speed by varying the ratio c1 : c2 while fixing the total FLOPs. It is clear that when c1 : c2 is approaching 1 : 1, the MAC becomes smaller and the network evaluation speed is faster.
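The lower bound above can be checked numerically. The following is a minimal sketch (the channel counts and 28×28 feature-map size are illustrative assumptions, not from the paper's benchmark) that holds the FLOPs B = hwc1c2 roughly fixed while skewing the c1 : c2 ratio, and shows that MAC is smallest at 1 : 1:

```python
import math

def mac_1x1(h, w, c1, c2):
    # Memory access cost of a 1x1 conv: read hw*c1 input values,
    # write hw*c2 output values, and access the c1*c2 kernel weights.
    return h * w * (c1 + c2) + c1 * c2

h = w = 28  # assumed feature-map size for illustration

for ratio in [1, 2, 4, 8]:
    # Scale c1 down and c2 up by sqrt(ratio) so c1*c2 (and hence FLOPs)
    # stays approximately constant while c1:c2 becomes 1:ratio.
    c1 = int(round(256 / math.sqrt(ratio)))
    c2 = int(round(256 * math.sqrt(ratio)))
    print(f"c1:c2 = 1:{ratio:<2d} -> MAC = {mac_1x1(h, w, c1, c2):,}")
```

The printed MAC grows monotonically as the ratio departs from 1 : 1, matching the trend in the validation table.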

2.2. Guideline 2 (G2): Excessive group convolution increases MAC

  • Group convolution reduces the computational complexity (FLOPs) by changing the dense convolution between all channels to be sparse.
  • However, the increased number of channels results in more MAC.
  • The relation between MAC and FLOPs for 1×1 group convolution is: MAC = hw(c1+c2) + c1c2/g = hwc1 + Bg/c1 + B/hw,
  • where g is the number of groups and B = hwc1c2/g is the FLOPs.
  • Given the fixed input shape c1×h×w and the computational cost B, MAC increases with the growth of g.
Validation experiment for Guideline 2.
  • The above table shows different group numbers while fixing the total FLOPs. Large group number decreases running speed significantly.
  • Using 8 groups is more than two times slower than using 1 group (standard dense convolution) on GPU and up to 30% slower on ARM. This is mostly due to increased MAC.

Authors suggest that the group number should be carefully chosen based on the target platform and task. It is unwise to use a large group number simply because this may enable using more channels, because the benefit of accuracy increase can easily be outweighed by the rapidly increasing computational cost.
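The MAC formula for 1×1 group convolution can be evaluated directly. This is a small sketch (the shape c1 = 128, h = w = 28 is an assumed example) that fixes the input shape and the FLOPs B, then varies the group number g:

```python
def mac_group_1x1(h, w, c1, g, B):
    # MAC of a 1x1 group conv at fixed FLOPs B = hw*c1*c2/g:
    # MAC = hw*c1 (input reads) + B*g/c1 (output writes, since c2 = Bg/(hw*c1))
    #       + B/(hw) (weight accesses, c1*c2/g).
    return h * w * c1 + B * g / c1 + B / (h * w)

h = w = 28   # assumed feature-map size
c1 = 128     # assumed input channels
B = h * w * c1 * 128  # FLOPs held constant across group counts

for g in [1, 2, 4, 8]:
    print(f"g = {g}: MAC = {mac_group_1x1(h, w, c1, g, B):,.0f}")
```

Only the middle term Bg/c1 depends on g, so at fixed FLOPs the MAC grows linearly with the group number, which is why large g hurts speed in the validation table.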

2.3. Guideline 3 (G3): Network fragmentation reduces degree of parallelism

(a) 1-fragment. (b) 2-fragment-series. (c) 4-fragment-series. (d) 2-fragment-parallel. (e) 4-fragment-parallel.
  • “Multi-path” structure is widely adopted in each network block. A lot of small operators (called “fragmented operators” here) are used instead of a few large ones.
  • For example, in NASNet-A, the number of fragmented operators (i.e. the number of individual convolution or pooling operations in one building block) is 13. In ResNet, this number is 2 or 3.
  • Though such fragmented structure has been shown beneficial for accuracy, it could decrease efficiency because it is unfriendly for devices with strong parallel computing powers like GPU. It also introduces extra overheads such as kernel launching and synchronization.
Validation experiment for Guideline 3.
  • The above table shows that the fragmentation reduces the speed significantly on GPU, e.g. 4-fragment structure is 3× slower than 1-fragment. On ARM, the speed reduction is relatively small.

2.4. Guideline 4 (G4): Element-wise operations are non-negligible

Run time decomposition
  • As shown in the above figure, element-wise operations, including ReLU, AddTensor and AddBias, occupy a considerable amount of time, especially on GPU.
  • They have small FLOPs but relatively heavy MAC. Specifically, depthwise convolution is also considered an element-wise operator as it also has a high MAC/FLOPs ratio.
Validation experiment for Guideline 4.
  • Consider the “bottleneck” unit (1×1 conv followed by 3×3 conv followed by 1×1 conv, with ReLU and shortcut connection) in ResNet.
  • Around 20% speedup is obtained on both GPU and ARM, after ReLU and shortcut are removed.

2.5. Practical Guideline Summary

Both pointwise group convolutions and bottleneck structures increase MAC (G1 and G2). This cost is non-negligible, especially for light-weight models.

Also, using too many groups violates G3.

The element-wise “Add” operation in the shortcut connection is also undesirable (G4).

  • For example, ShuffleNet V1 heavily depends on group convolutions (against G2) and bottleneck-like building blocks (against G1).
  • MobileNetV2 uses an inverted bottleneck structure that violates G1. It uses depthwise convolutions and ReLUs on “thick” feature maps. This violates G4.
  • The auto-generated structures, like NASNet-A and PNASNet, are highly fragmented and violate G3.


An efficient network architecture should 1) use “balanced” convolutions (equal channel width); 2) be aware of the cost of using group convolution; 3) reduce the degree of fragmentation; and 4) reduce element-wise operations.

3. ShuffleNet V2: an Efficient Architecture

Building blocks of ShuffleNet V1 and this work. (a): the basic ShuffleNet V1 unit; (b) the ShuffleNet V1 unit for spatial down sampling (2×); (c) the ShuffleNet V2 basic unit; (d) the ShuffleNet V2 unit for spatial down sampling (2×). DWConv: depthwise convolution. GConv: group convolution.
  • (a) & (b): They are basic unit and spatial downsampling unit in ShuffleNet V1. (If interested, please read ShuffleNet V1.)
  • (c) & (d): They are basic unit and spatial downsampling unit in ShuffleNet V2.
  • (c): A simple operator called channel split is used.
  • At the beginning of each unit, the input of c feature channels is split into two branches with c−c0 and c0 channels, respectively (c0 = c/2 for simplicity).
  • Following G3, one branch remains as identity. The other branch consists of three convolutions with the same input and output channels to satisfy G1.
  • The two 1×1 convolutions are no longer group-wise. This is partially to follow G2 because the split operation already produces two groups.
  • After convolution, the two branches are concatenated. So, the number of channels keeps the same (G1).
  • The same “channel shuffle” operation as in ShuffleNet V1 is then used to enable information communication between the two branches.
  • After the shuffling, the next unit begins. Note that the “Add” operation in ShuffleNet V1 no longer exists. Element-wise operations like ReLU and depth-wise convolutions exist only in one branch.
  • Also, the three successive element-wise operations, “Concat”, “Channel Shuffle” and “Channel Split”, are merged into a single element-wise operation. These changes are beneficial according to G4.
  • (d): For spatial downsampling unit, the channel split operator is removed. Thus, the number of output channels is doubled.
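The channel split, concat, and channel shuffle steps of the basic unit (c) can be sketched with plain numpy arrays. This is a minimal, framework-agnostic illustration (a real implementation would operate on PyTorch tensors and interleave the convolutions); the tiny 4-channel input is an assumed example:

```python
import numpy as np

def channel_split(x):
    # Split an NCHW tensor into two halves along the channel axis (c0 = c/2).
    c = x.shape[1]
    return x[:, : c // 2], x[:, c // 2 :]

def channel_shuffle(x, groups=2):
    # Reshape to (N, g, c/g, H, W), swap the group and per-group axes,
    # then flatten back: channels from the two branches are interleaved.
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

x = np.arange(4, dtype=np.float32).reshape(1, 4, 1, 1)
left, right = channel_split(x)               # identity branch / conv branch
out = np.concatenate([left, right], axis=1)  # concat keeps channel count (G1)
out = channel_shuffle(out, groups=2)         # mix information across branches
print(out.ravel())  # channels reordered: [0, 2, 1, 3]
```

Because split, concat, and shuffle are all pure memory reorderings, an optimized implementation can fuse them into a single element-wise pass, which is exactly the G4-friendly merge described above.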
Overall architecture of ShuffleNet V2, for four different levels of complexities.
  • One difference from ShuffleNet V1: An additional 1×1 convolution layer is added right before global average pooling to mix up features, which is absent in ShuffleNet V1.
  • The number of channels in each block is scaled to generate networks of different complexities, marked as 0.5×, 1×, 1.5×, and 2×.
  • Similar to DenseNet and CondenseNet: Half of feature channels directly go through the block and join the next block. This can be regarded as a kind of feature reuse.

4. Experimental Results

Classification Error on ImageNet Validation Set

4.1. Accuracy vs. FLOPs

4.2. Inference Speed vs. FLOPs/Accuracy

  • On GPU, for example, at 500MFLOPs ShuffleNet V2 is 58% faster than MobileNetV2, 63% faster than ShuffleNet V1 and 25% faster than Xception.
  • On ARM, the speeds of ShuffleNet V1, Xception and ShuffleNet V2 are comparable; however, MobileNetV2 is much slower, especially on smaller FLOPs. This is because MobileNetV2 has higher MAC (G1 and G4), which is significant on mobile devices.
  • IGCV2 and IGCV3 are slow. This is due to usage of too many convolution groups (G2).
  • Auto-generated models, NASNet & PNASNet, are slow, due to the usage of too many fragments (G3).

4.3. Compatibility with Other Methods

(a) ShuffleNet V2 with residual. (b) ShuffleNet V2 with SE. (c) ShuffleNet V2 with SE and residual.
  • When equipped with Squeeze-and-excitation (SE) module used in SENet, the classification accuracy of ShuffleNet V2 is improved by 0.5% at the cost of certain loss in speed, as shown in the above big table.

4.4. Generalization to Large Models

  • The basic ShuffleNet V2 unit can add a residual path used in ResNet and also SE module used in SENet.
  • A ShuffleNet V2 model of 164 layers, equipped with SE modules, is built. It obtains superior accuracy over previous state-of-the-art models like ResNet and SENet with much fewer FLOPs.
  • (The detailed architecture can be viewed in the Appendix of paper.)

4.5. Object Detection

  • COCO object detection task is used for comparison using the Light-Head R-CNN.
  • ShuffleNet V2 performs the best.
  • The receptive field of ShuffleNet V2 is enlarged by adding an additional 3×3 depthwise convolution before the first pointwise convolution in each building block. This variant is denoted as ShuffleNet V2*. With only a few additional FLOPs, it further improves accuracy.
  • Furthermore, ShuffleNet V2* has the best accuracy and is still faster than other methods.


