[Paper] MobileNetV3: Searching for MobileNetV3 (Image Classification)

In this story, Searching for MobileNetV3, by Google AI and Google Brain, is presented. In this paper:

  • MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS), i.e. MnasNet, complemented by the NetAdapt algorithm.
  • It is then further improved through novel architecture advances.

This is a paper in 2019 ICCV with over 400 citations. (Sik-Ho Tsang @ Medium)


  1. MobileNetV3 Module
  2. Searching for MobileNetV3
  3. Experimental Results

1. MobileNetV3 Module

MobileNetV2 Module
MobileNetV3 Module

1.1. SE Module

  • Compared with MobileNetV2, MobileNetV3 inserts the Squeeze-and-Excitation (SE) module, which originated in SENet.
  • Hard-sigmoid is used to replace sigmoid in SE module for efficient computation.
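To make the SE module concrete, here is a minimal NumPy sketch of a squeeze-and-excitation block that uses hard-sigmoid as its gate. It is an illustration only: the function names, the reduction ratio of 4, the random weights, and the omission of biases are assumptions of this sketch, not details taken from the paper's code.

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear approximation of sigmoid: ReLU6(x + 3) / 6."""
    return np.clip(x + 3.0, 0.0, 6.0) / 6.0

def squeeze_excite(features, w_reduce, w_expand):
    """Apply a simplified SE block to a (channels, height, width) feature map.

    w_reduce: (channels // r, channels) weights of the squeeze FC layer.
    w_expand: (channels, channels // r) weights of the excitation FC layer.
    Biases are omitted for brevity (an assumption of this sketch).
    """
    squeezed = features.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)  # FC + ReLU -> (C // r,)
    gates = hard_sigmoid(w_expand @ hidden)        # FC + hard-sigmoid -> (C,)
    return features * gates[:, None, None]         # rescale each channel

# Hypothetical example: 8 channels, reduction ratio r = 4, random weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))
y = squeeze_excite(x, w1, w2)
print(y.shape)
```

Because every gate lies in [0, 1], the block can only attenuate a channel, never amplify it.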

1.2. Nonlinearity

  • In the literature, Swish is used as a drop-in replacement for ReLU to improve accuracy.
  • However, it is not efficient on mobile hardware, because it requires computing a sigmoid.
  • Hard-Swish (H-Swish) is used instead: h-swish(x) = x · ReLU6(x + 3) / 6.
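As a quick illustration, the sketch below compares Swish, x · sigmoid(x), with the hard version, x · ReLU6(x + 3) / 6, on a few sample inputs. The function names and the sample values are chosen here for illustration only.

```python
import math

def relu6(x):
    """ReLU capped at 6."""
    return min(max(x, 0.0), 6.0)

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def hard_swish(x):
    """H-Swish: x * ReLU6(x + 3) / 6, a piecewise approximation of Swish."""
    return x * relu6(x + 3.0) / 6.0

# Print both activations side by side for a few sample inputs.
for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={v:+.1f}  swish={swish(v):+.4f}  h-swish={hard_swish(v):+.4f}")
```

The hard version needs only an addition, a clamp, and a multiplication, which is what makes it cheap on mobile hardware. Outside the interval [-3, 3] it is exactly 0 or exactly x.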

2. Searching for MobileNetV3

2.1. MnasNet

  • MnasNet is used as hardware-aware network architecture search (NAS).
  • (If interested, please feel free to read MnasNet.)

2.2. NetAdapt

  • After searching the network using MnasNet, NetAdapt is used to simplify the model.
  1. Generate a set of new proposals. Each proposal represents a modification of an architecture that generates reduction in latency compared to the previous step.
  2. Fine-tune each proposal for T steps to get a coarse estimate of the accuracy.
  3. Select the best proposal according to some metric.
  • Iterate the above steps until the target latency is reached.
  • (If interested, please feel free to read NetAdapt.)
  • One modification to NetAdapt is made: the selected proposal is the one that maximizes ΔAcc/|Δlatency|, with Δlatency satisfying the constraint in step 1.
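The selection rule can be sketched as follows. This is a minimal illustration, not the NetAdapt implementation: the `Proposal` class, the `select_proposal` function, the `min_reduction_ms` threshold, and all the numbers are hypothetical, and the accuracy and latency of each proposal are assumed to have been measured already (steps 1 and 2).

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    name: str
    accuracy: float    # coarse accuracy after short fine-tuning (step 2)
    latency_ms: float  # measured or estimated latency of the proposal

def select_proposal(current_acc, current_latency_ms, proposals, min_reduction_ms):
    """Pick the proposal that maximizes delta_acc / |delta_latency|.

    Only proposals that cut latency by at least min_reduction_ms qualify,
    which stands in for the latency constraint of step 1.
    """
    best, best_ratio = None, float("-inf")
    for p in proposals:
        delta_latency = p.latency_ms - current_latency_ms
        if -delta_latency < min_reduction_ms:
            continue  # does not reduce latency enough
        ratio = (p.accuracy - current_acc) / abs(delta_latency)
        if ratio > best_ratio:
            best, best_ratio = p, ratio
    return best

# Hypothetical numbers: current model at 75.0% accuracy and 60 ms.
candidates = [
    Proposal("A", accuracy=74.8, latency_ms=59.0),  # ratio -0.2 / 1.0
    Proposal("B", accuracy=74.5, latency_ms=57.0),  # ratio -0.5 / 3.0
    Proposal("C", accuracy=74.9, latency_ms=59.8),  # reduction too small
]
chosen = select_proposal(75.0, 60.0, candidates, min_reduction_ms=0.5)
print(chosen.name)
```

In this example proposal B is chosen: it loses more accuracy in absolute terms than A, but less accuracy per millisecond saved.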

2.3. Redesigning Expensive Layers

Comparison of original last stage (top) and efficient last stage (bottom)
  • Some of the last layers as well as some of the earlier layers are more expensive than others.
  • In the original network, a 1×1 convolution is used as the final layer to expand from 320 dimensions to a higher-dimensional feature space (1280 dimensions). This layer is critically important for having rich features for prediction. However, it comes at the cost of extra latency.
  • The first modification is to move this layer past the final average pooling. This final set of features is now computed at 1×1 spatial resolution instead of 7×7 spatial resolution.
  • Once the cost of this feature generation layer has been mitigated, the previous bottleneck projection layer is no longer needed to reduce computation.
  • This efficient last stage reduces the latency by 7 milliseconds, which is 11% of the running time, and reduces the number of operations by 30 million MAdds with almost no loss of accuracy.
  • Another expensive layer is the initial set of filters. Originally, models tend to use 32 filters in a full 3×3 convolution to build initial filter banks for edge detection.
  • Reducing the number of filters and using different nonlinearities can reduce this redundancy. By using H-Swish, the number of filters is reduced to 16 while maintaining the same accuracy as 32 filters using either ReLU or Swish. This saves an additional 2 milliseconds and 10 million MAdds.
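To see why moving the 1×1 expansion layer past the average pooling helps, the sketch below counts multiply-adds for a convolution at 7×7 versus 1×1 spatial resolution. The helper `conv_madds` is written for this illustration, and the 320 → 1280 channel widths are the ones quoted above. It isolates the effect of spatial resolution only and does not attempt to reproduce the 30 million MAdds figure, which also reflects the removal of other layers.

```python
def conv_madds(out_h, out_w, kernel, in_ch, out_ch):
    """Multiply-adds of a standard convolution (bias ignored)."""
    return out_h * out_w * kernel * kernel * in_ch * out_ch

# The same 1x1 expansion (320 -> 1280 channels) at two spatial resolutions.
before = conv_madds(7, 7, 1, 320, 1280)  # applied before average pooling
after = conv_madds(1, 1, 1, 320, 1280)   # applied after average pooling
print(before, after, before // after)
```

The cost of a 1×1 convolution scales linearly with the number of spatial positions, so computing it on the pooled 1×1 feature map is 49 times cheaper than on the 7×7 map.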

2.4. MobileNetV3-Large and MobileNetV3-Small

  • SE denotes whether there is a Squeeze-And-Excite in that block. NL denotes the type of nonlinearity used. HS denotes h-swish and RE denotes ReLU. NBN denotes no batch normalization. s denotes stride.
  • MobileNetV3-Large and MobileNetV3-Small are targeted at high and low resource use cases respectively, as shown above.
  • The models are created through applying platform-aware NAS (MnasNet) and NetAdapt for network search and incorporating the network improvements.

3. Experimental Results

3.1. MobileNetV3 Development

MobileNetV3 Development
  • Improvements are shown by adding each component, as shown above.

3.2. ImageNet

Performance on ImageNet
Floating point performance
Quantized performance
  • In floating point performance, quantized performance, and the detailed comparison against different versions of MobileNetV2, MobileNetV3 achieves a better tradeoff between accuracy and latency.

3.3. Object Detection

Object detection results of SSDLite with different backbones on COCO test set
  • MobileNetV3 is used as a drop-in replacement for the backbone feature extractor in SSDLite.
  • With the channel reduction, MobileNetV3-Large is 27% faster than MobileNetV2 with near identical mAP.
  • MobileNetV3-Small with channel reduction is also 2.4 and 0.5 mAP higher than MobileNetV2 and MnasNet while being 35% faster.

3.4. Semantic Segmentation

Building on MobileNetV3, the proposed segmentation head, Lite R-ASPP
  • R-ASPP is a reduced design of the Atrous Spatial Pyramid Pooling module, which adopts only two branches consisting of a 1×1 convolution and a global-average pooling operation.
  • Lite R-ASPP, improving over R-ASPP, deploys the global-average pooling in a fashion similar to the Squeeze-and-Excitation module.
  • Atrous convolution is applied to the last block of MobileNetV3 to extract denser features, and further a skip connection is added.
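The SE-like use of global average pooling in the segmentation head can be sketched roughly as below. This is a heavily simplified illustration under stated assumptions: batch normalization, upsampling, the skip connection, and the final classifier are all omitted, and the function names, shapes, and random weights are hypothetical.

```python
import numpy as np

def conv1x1(features, weights):
    """1x1 convolution: features (C_in, H, W), weights (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, features)

def lite_raspp_head(features, w_feat, w_gate):
    """Simplified two-branch head in the spirit of Lite R-ASPP.

    Branch 1: 1x1 convolution followed by ReLU on the full feature map.
    Branch 2: global average pooling, 1x1 convolution, sigmoid gate.
    The gate rescales branch 1 channel-wise, as an SE module would.
    """
    branch = np.maximum(conv1x1(features, w_feat), 0.0)      # (C_out, H, W)
    pooled = features.mean(axis=(1, 2), keepdims=True)       # (C_in, 1, 1)
    gate = 1.0 / (1.0 + np.exp(-conv1x1(pooled, w_gate)))    # (C_out, 1, 1)
    return branch * gate

# Hypothetical shapes: 6 input channels, 4 output channels, 5x5 feature map.
rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 5, 5))
w_feat = rng.standard_normal((4, 6))
w_gate = rng.standard_normal((4, 6))
out = lite_raspp_head(feats, w_feat, w_gate)
print(out.shape)
```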
Semantic segmentation results on Cityscapes val set
  • RF2: Reduce the Filters in the last block by a factor of 2.
  • SH: Segmentation Head.
  • MobileNetV3-Small is significantly better than MobileNetV2-0.35 while yielding similar speed.
Semantic segmentation results on Cityscapes test set
  • MobileNetV3 outperforms ESPNetv2, C3 (CCC2), and ESPNetv1 by 6.4%, 10.6%, and 12.3%, respectively, while being faster in terms of MAdds.


