# [Paper] MnasNet: Platform-Aware Neural Architecture Search for Mobile (Image Classification)

## Outperforms AmoebaNet, PNASNet, NASNet, ShuffleNet V2, ShuffleNet V1, CondenseNet, MobileNetV1, ResNet, and SqueezeNext

--

In this paper, **Platform-Aware Neural Architecture Search for Mobile (MnasNet)**, by Google Brain, and Google Inc., is presented. In this paper:

- **Model latency is also considered** during neural architecture search, where the model latency is measured as real-world inference latency by executing the model on mobile phones.
- To further strike the right balance between flexibility and search space size, **a novel factorized hierarchical search space that encourages layer diversity is proposed.**

This is a paper in **2019 CVPR** with over **600 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Problem Formulation**
2. **Mobile Neural Architecture Search**
3. **Experimental Results**

# 1. Problem Formulation

## 1.1. Consider Direct Real-World Inference Latency

- The aim is to find CNN models with both high accuracy and low inference latency.
- Unlike many neural architecture search (NAS) algorithms that often optimize for indirect metrics, such as FLOPS, **direct real-world inference latency is considered, by running CNN models on real mobile devices.**

## 1.2. Hard Constraint vs Soft Constraint

- Given a model *m*, let *ACC*(*m*) denote its accuracy on the target task, *LAT*(*m*) denote the inference latency on the target mobile platform, and *T* be the target latency.
- **Hard Constraint:** A common method is to treat *T* as a **hard constraint** and maximize accuracy under this constraint:
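
In the notation just defined, the hard-constraint objective can be written as follows (a reconstruction from the sentence above, not a copy of the original figure):

$$
\underset{m}{\text{maximize}} \quad ACC(m) \qquad \text{subject to} \quad LAT(m) \le T
$$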

- This approach **only maximizes a single metric.**
- **Soft Constraint:** A customized weighted product method is used to **approximate Pareto optimal solutions, with optimization goal**:
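
Written out with the same symbols, the weighted-product objective takes the form below. This is a sketch consistent with the reward expression used in the worked example of this section, where the exponent *w* is the weight factor:

$$
\underset{m}{\text{maximize}} \quad ACC(m) \times \left[ \frac{LAT(m)}{T} \right]^{w}
$$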

- where *w* is the weight factor defined as:
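
A piecewise definition consistent with the *α* and *β* used in this section is sketched below, with *α* applying when the latency target is met and *β* when it is exceeded:

$$
w =
\begin{cases}
\alpha, & \text{if } LAT(m) \le T \\
\beta, & \text{otherwise}
\end{cases}
$$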

- For instance, it is empirically observed that doubling the latency usually brings about a 5% relative accuracy gain. Given two models: (1) M1 has latency *l* and accuracy *a*; (2) M2 has latency 2*l* and 5% higher accuracy *a*(1+5%), they should have similar reward:
- Reward(M2) = *a*(1+5%)(2*l*/*T*)^*β* ≈ Reward(M1) = *a*(*l*/*T*)^*β*.
- Solving this gives *β* ≈ -0.07. Therefore, *α* = *β* = -0.07 are used in the experiments.
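
The arithmetic behind *β* ≈ -0.07 can be checked with a few lines of Python. This is a minimal sketch, not the authors' code: the function names `solve_beta` and `reward` are hypothetical, and the reward simply follows the weighted-product form described in this section. Setting the two rewards equal gives (1+5%)·2^*β* = 1, so *β* = -ln(1.05)/ln(2).

```python
import math


def solve_beta(rel_acc_gain=0.05, latency_ratio=2.0):
    """Solve (1 + gain) * ratio**beta == 1 for beta."""
    return -math.log(1.0 + rel_acc_gain) / math.log(latency_ratio)


def reward(acc, lat, target, alpha=-0.07, beta=-0.07):
    """Weighted-product reward: acc * (lat / target) ** w.

    w is alpha when the latency target is met and beta otherwise.
    """
    w = alpha if lat <= target else beta
    return acc * (lat / target) ** w


if __name__ == "__main__":
    beta = solve_beta()
    print(f"beta = {beta:.4f}")
    # M1: latency l, accuracy a; M2: latency 2l, accuracy 1.05a (made-up l and a).
    r1 = reward(0.70, 50.0, 75.0, alpha=beta, beta=beta)
    r2 = reward(0.70 * 1.05, 100.0, 75.0, alpha=beta, beta=beta)
    print(f"Reward(M1) = {r1:.6f}, Reward(M2) = {r2:.6f}")
```

Calling `reward(..., alpha=0, beta=-1)` instead gives the sharp penalty of the hard-constraint setting: the reward equals the accuracy when the target is met and falls off in proportion to *T*/*LAT*(*m*) once it is exceeded.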

- **Top (Hard Constraint)**: It sharply penalizes the objective value to discourage models from violating latency constraints.
- **Bottom (Soft Constraint)**: It smoothly adjusts the objective value based on the measured latency.

The soft-constraint approach is used in MnasNet.

# 2. Mobile Neural Architecture Search

## 2.1. Factorized Hierarchical Search Space

- Previous SOTA approaches only search for a few complex cells and then repeatedly stack the same cells.
- In MnasNet, a novel factorized hierarchical search space is used that **factorizes a CNN model into unique blocks and then searches for the operations and connections per block separately, thus allowing different layer architectures in different blocks.**

Thus, we can search for the best operations based on the input and output shapes to obtain better accuracy-latency trade-offs.

For example, earlier stages of CNNs usually process larger amounts of data and thus have much higher impact on inference latency than later stages.

- **A CNN model is partitioned into a sequence of pre-defined blocks**, **gradually reducing input resolutions** and **increasing filter sizes**, as is common in many CNN models.
- Each block has a list of identical layers, whose **operations and connections are determined by a per-block sub search space**.
- A sub search space is:

- Convolutional ops *ConvOp*: regular conv (conv), depthwise conv (dconv), and mobile inverted bottleneck conv (MobileNetV2).
- Convolutional kernel size *KernelSize*: 3×3, 5×5.
- Squeeze-and-excitation, used in SENet, with ratio *SERatio*: 0, 0.25.
- Skip ops *SkipOp*: pooling, identity residual, or no skip.
- Output filter size *Fi*.
- Number of layers per block *Ni*.

- For example, in the above figure, each layer of block 4 has an inverted bottleneck 5×5 convolution and an identity residual skip path, and the same layer is repeated *N*4 times.
- All search options are discretized using MobileNetV2 as a reference: for the number of layers in each block, {0, +1, -1} are searched relative to MobileNetV2. For the filter size per layer, relative sizes in {0.75, 1.0, 1.25} with respect to MobileNetV2 are searched.
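
To make the per-block sub search space concrete, the options listed above can be written down as plain data and enumerated. This is only an illustrative sketch: the names `BlockConfig`, `enumerate_block_configs`, and `resolve`, the string labels for each op, and the reference block (40 filters, 2 layers) are all hypothetical. The count it produces covers only the options listed in this post and is not meant to reproduce the sub search space size quoted in the paper.

```python
from dataclasses import dataclass
from itertools import product

# Options as listed in this post; the string labels are made up for illustration.
CONV_OPS = ("conv", "dconv", "mbconv")
KERNEL_SIZES = (3, 5)
SE_RATIOS = (0.0, 0.25)
SKIP_OPS = ("pool", "identity", "none")
FILTER_SCALES = (0.75, 1.0, 1.25)  # relative to the MobileNetV2 reference
LAYER_DELTAS = (0, 1, -1)          # relative to the MobileNetV2 reference


@dataclass(frozen=True)
class BlockConfig:
    conv_op: str
    kernel_size: int
    se_ratio: float
    skip_op: str
    filter_scale: float
    layer_delta: int


def enumerate_block_configs():
    """All combinations of the listed options for a single block."""
    combos = product(CONV_OPS, KERNEL_SIZES, SE_RATIOS,
                     SKIP_OPS, FILTER_SCALES, LAYER_DELTAS)
    return [BlockConfig(*c) for c in combos]


def resolve(cfg, ref_filters, ref_layers):
    """Turn the relative choices into absolute filters/layers for one block."""
    return {
        "filters": int(round(ref_filters * cfg.filter_scale)),
        "layers": max(ref_layers + cfg.layer_delta, 0),
    }


if __name__ == "__main__":
    configs = enumerate_block_configs()
    print(len(configs), "configurations per block from the listed options")
    example = BlockConfig("mbconv", 5, 0.0, "identity", 1.25, 1)
    print(resolve(example, ref_filters=40, ref_layers=2))
```

Because every block draws its own `BlockConfig`, two blocks in the same network can end up with different kernel sizes, skip paths, or depths, which is the layer diversity the factorized space is designed to allow.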

Since each block has different operations, the factorized hierarchical search space has a distinct advantage of balancing the diversity of layers and the size of the total search space.

- Suppose the network is partitioned into *B* blocks, and each block has a sub search space of size *S* with an average of *N* layers per block. Then the total search space size would be *S*^*B*, versus the flat per-layer search space with size *S*^(*B*×*N*).
- **A typical case is** *S* = 432, *B* = 5, *N* = 3, where **the factorized search space size by MnasNet is about 10¹³**, **versus the per-layer approach with search space size 10³⁹.**
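
The two orders of magnitude follow directly from the formulas *S*^*B* and *S*^(*B*×*N*). The short check below is a sketch with a hypothetical helper name, `search_space_sizes`, and uses Python's exact integer arithmetic:

```python
import math


def search_space_sizes(S, B, N):
    """Return (factorized, flat per-layer) search space sizes."""
    factorized = S ** B       # one choice per block
    flat = S ** (B * N)       # one choice per layer
    return factorized, flat


if __name__ == "__main__":
    factorized, flat = search_space_sizes(S=432, B=5, N=3)
    print(f"factorized: {factorized:.2e} (order 10^{math.floor(math.log10(factorized))})")
    print(f"per-layer:  {flat:.2e} (order 10^{math.floor(math.log10(flat))})")
```

The gap between the two is a factor of *S*^(*B*×(*N*-1)), which is why sharing one configuration across the layers of a block shrinks the space so sharply.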

## 2.2. Search Algorithm

- **The search algorithm follows the NAS in NASNet but uses the reward** *R*(*m*), where *R*(*m*) is the objective value defined by the soft-constraint equation in Section 1. The controller is trained to maximize the expected reward:
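
As a sketch in the paper's notation, the expected-reward objective over the controller parameters *θ* can be written as below. Here *a* denotes the sequence of tokens sampled by the controller, and the subscript *T* is assumed to be the number of tokens, not the target latency:

$$
J = \mathbb{E}_{P(a_{1:T};\,\theta)}\left[ R(m) \right]
$$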

- At each step, **the controller first samples a batch of models using its current parameters *θ***, by predicting a sequence of tokens based on the softmax logits from its RNN.
- **For each sampled model *m***, train it on the target task to get its accuracy ACC(*m*), and run it on real phones to get its **inference latency LAT(*m*)**. The reward value *R*(*m*) is then calculated using the soft-constraint objective from Section 1.
- (If interested, please feel free to read NASNet.)
- The architecture search is performed on the ImageNet training set but with fewer training steps (5 epochs). **It takes 4.5 days on 64 TPUv2 devices.**
- During training, **the real-world latency** of each sampled model is measured by running it on the **single-thread big CPU core of Pixel 1 phones**.
- In total, **the controller samples about 8K models during architecture search, but only 15 top-performing models are transferred to the full ImageNet** and **only 1 model is transferred to COCO.**
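
The sample, evaluate, and update cycle described above can be illustrated with a toy loop. This is a heavily simplified sketch and not the paper's method: it replaces the RNN controller with independent softmax logits per decision, uses a plain REINFORCE update with a mean baseline, and substitutes made-up `fake_accuracy` and `fake_latency_ms` functions for real training and on-device measurement. All names and numbers in it are hypothetical.

```python
import math
import random

# Hypothetical two-decision search space (illustration only).
CHOICES = {"kernel_size": [3, 5], "se_ratio": [0.0, 0.25]}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fake_accuracy(model):
    # Stand-in for "train briefly and evaluate"; made-up numbers.
    return 0.70 + 0.02 * (model["kernel_size"] == 5) + 0.01 * (model["se_ratio"] > 0)


def fake_latency_ms(model):
    # Stand-in for measuring latency on a phone; made-up numbers.
    return 60.0 + 20.0 * (model["kernel_size"] == 5) + 5.0 * (model["se_ratio"] > 0)


def reward(acc, lat, target=75.0, w=-0.07):
    # Soft-constraint reward with alpha = beta = w.
    return acc * (lat / target) ** w


def search(steps=200, batch=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = {k: [0.0] * len(v) for k, v in CHOICES.items()}
    for _ in range(steps):
        probs = {k: softmax(v) for k, v in logits.items()}
        samples = []
        for _ in range(batch):
            # Sample one option index per decision from the current policy.
            idx = {k: rng.choices(range(len(p)), weights=p)[0] for k, p in probs.items()}
            model = {k: CHOICES[k][i] for k, i in idx.items()}
            samples.append((idx, reward(fake_accuracy(model), fake_latency_ms(model))))
        baseline = sum(r for _, r in samples) / batch
        for idx, r in samples:
            for k, chosen in idx.items():
                for j, p in enumerate(probs[k]):
                    # Gradient of log softmax w.r.t. logit j is 1[j == chosen] - p_j.
                    grad = (1.0 if j == chosen else 0.0) - p
                    logits[k][j] += lr * (r - baseline) * grad / batch
    return {k: softmax(v) for k, v in logits.items()}


if __name__ == "__main__":
    for name, p in search().items():
        print(name, [round(x, 3) for x in p])
```

In the real search, each reward evaluation involves training a sampled model and timing it on a phone, which is why only about 8K models are sampled in total.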

# 3. Experimental Results

## 3.1. ImageNet Classification Performance

- **The target latency is set as *T* = 75ms** for architecture search.
- **Three top-performing MnasNet models are picked**, with different latency-accuracy trade-offs from the same search experiment, and compared with existing mobile models.
- **MnasNet-A1 model achieves 75.2% top-1 / 92.5% top-5 accuracy with 78ms latency and 3.9M parameters / 312M multiply-adds**, achieving a new SOTA accuracy for this typical mobile latency constraint.
- **MnasNet runs 1.8× faster than MobileNetV2 (1.4)** on the same Pixel phone **with 0.5% higher accuracy.**
- Compared with automatically searched CNN models, **MnasNet runs 2.3× faster than the mobile-size NASNet-A with 1.2% higher top-1 accuracy.**
- The slightly larger **MnasNet-A3 model achieves better accuracy than ResNet-50, but with 4.8× fewer parameters and 10× fewer multiply-add cost.**
- It also outperforms AmoebaNet, PNASNet, ShuffleNet V2, ShuffleNet V1, CondenseNet, MobileNetV1, and SqueezeNext.

**MnasNet without SE Module still significantly outperforms both MobileNetV2 and NASNet.**

## 3.2. Model Scaling Performance

- MnasNet model consistently achieves better accuracy than MobileNetV2 for each depth multiplier.
- Similarly, MnasNet model is also robust to input size changes and consistently outperforms MobileNetV2 (increasing accuracy by up to 4.1%) across all input image sizes from 96 to 224.

- To scale down the model, we can either scale down a baseline model, or search for new models specifically targeted to this latency constraint.
- As shown above, the accuracy is further improved with a new architecture search targeting a 22ms latency constraint.

## 3.3. COCO Object Detection

- SSDLite, in MobileNetV2, is used as the object detector.
- **MnasNet model achieves comparable mAP quality (23.0 vs 23.2) to SSD300 with 7.4× fewer parameters and 42× fewer multiply-adds.**

## 3.4. Soft vs. Hard Latency Constraint

- **(a)**: The controller tends to focus more on faster models to avoid the latency penalty.
- **(b)**: It samples more models around the target latency value at 75ms, but also explores models with latency smaller than 40ms or greater than 110ms.

## 3.5. Disentangling Search Space and Reward

- **Single-obj**: i.e. Hard Constraint.
- **Multi-obj**: i.e. Soft Constraint.
- **Cell-based**: i.e. NAS in NASNet without the use of Factorized Hierarchical Search Space.
- With only **Multi-obj**, lower latency is achieved but top-1 accuracy is also reduced.
- With both **Multi-obj and MnasNet**, top-1 accuracy is improved with lower latency.

## 3.6. MnasNet Architecture and Layer Diversity

- As shown above, **MnasNet-A1 model consists of a variety of layer architectures throughout the network.**
- MnasNet uses both 3×3 and 5×5 convolutions, which is different from previous mobile models that all only use 3×3 convolutions.

- Variants of MnasNet that only repeat a single type of layer are also tried.
- As shown above, the original MnasNet model has much better accuracy-latency trade-offs than those variants.

## Reference

[2019 CVPR] [MnasNet]

MnasNet: Platform-Aware Neural Architecture Search for Mobile

## Image Classification

[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [Cutout] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [Deep Roots] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [ENAS] [PNASNet] [ShuffleNet V2] [BAM] [CBAM] [AmoebaNet] [ESPNetv2] [MnasNet]