[Paper] MnasNet: Platform-Aware Neural Architecture Search for Mobile (Image Classification)

Outperforms AmoebaNet, PNASNet, NASNet, ShuffleNet V2, ShuffleNet V1, CondenseNet, MobileNetV1, ResNet, and SqueezeNext.

Sik-Ho Tsang
8 min readOct 24, 2020
Accuracy vs. Latency Comparison: MnasNet models significantly outperform other mobile models

In this paper, Platform-Aware Neural Architecture Search for Mobile (MnasNet), by Google Brain and Google Inc., is presented. In this paper:

  • Model latency is directly considered during neural architecture search: it is measured as real-world inference latency by executing the model on mobile phones.
  • To further strike the right balance between flexibility and search space size, a novel factorized hierarchical search space that encourages layer diversity is proposed.

This is a paper in 2019 CVPR with over 600 citations. (Sik-Ho Tsang @ Medium)


  1. Problem Formulation
  2. Mobile Neural Architecture Search
  3. Experimental Results

1. Problem Formulation

1.1. Consider Direct Real-World Inference Latency

An Overview of Platform-Aware Neural Architecture Search for Mobile
  • The aim is to find CNN models with both high accuracy and low inference latency.
  • Unlike many neural architecture search (NAS) algorithms that often optimize for indirect metrics, such as FLOPS, direct real-world inference latency is considered, by running CNN models on real mobile devices.

1.2. Hard Constraint vs Soft Constraint

  • Given a model m, let ACC(m) denote its accuracy on the target task, LAT(m) its inference latency on the target mobile platform, and T the target latency.
  • Hard Constraint: A common method is to treat T as a hard constraint and maximize accuracy under this constraint: maximize ACC(m) subject to LAT(m) ≤ T.
  • This approach only maximizes a single metric and does not provide multiple Pareto optimal solutions.
  • Soft Constraint: A customized weighted product method is used to approximate Pareto optimal solutions, with optimization goal: maximize ACC(m) × [LAT(m)/T]^w.
  • where the weight factor w is defined as: w = α if LAT(m) ≤ T, and w = β otherwise.
  • For instance, it is empirically observed that doubling the latency usually brings about a 5% relative accuracy gain. Given two models: (1) M1 has latency l and accuracy a; (2) M2 has latency 2l and 5% higher accuracy a(1+5%), they should have similar reward:
  • Reward(M2) = a(1+5%)(2l/T)^β ≈ Reward(M1) = a(l/T)^β.
  • Solving this gives β ≈ −0.07. Therefore, α = β = −0.07 are used in the experiments.
Top: Hard Constraint, Bottom: Soft Constraint
  • Top (Hard Constraint): It sharply penalizes the objective value to discourage models from violating latency constraints.
  • Bottom (Soft Constraint): It smoothly adjusts the objective value based on the measured latency.

Soft Constraint Approach is used in MnasNet.
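As a minimal sketch (the helper name and millisecond units are my own), the soft-constraint reward and the derivation of β can be written in a few lines of Python:

```python
import math

# Soft-constraint reward from the paper: R(m) = ACC(m) * (LAT(m)/T)^w,
# where w = alpha if LAT(m) <= T, else beta (alpha = beta = -0.07).
def reward(acc, lat_ms, target_ms=75.0, alpha=-0.07, beta=-0.07):
    w = alpha if lat_ms <= target_ms else beta
    return acc * (lat_ms / target_ms) ** w

# beta follows from "doubling latency ~ 5% relative accuracy gain":
# a * 1.05 * 2^beta = a  =>  beta = -log(1.05) / log(2) ≈ -0.07
beta = -math.log(1.05) / math.log(2)
print(round(beta, 3))  # -0.07
```

Note that at the target latency the factor (LAT/T)^w equals 1, so the reward reduces to plain accuracy there.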

2. Mobile Neural Architecture Search

2.1. Factorized Hierarchical Search Space

  • Previous SOTA approaches only search for a few complex cells and then repeatedly stack the same cells.
  • In MnasNet, a novel factorized hierarchical search space is used that factorizes a CNN model into unique blocks and then searches for the operations and connections per block separately, thus allowing different layer architectures in different blocks.

Thus, we can search for the best operations based on the input and output shapes to obtain better accuracy-latency trade-offs. For example, earlier stages of CNNs usually process larger amounts of data and thus have much higher impact on inference latency than later stages.

Factorized Hierarchical Search Space.
  • A CNN model is partitioned into a sequence of pre-defined blocks, gradually reducing input resolutions and increasing filter sizes as is common in many CNN models.
  • Each block has a list of identical layers, whose operations and connections are determined by a per-block sub search space.
  • A sub search space is:
  1. Convolutional ops ConvOp: regular conv (conv), depthwise conv (dconv), and mobile inverted bottleneck conv (mbconv, from MobileNetV2).
  2. Convolutional kernel size KernelSize: 3×3, 5×5.
  3. Squeeze-and-excitation, used in SENet, with ratio SERatio: 0, 0.25.
  4. Skip ops SkipOp: pooling, identity residual, or no skip.
  5. Output filter size Fi.
  6. Number of layers per block Ni.
  • For example, in the above figure, each layer of block 4 has an inverted bottleneck 5×5 convolution and an identity residual skip path, and the same layer is repeated N4 times.
  • All search options are discretized using MobileNetV2 as a reference: for the number of layers in each block, {0, +1, -1} relative to MobileNetV2 are searched; for the filter size per layer, relative sizes {0.75, 1.0, 1.25} with respect to MobileNetV2 are searched.
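A quick way to see the per-block sub search space is to enumerate it. The option names below are my own shorthand; with exactly the options listed above the product has 324 combinations, slightly fewer than the S = 432 the paper quotes for its typical case, so the paper's actual discretization contains a few more options:

```python
from itertools import product

# Per-block sub search space options from the list above (names are mine).
conv_ops    = ["conv", "dconv", "mbconv"]       # ConvOp
kernel_size = [3, 5]                            # KernelSize
se_ratio    = [0, 0.25]                         # SERatio
skip_ops    = ["pooling", "identity", "none"]   # SkipOp
filter_mult = [0.75, 1.0, 1.25]                 # Fi, relative to MobileNetV2
layer_delta = [-1, 0, +1]                       # Ni, relative to MobileNetV2

sub_space = list(product(conv_ops, kernel_size, se_ratio,
                         skip_ops, filter_mult, layer_delta))
print(len(sub_space))  # 324 with these exact options
```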

Since each block can have different operations, the factorized hierarchical search space has a distinct advantage of balancing layer diversity against the total search space size.

  • Suppose the network is partitioned into B blocks, and each block has a sub search space of size S with an average of N layers per block; then the total search space size is S^B, versus the flat per-layer search space of size S^(B×N).
  • A typical case is S = 432, B = 5, N = 3, where the factorized search space size of MnasNet is about 10¹³, versus the per-layer approach with search space size 10³⁹.
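The size comparison is easy to verify numerically:

```python
# Factorized per-block search space vs. flat per-layer search space,
# with the typical case S = 432, B = 5, N = 3 from the paper.
S, B, N = 432, 5, 3
factorized = S ** B         # one choice per block
flat = S ** (B * N)         # one choice per layer
print(f"{factorized:.1e}")  # ~1.5e+13
print(f"{flat:.1e}")        # ~3.4e+39
```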

2.2. Search Algorithm

An Overview of Platform-Aware Neural Architecture Search for Mobile
  • The search algorithm follows the reinforcement-learning NAS in NASNet, but uses the soft-constraint objective defined in Section 1 as the reward R(m).
  • At each step, the controller first samples a batch of models using its current controller parameters θ, by predicting a sequence of tokens based on the softmax logits from its RNN.
  • For each sampled model m, it is trained on the target task to get its accuracy ACC(m), and run on real phones to get its inference latency LAT(m). The reward value R(m) is then calculated using the above equation.
  • (if interested, please feel free to read NASNet.)
  • The architecture search is performed on the ImageNet training set but with fewer training steps (5 epochs).
  • It takes 4.5 days on 64 TPUv2 devices.
  • During training, the real-world latency of each sampled model is measured by running it on the single-thread big CPU core of Pixel 1 phones.
  • In total, the controller samples about 8K models during architecture search, but only 15 top-performing models are transferred to the full ImageNet and only 1 model is transferred to COCO.
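The sampling loop described above can be sketched as follows. The stub functions stand in for the RNN controller, the 5-epoch proxy training, and the on-device latency measurement; they are purely hypothetical placeholders, not the paper's code:

```python
import random

def sample_model(theta):
    """Stub for the RNN controller: emits a token sequence (hypothetical)."""
    return [random.randrange(6) for _ in range(10)]

def train_and_eval(model):
    """Stub for 5-epoch proxy training on ImageNet: returns ACC(m)."""
    return random.uniform(0.60, 0.75)

def measure_latency(model):
    """Stub for running m on a Pixel 1 big CPU core: returns LAT(m) in ms."""
    return random.uniform(40.0, 110.0)

def reward(acc, lat_ms, target_ms=75.0, alpha=-0.07, beta=-0.07):
    w = alpha if lat_ms <= target_ms else beta
    return acc * (lat_ms / target_ms) ** w

def search_step(theta, batch_size=8):
    """One controller step: sample a batch, score each model with R(m)."""
    models = [sample_model(theta) for _ in range(batch_size)]
    rewards = [reward(train_and_eval(m), measure_latency(m)) for m in models]
    # In the paper, theta is then updated with a policy-gradient step
    # (Proximal Policy Optimization) to maximize the expected reward.
    return rewards
```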

3. Experimental Results

3.1. ImageNet Classification Performance

Performance Results on ImageNet Classification
  • The target latency is set as T = 75ms for architecture search.
  • Three top-performing MnasNet models with different latency-accuracy trade-offs are picked from the same search experiment and compared with existing mobile models.
  • The MnasNet-A1 model achieves 75.2% top-1 / 92.5% top-5 accuracy with 78ms latency and 3.9M parameters / 312M multiply-adds, achieving a new SOTA accuracy for this typical mobile latency constraint.
  • MnasNet runs 1.8× faster than MobileNetV2 (1.4) on the same Pixel phone with 0.5% higher accuracy.
  • Compared with automatically searched CNN models, MnasNet runs 2.3× faster than the mobile-size NASNet-A with 1.2% higher top-1 accuracy.
  • The slightly larger MnasNet-A3 model achieves better accuracy than ResNet-50, but with 4.8× fewer parameters and 10× fewer multiply-add cost.
  • It also outperforms AmoebaNet, PNASNet, ShuffleNet V2, ShuffleNet V1, CondenseNet, MobileNetV1, and SqueezeNext.
MnasNet with/without SE
  • MnasNet without SE Module still significantly outperforms both MobileNetV2 and NASNet.

3.2. Model Scaling Performance

Performance Comparison with Different Model Scaling Techniques
  • MnasNet model consistently achieves better accuracy than MobileNetV2 for each depth multiplier.
  • Similarly, MnasNet model is also robust to input size changes and consistently outperforms MobileNetV2 (increasing accuracy by up to 4.1%) across all input image sizes from 96 to 224.
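For context on how a depth multiplier is typically applied: MobileNet-family reference code scales each layer's channel count and rounds it to a hardware-friendly multiple of 8. The helper below mirrors that common pattern (an assumption here; the article does not spell out the rounding):

```python
def make_divisible(v, divisor=8, min_value=None):
    """Scale-then-round a channel count to a multiple of `divisor`,
    as in common MobileNet-family reference code (assumed, not from
    the article)."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:  # avoid rounding down by more than 10%
        new_v += divisor
    return new_v

print(make_divisible(32 * 0.75))  # 24
print(make_divisible(32 * 1.25))  # 40
```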
Model Scaling vs. Model Search
  • To scale down the model, we can either scale down a baseline model, or search for new models specifically targeted to this latency constraint.
  • As shown above, the accuracy is further improved with a new architecture search targeting a 22ms latency constraint.

3.3. COCO Object Detection

Performance Results on COCO Object Detection
  • SSDLite, introduced in MobileNetV2, is used as the object detector.
  • MnasNet model achieves comparable mAP quality (23.0 vs 23.2) as SSD300 with 7.4× fewer parameters and 42× fewer multiply-adds.

3.4. Soft vs. Hard Latency Constraint

(a) Hard Constraint, (b) Soft Constraint
  • (a): The controller tends to focus more on faster models to avoid the latency penalty.
  • (b): It samples more models around the target latency value at 75ms, but also explores models with latency smaller than 40ms or greater than 110ms.

3.5. Disentangling Search Space and Reward

Comparison of Decoupled Search Space and Reward Design
  • Single-obj: i.e. Hard Constraint.
  • Multi-obj: i.e. Soft Constraint.
  • Cell-based: i.e. NAS in NASNet without the use of Factorized Hierarchical Search Space.
  • With only Multi-obj, lower latency is achieved but top-1 accuracy is also reduced.
  • With both Multi-obj and the MnasNet search space, top-1 accuracy is improved with lower latency.

3.6. MnasNet Architecture and Layer Diversity

MnasNet-A1 Architecture
  • As shown above, MnasNet-A1 model consists of a variety of layer architectures throughout the network.
  • MnasNet uses both 3×3 and 5×5 convolutions, which is different from previous mobile models that only use 3×3 convolutions.
Performance Comparison of MnasNet and Its Variants
  • Variants of MnasNet that repeat only a single type of layer are also tried.
  • As shown above, the original MnasNet model has much better accuracy-latency trade-offs than those variants.


