[Paper] MnasNet: Platform-Aware Neural Architecture Search for Mobile (Image Classification)

Accuracy vs. Latency Comparison: MnasNet models significantly outperform other mobile models
  • To further strike the right balance between flexibility and search space size, a novel factorized hierarchical search space that encourages layer diversity is proposed.


  1. Problem Formulation
  2. Mobile Neural Architecture Search
  3. Experimental Results

1. Problem Formulation

1.1. Consider Direct Real-World Inference Latency

An Overview of Platform-Aware Neural Architecture Search for Mobile
  • Unlike many neural architecture search (NAS) algorithms that often optimize for indirect metrics such as FLOPs, direct real-world inference latency is considered here, by running CNN models on real mobile devices.
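To make "direct latency" concrete, here is a rough wall-clock timing sketch in Python; this is not the paper's measurement pipeline (which runs the model on the single-thread big CPU core of a Pixel phone), and `model` / `example_input` are hypothetical placeholders:

    import time

    def measure_latency_ms(model, example_input, warmup=10, runs=50):
        """Average wall-clock inference latency in milliseconds."""
        for _ in range(warmup):        # warm-up runs to stabilize caches
            model(example_input)
        start = time.perf_counter()
        for _ in range(runs):
            model(example_input)
        return (time.perf_counter() - start) / runs * 1000.0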

1.2. Hard Constraint vs Soft Constraint

  • Given a model m, let ACC(m) denote its accuracy on the target task, LAT(m) its inference latency on the target mobile platform, and T the target latency.
  • Hard Constraint: A common method is to treat T as a hard constraint and maximize accuracy under it: maximize ACC(m) subject to LAT(m) ≤ T.
  • Soft Constraint: A customized weighted product method is used to approximate Pareto-optimal solutions, with optimization goal: maximize ACC(m) × [LAT(m)/T]^w, where w = α if LAT(m) ≤ T, and w = β otherwise.
  • Empirically, doubling the latency usually brings about a 5% relative accuracy gain, so a model M1 with latency l and accuracy a and a model M2 with latency 2l and accuracy a·(1+5%) should receive a similar reward: Reward(M2) = a·(1+5%)·(2l/T)^β ≈ Reward(M1) = a·(l/T)^β.
  • Solving this gives β ≈ −0.07. Therefore, α = β = −0.07 are used in the experiments (a small numeric sketch of the reward follows below).
Top: Hard Constraint, Bottom: Soft Constraint
  • Bottom (Soft Constraint): It smoothly adjusts the objective value based on the measured latency.
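As a minimal sketch of the soft-constraint reward above, in Python (assuming accuracy and latency are already measured; the target latency T = 75ms matches the value used later in the experiments):

    def reward(acc, lat_ms, target_ms=75.0, alpha=-0.07, beta=-0.07):
        """Weighted-product reward: ACC(m) * (LAT(m) / T)^w,
        where w = alpha if LAT(m) <= T, else beta."""
        w = alpha if lat_ms <= target_ms else beta
        return acc * (lat_ms / target_ms) ** w

    # Doubling the latency is roughly offset by a ~5% relative accuracy gain:
    # reward(0.75 * 1.05, 150) ≈ reward(0.75, 75) = 0.75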

2. Mobile Neural Architecture Search

2.1. Factorized Hierarchical Search Space

  • Previous SOTA approaches search for only a few complex cells and then repeatedly stack the same cells.
  • In MnasNet, a novel factorized hierarchical search space is used that factorizes a CNN model into unique blocks and then searches for the operations and connections per block separately, thus allowing different layer architectures in different blocks.
Factorized Hierarchical Search Space.
  • Each block has a list of identical layers, whose operations and connections are determined by a per-block sub search space.
  • A sub search space consists of:
  1. Convolutional kernel size KernelSize: 3×3, 5×5.
  2. Squeeze-and-excitation, used in SENet, with ratio SERatio: 0, 0.25.
  3. Skip ops SkipOp: pooling, identity residual, or no skip.
  4. Output filter size Fi.
  5. Number of layers per block Ni.
  • All search options are discretized using MobileNetV2 as a reference: the number of layers in each block is searched over {0, +1, −1} relative to MobileNetV2, and the filter size per layer is searched over {0.75, 1.0, 1.25} relative to MobileNetV2.
  • A typical case is S = 432 (per-block sub search space size), B = 5 (blocks), and N = 3 (average layers per block), where the factorized search space size is S^B ≈ 10¹³, versus the flat per-layer approach with search space size S^(B·N) ≈ 10³⁹.
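A rough sketch of a per-block sub search space and the size comparison above; the option lists are illustrative (they do not enumerate every choice in the paper, e.g. the convolution op types), and only S = 432, B = 5, N = 3 are taken from the text:

    # Illustrative per-block choices (not the paper's full enumeration):
    sub_space = {
        "kernel_size": [3, 5],                     # 3x3 or 5x5
        "se_ratio":    [0.0, 0.25],                # squeeze-and-excitation ratio
        "skip_op":     ["pool", "identity", "none"],
        "filter_mult": [0.75, 1.0, 1.25],          # filter size relative to MobileNetV2
        "layer_delta": [-1, 0, +1],                # #layers per block relative to MobileNetV2
    }

    S = 432          # per-block sub search space size (typical case from the text)
    B, N = 5, 3      # number of blocks, average layers per block

    factorized = S ** B        # one configuration per block   -> ~1.5e13
    per_layer  = S ** (B * N)  # every layer chosen separately -> ~3.4e39
    print(f"{factorized:.1e} vs {per_layer:.1e}")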

2.2. Search Algorithm

An Overview of Platform-Aware Neural Architecture Search for Mobile
  • For each sampled model m, it is trained on the target task to get its accuracy ACC(m) and run on real phones to get its inference latency LAT(m). The reward value R(m) is then calculated using the above equation (a simplified loop is sketched after this list).
  • (if interested, please feel free to read NASNet.)
  • The architecture search is performed on the ImageNet training set but with fewer training steps (5 epochs).
  • It takes 4.5 days on 64 TPUv2 devices.
  • During search, the real-world latency of each sampled model is measured by running it on the single-thread big CPU core of Pixel 1 phones.
  • In total, the controller samples about 8K models during architecture search, but only 15 top-performing models are transferred to the full ImageNet and only 1 model is transferred to COCO.
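Putting the pieces together, a highly simplified outline of the search loop, under stated assumptions: `controller`, `build_model`, and `train_briefly` are hypothetical stand-ins for the RNN controller trained with reinforcement learning (as in NASNet) and the 5-epoch proxy training; `reward` and `measure_latency_ms` are the sketches from earlier sections:

    def search(controller, example_input, num_samples=8000, target_ms=75.0):
        """Platform-aware NAS loop: reward = ACC(m) * (LAT(m)/T)^w."""
        for _ in range(num_samples):
            arch = controller.sample()                      # sample a model from the factorized search space
            model = build_model(arch)
            acc = train_briefly(model, epochs=5)            # proxy task: 5 epochs on ImageNet
            lat = measure_latency_ms(model, example_input)  # run on the real phone (big CPU core)
            controller.update(arch, reward(acc, lat, target_ms))  # RL update, e.g. policy gradient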

3. Experimental Results

3.1. ImageNet Classification Performance

Performance Results on ImageNet Classification
  • Three top-performing MnasNet models, with different latency-accuracy trade-offs from the same search experiment, are picked and compared with existing mobile models.
  • The MnasNet-A1 model achieves 75.2% top-1 / 92.5% top-5 accuracy with 78ms latency and 3.9M parameters / 312M multiply-adds, a new SOTA accuracy for this typical mobile latency constraint.
  • MnasNet runs 1.8× faster than MobileNetV2 (1.4) on the same Pixel phone with 0.5% higher accuracy.
  • Compared with automatically searched CNN models, MnasNet runs 2.3× faster than the mobile-size NASNet-A with 1.2% higher top-1 accuracy.
  • The slightly larger MnasNet-A3 model achieves better accuracy than ResNet-50, but with 4.8× fewer parameters and 10× fewer multiply-adds.
  • It also outperforms AmoebaNet, PNASNet, ShuffleNet V2, ShuffleNet V1, CondenseNet, MobileNetV1, and SqueezeNext.
MnasNet with/without SE

3.2. Model Scaling Performance

Performance Comparison with Different Model Scaling Techniques
  • Similarly, the MnasNet model is also robust to input size changes and consistently outperforms MobileNetV2 (by up to 4.1% higher accuracy) across all input image sizes from 96 to 224.
Model Scaling vs. Model Search
  • As shown above, the accuracy is further improved with a new architecture search targeting a 22ms latency constraint.
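For context on the depth multiplier compared above: it simply scales every layer's channel count. A minimal sketch, assuming the common MobileNet-style convention of rounding channels to a multiple of 8 (that rounding rule is an assumption here, not something stated in this post):

    def scale_channels(channels, multiplier, divisor=8):
        """Scale a layer's channel count by a depth/width multiplier,
        rounding to a multiple of `divisor` (common convention, assumed)."""
        scaled = max(divisor, int(channels * multiplier + divisor / 2) // divisor * divisor)
        if scaled < 0.9 * channels * multiplier:   # avoid shrinking by more than ~10%
            scaled += divisor
        return scaled

    # e.g. scale_channels(32, 0.75) -> 24, scale_channels(96, 1.4) -> 136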

3.3. COCO Object Detection

Performance Results on COCO Object Detection
  • MnasNet model achieves comparable mAP quality (23.0 vs 23.2) as SSD300 with 7.4× fewer parameters and 42× fewer multiply-adds.

3.4. Soft vs. Hard Latency Constraint

(a) Hard Constraint, (b) Soft Constraint
  • (b): It samples more models around the target latency value at 75ms, but also explores models with latency smaller than 40ms or greater than 110ms.

3.5. Disentangling Search Space and Reward

Comparison of Decoupled Search Space and Reward Design
  • Multi-obj: i.e. Soft Constraint.
  • Cell-based: i.e. NAS in NASNet without the use of Factorized Hierarchical Search Space.
  • With only the multi-objective reward (on the cell-based search space), lower latency is achieved but top-1 accuracy is also reduced.
  • With both the multi-objective reward and the MnasNet search space, top-1 accuracy is improved together with lower latency.

3.6. MnasNet Architecture and Layer Diversity

MnasNet-A1 Architecture
  • MnasNet uses both 3×3 and 5×5 convolutions, unlike previous mobile models, which use only 3×3 convolutions.
Performance Comparison of MnasNet and Its Variants
  • As shown above, the original MnasNet model has a much better accuracy-latency trade-off than those variants.


