[Paper] ProxylessNAS: Direct Neural Architecture Search on Target Task (Image Classification)

Outperforms DARTS, MnasNet, AmoebaNet, ENAS, NASNet, ShuffleNet V2, MobileNetV2, PyramidNet, Shake-Shake & DenseNet.

Sik-Ho Tsang
Nov 14, 2020
Left: Conventional NAS, Right: ProxylessNAS

In this story, ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware (ProxylessNAS), by Massachusetts Institute of Technology (MIT), is presented.

In conventional NAS:

  • The prohibitive computational demand (e.g. 10⁴ GPU hours) makes it difficult to directly search the architectures on large-scale tasks such as ImageNet.
  • These methods therefore rely on proxy tasks, such as training on a smaller dataset, and then transfer the learned cells to the large-scale target task.
  • However, architectures optimized on proxy tasks are not guaranteed to be optimal on the target task.

In this paper:

  • ProxylessNAS is proposed. It directly learns architectures for large-scale target tasks and target hardware platforms by addressing the high memory consumption of differentiable NAS and reducing the computational cost.

This is a paper in 2019 ICLR with over 500 citations. (Sik-Ho Tsang @ Medium)


  1. Memory Problems in DARTS
  2. Learning Binarized Path
  3. Training Binarized Architecture Parameters
  4. Making Latency Differentiable
  5. Models Learnt for GPU/CPU/Mobile
  6. Experimental Results

1. Memory Problems in DARTS

Learning both weight parameters and binarized architecture parameters
  • A neural network is denoted as N(e1,…,en), where ei represents a certain edge in the directed acyclic graph (DAG).
  • Let O = {oi} be the set of N candidate primitive operations (e.g. convolution).
  • Instead of setting each edge to be a definite primitive operation, each edge is set to be a mixed operation that has N parallel paths as shown above.
  • Given input x, the output of a mixed operation mO is defined based on the outputs of its N paths.
  • In DARTS, mO(x) is the weighted sum of {oi(x)}. The weights are calculated by applying softmax to the N real-valued architecture parameters {αi}:
  • Thus, DARTS needs roughly N times the GPU memory and GPU hours of training a compact model, because the output feature maps of all N paths are computed and stored. On a large-scale dataset with a large design space, this can easily exceed the hardware's memory limits.
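
The DARTS mixed operation described in the bullets above can be written out as follows. This is reconstructed from the description (a softmax over the αi used as weights in a sum over the N paths); the superscript label is only a naming choice.

```latex
\[
m_{\mathcal{O}}^{\text{DARTS}}(x)
  = \sum_{i=1}^{N} p_i \, o_i(x)
  = \sum_{i=1}^{N} \frac{\exp(\alpha_i)}{\sum_{j=1}^{N} \exp(\alpha_j)} \, o_i(x)
\]
```

Every term oi(x) appears in the sum, which is why all N output feature maps have to be kept in memory at once.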

In this paper, the authors solve this memory issue using path binarization.

2. Learning Binarized Path

  • To reduce memory footprint, only one path is kept when training the over-parameterized network.
  • The N real-valued architecture parameters {αi} are transformed into binary gates:
  • Based on the binary gates g, the output of the mixed operation is given as:
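
The two expressions referred to above can be sketched as follows, reconstructed from the surrounding description: the path probabilities pi are the softmax of the αi, exactly one gate is set to 1, and gate i is chosen with probability pi. The superscript label is only a naming choice.

```latex
\[
g = \operatorname{binarize}(p_1, \dots, p_N) =
\begin{cases}
[1, 0, \dots, 0] & \text{with probability } p_1 \\
\qquad \vdots & \\
[0, 0, \dots, 1] & \text{with probability } p_N
\end{cases}
\]
\[
m_{\mathcal{O}}^{\text{Binary}}(x) = \sum_{i=1}^{N} g_i \, o_i(x)
\]
```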

By using binary gates rather than real-valued path weights, only one path of activations is held in memory at run-time, so the memory requirement of training the over-parameterized network is reduced to the same level as training a compact model.
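
A minimal sketch of the idea in plain Python, assuming toy scalar "operations" in place of real layers. The function names (`sample_binary_gates`, `mixed_op_binary`) are hypothetical and are not taken from the authors' code.

```python
import math
import random


def softmax(alphas):
    """Turn real-valued architecture parameters into path probabilities."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def sample_binary_gates(alphas, rng):
    """Sample a one-hot gate vector: gate i is 1 with probability p_i."""
    probs = softmax(alphas)
    active = rng.choices(range(len(alphas)), weights=probs, k=1)[0]
    return [1 if i == active else 0 for i in range(len(alphas))]


def mixed_op_binary(x, ops, gates):
    """Evaluate only the path whose gate is 1; the other paths never run."""
    active = gates.index(1)
    return ops[active](x)


# Toy candidate operations standing in for conv / pooling / identity.
toy_ops = [lambda x: x, lambda x: 2 * x, lambda x: x + 1]
rng = random.Random(0)
gates = sample_binary_gates([0.1, 0.5, -0.2], rng)
output = mixed_op_binary(3.0, toy_ops, gates)
```

Because only one operation is called per forward pass, only one set of activations would need to be stored, in contrast to the weighted sum over all N paths used by DARTS.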

That’s more than an order of magnitude memory saving.

3. Training Binarized Architecture Parameters

3.1. General Steps

  1. When training the weight parameters, the architecture parameters are frozen and binary gates are stochastically sampled.
  2. When training the architecture parameters, the weight parameters are frozen, then the binary gates are reset and the architecture parameters are updated on the validation set.
  • The above 2 steps are performed alternately.
  • Once the training of architecture parameters is finished, the compact architecture is derived by pruning redundant paths, by simply choosing the path with the highest path weight.
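
The final pruning step can be sketched as below, assuming one list of architecture parameters per edge. The operation names and α values are made up for illustration. Since softmax is monotonic, picking the largest αi is the same as picking the largest path weight.

```python
def derive_architecture(edge_alphas, op_names):
    """For every edge, keep only the candidate with the highest path weight."""
    chosen = []
    for alphas in edge_alphas:
        best = max(range(len(alphas)), key=lambda i: alphas[i])
        chosen.append(op_names[best])
    return chosen


# Hypothetical candidate set and learned parameters for two edges.
op_names = ["3x3 MBConv3", "5x5 MBConv6", "7x7 MBConv6", "skip"]
edge_alphas = [
    [0.2, 1.3, -0.4, 0.0],
    [0.1, -0.5, 0.9, 0.3],
]
architecture = derive_architecture(edge_alphas, op_names)
```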

3.2. Update Step (Step 2) of the Architecture Parameters

  • Within an update step of the architecture parameters, two paths are sampled and all the other paths are masked as if they do not exist.
  • As such, the number of candidates temporarily decreases from N to 2.
  • Then, the architecture parameters of these two sampled paths are updated using the gradients.
  • As such, in each update step, one of the sampled paths is enhanced (path weight increases) and the other sampled path is attenuated (path weight decreases), while all other paths remain unchanged.

In this way, regardless of the value of N, only two paths are involved in each update step of the architecture parameters, and the memory requirement is thereby reduced to the same level as training a compact model.
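
The two-path update can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the gradients of the loss with respect to the two binary gates (`grad_g_i`, `grad_g_j`) are taken as given inputs, the chain rule through a two-way softmax is applied by hand, and a rescaling step keeps the path weights of the unsampled paths fixed. All function names are hypothetical.

```python
import math
import random


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_two_paths(alphas, rng):
    """Sample two distinct paths according to the current path weights."""
    probs = softmax(alphas)
    first = rng.choices(range(len(alphas)), weights=probs, k=1)[0]
    rest = [k for k in range(len(alphas)) if k != first]
    second = rng.choices(rest, weights=[probs[k] for k in rest], k=1)[0]
    return first, second


def two_path_update(alphas, i, j, grad_g_i, grad_g_j, lr):
    """Update only alphas[i] and alphas[j]; all other paths are masked."""
    # Softmax restricted to the two sampled paths.
    p_i, p_j = softmax([alphas[i], alphas[j]])
    # Chain rule: dL/da_i = sum_k dL/dg_k * p_k * (delta_ik - p_i), k in {i, j}.
    grad_a_i = grad_g_i * p_i * (1 - p_i) - grad_g_j * p_j * p_i
    grad_a_j = grad_g_j * p_j * (1 - p_j) - grad_g_i * p_i * p_j
    new_i = alphas[i] - lr * grad_a_i
    new_j = alphas[j] - lr * grad_a_j
    # Rescale so exp(a_i) + exp(a_j) is unchanged; this leaves the path
    # weights of every unsampled path exactly as they were.
    old_mass = math.exp(alphas[i]) + math.exp(alphas[j])
    new_mass = math.exp(new_i) + math.exp(new_j)
    offset = math.log(new_mass) - math.log(old_mass)
    updated = list(alphas)
    updated[i] = new_i - offset
    updated[j] = new_j - offset
    return updated


alphas = [0.0, 0.5, -0.3, 0.2]
rng = random.Random(0)
i, j = sample_two_paths(alphas, rng)
# Pretend path i lowers the loss (negative gradient) and path j raises it.
updated = two_path_update(alphas, i, j, grad_g_i=-1.0, grad_g_j=1.0, lr=0.1)
old_probs, new_probs = softmax(alphas), softmax(updated)
```

In this toy run the path weight of path i goes up, that of path j goes down, and the remaining path weights stay where they were, matching the behaviour described above.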

By doing so, ProxylessNAS can directly learn architectures on the large-scale target task, i.e. ImageNet.

4. Making Latency Differentiable

4.1. Mobile Latency Modelling

  • Measuring the latency on-device is accurate but not ideal for 2 reasons.
  1. Slow: A precise measurement requires averaging many inference runs, which takes about 20 seconds per architecture.
  2. Expensive: A lot of mobile devices and software engineering work are required to build an automatic pipeline to gather the latency from a mobile farm.
  • It is better to build a model to estimate the latency.
  • 5k architectures are sampled from a candidate space, where 4k architectures are used to build the latency model and the remaining 1k are used for testing.
  • The latency is measured on a Google Pixel 1 phone using TensorFlow-Lite.
  • The features include (i) the type of the operator, (ii) the input and output feature map sizes, and (iii) other attributes such as kernel size, stride for convolution, and expansion ratio.
The latency RMSE is 0.75ms.
  • A strong correlation is observed between the predicted latency and real measured latency on the test set, suggesting that the latency prediction model can be used to replace the expensive mobile farm infrastructure.
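
A minimal sketch of how such a predictor could be evaluated. The source does not say what form the latency model takes, so the lookup table keyed by operator features is an assumption, and every number below is made up for illustration.

```python
import math
from typing import NamedTuple


class OpFeatures(NamedTuple):
    """Features listed above: operator type, map sizes, other attributes."""
    op_type: str
    kernel_size: int
    stride: int
    expansion: int
    in_size: int
    out_size: int


# Hypothetical per-operator latencies in milliseconds.
LATENCY_TABLE_MS = {
    OpFeatures("mbconv", 3, 1, 3, 56, 56): 2.1,
    OpFeatures("mbconv", 5, 2, 6, 56, 28): 3.4,
    OpFeatures("mbconv", 7, 1, 6, 28, 28): 2.9,
}


def predict_latency_ms(ops):
    """Estimate network latency as the sum of per-operator latencies."""
    return sum(LATENCY_TABLE_MS[op] for op in ops)


def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured latencies."""
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


op_a, op_b, op_c = list(LATENCY_TABLE_MS)
test_archs = [[op_a, op_b], [op_b, op_c]]
predicted = [predict_latency_ms(arch) for arch in test_archs]
measured = [5.0, 7.0]  # made-up on-device measurements
error_ms = rmse(predicted, measured)
```

On a held-out set like the 1k test architectures mentioned above, the same RMSE computation would be the figure of merit for the predictor.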

4.2. Latency Regularization Loss

Latency regularization loss.
  • After that, we can get the expected latency of a mixed operation (i.e. a learnable block) as:
  • where E[latencyi] is the expected latency of the ith learnable block, F(.) denotes the latency prediction model.
  • For the whole model, the expected latency of the network can be expressed with the sum of these mixed operations’ expected latencies:
  • Therefore, the loss function becomes:
  • where the scaling factor λ2 (> 0) controls the trade-off between accuracy and latency, LossCE denotes the cross-entropy loss, and ||w||₂² is the weight decay term.
  • As an alternative to BinaryConnect, a REINFORCE-based approach can also be used. Consider a network that has binarized parameters α. The goal of updating the binarized parameters is to find the optimal binary gates g that maximize a certain reward. (I am not an expert on reinforcement learning, so I won’t go deep into it.)
  • So there are 2 approaches to update the architecture parameters: Proxyless-G (Gradient) and Proxyless-R (REINFORCE).
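
The expected-latency and loss expressions described in this section can be written out as follows. They are reconstructed from the definitions in the bullets: p_j^i is the path weight of the j-th candidate o_j^i in the i-th learnable block, F is the latency prediction model, and λ1 is assumed here to be the weight-decay coefficient.

```latex
\[
\mathbb{E}[\text{latency}_i] = \sum_{j} p_j^{i} \times F(o_j^{i})
\]
\[
\mathbb{E}[\text{latency}] = \sum_{i} \mathbb{E}[\text{latency}_i]
\]
\[
\text{Loss} = \text{Loss}_{CE} + \lambda_1 \lVert w \rVert_2^2 + \lambda_2 \, \mathbb{E}[\text{latency}]
\]
```

Because the expected latency is a smooth function of the path weights, its gradient with respect to the architecture parameters is well defined, which is what makes latency usable as a regularization term.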

5. Models Learnt for GPU/CPU/Mobile

Efficient models optimized for different hardware (Models at each epoch: https://hanlab.mit.edu/projects/proxylessNAS/test.mp4)
  • The above figure demonstrates the detailed architectures of searched CNN models on three hardware platforms: GPU/CPU/Mobile.
  • The GPU model is shallower and wider, especially in early stages where the feature map has higher resolution.
  • The GPU model prefers large MBConv operations (e.g. 7×7 MBConv6), while the CPU model goes for smaller MBConv operations. This is because a GPU has much higher parallelism than a CPU.
  • Another interesting observation is that the searched models on all platforms prefer larger MBConv operations in the first block within each stage, where the feature map is downsampled. This might be because larger MBConv operations help the network preserve more information when downsampling. Notably, such patterns cannot be captured by previous NAS methods that repeat the same block.

6. Experimental Results

6.1. CIFAR-10

Test error on CIFAR-10
  • c/o: means Cutout.
  • Specifically, Proxyless-G reaches a test error rate of 2.08% which is slightly better than AmoebaNet-B.
  • Notably, AmoebaNet-B uses 34.9M parameters while Proxyless-G/R use only 5.7M parameters, which is about 6× fewer than AmoebaNet-B.
  • ProxylessNAS demonstrates the benefits of directly exploring a large architecture space instead of repeatedly stacking the same block.

6.2. ImageNet

Comparison with MobileNetV2
  • MobileNetV2 is used as backbone.
  • Specifically, rather than repeating the same mobile inverted bottleneck convolution (MBConv), a set of MBConv layers is allowed with various kernel sizes {3, 5, 7} and expansion ratios {3, 6}.
  • ProxylessNAS consistently outperforms MobileNetV2 under various latency settings.
  • MobileNetV2 has 143ms latency while the ProxylessNAS model needs only 78ms (1.83× faster).
Accuracy on ImageNet
  • Compared with MnasNet, the ProxylessNAS model achieves 0.6% higher top-1 accuracy with slightly lower mobile latency.
  • More importantly, it is much more resource efficient. The search cost in GPU hours is 200× lower than that of MnasNet.
  • Also, Proxyless-G has no incentive to choose computation-cheap operations if it were not for the latency regularization loss. It is essential to take latency as a direct objective.
Accuracy (%) and GPU latency (Tesla V100) on ImageNet
  • On GPU, ProxylessNAS achieves superior performance compared to both human-designed and automatically searched architectures.
  • Specifically, compared to MobileNetV2 and MnasNet, the ProxylessNAS model improves the top-1 accuracy by 3.1% and 1.1% respectively while being 1.2× faster.
Hardware prefers specialized models.
  • An interesting observation is that models optimized for GPU do not run fast on CPU and mobile phone, and vice versa.


