[Paper] FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (Image Classification)

In this paper, FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (FBNet), by UC Berkeley, Princeton University, and Facebook Inc., is presented. In this paper:

  • A differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures.
  • FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS, outperform state-of-the-art approaches.

This is a paper in 2019 CVPR with over 300 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Search Space
  2. Latency-Aware Loss Function
  3. Differentiable Neural Architecture Search (DNAS)
  4. Experimental Results

1. Search Space

1.1. Fixed Macro Architecture

  • A fixed macro architecture is defined.
  • The first and the last three layers of the network have fixed operators.
  • For the rest of the layers (TBS, To Be Searched), their block type needs to be searched.
  • f is the filter size, n is the number of blocks, and s is the stride. (A minimal config-style sketch of this layout is given after this list.)
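To make the macro-architecture description concrete, below is a minimal, hypothetical config-style sketch in Python. Only the stage layout (a fixed stem and head, with 22 TBS layers split as 1 + 4 + 4 + 4 + 4 + 4 + 1) follows the description; the filter sizes f used here are illustrative placeholders, not the values from the paper's table.

```python
# Hypothetical stage-wise macro-architecture config (illustrative only).
# Each entry is (block, f, n, s): block type, output filter size f,
# number of blocks n in the stage, and stride s of the stage.
MACRO_ARCH = [
    ("conv3x3", 16,   1, 2),   # fixed stem convolution
    ("TBS",     16,   1, 1),   # searchable stages: block types chosen by DNAS
    ("TBS",     32,   4, 2),
    ("TBS",     64,   4, 2),
    ("TBS",     128,  4, 2),
    ("TBS",     128,  4, 1),
    ("TBS",     256,  4, 2),
    ("TBS",     512,  1, 1),
    ("conv1x1", 1024, 1, 1),   # fixed head: 1x1 conv, pooling, classifier
    ("avgpool", None, 1, 1),
    ("fc",      1000, 1, 1),
]

# Number of searchable (TBS) layers: 1 + 4 + 4 + 4 + 4 + 4 + 1 = 22.
num_tbs = sum(n for block, f, n, s in MACRO_ARCH if block == "TBS")
print(num_tbs)  # 22
```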

1.2. Searchable Micro Architecture

  • Each searchable layer in the network can choose a different block from the layer-wise search space.
  • The block structure contains a point-wise (1×1) convolution, a K-by-K depthwise convolution where K denotes the kernel size, and another 1×1 convolution.
  • ReLU is used after the first 1×1 convolution and the depthwise convolution, but there is no ReLU after the last 1×1 convolution.
  • If the output dimension stays the same as the input dimension, we use a skip connection to add the input to the output.
  • The expansion ratio e determines how much the output channel size of the first 1×1 convolution is expanded compared with its input channel size.
  • A kernel size of 3 or 5 for the depthwise convolution can be chosen.
  • Group convolution can optionally be used for the first and the last 1×1 convolutions, together with a channel shuffle.
  • Finally, the layer-wise search space contains 9 candidate blocks, as shown above (a code sketch of one such block is given after this list).
  • There is a block called “skip” without actual computations. This candidate block essentially allows us to reduce the depth of the network.
  • There are 1+4+4+4+4+4+1 = 22 TBS blocks, so the search space contains 9²² ≈ 10²¹ possible architectures, which makes the search a non-trivial task.
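To make the block structure concrete, here is a minimal PyTorch-style sketch of one candidate block: a 1×1 expansion convolution, a K×K depthwise convolution, a 1×1 projection convolution without ReLU, an optional grouped 1×1 convolution with channel shuffle, and a skip connection when the input and output shapes match. This is an illustrative reconstruction from the description above, not the authors' implementation; details such as batch-norm placement may differ.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Reorder channels so information mixes across groups after a grouped 1x1 conv.
    n, c, h, w = x.size()
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class CandidateBlock(nn.Module):
    """One searchable block: 1x1 conv -> KxK depthwise conv -> 1x1 conv."""
    def __init__(self, c_in, c_out, expansion=3, kernel=3, stride=1, groups=1):
        super().__init__()
        c_mid = c_in * expansion  # expansion ratio e widens the first 1x1 conv
        self.groups = groups
        self.use_skip = (stride == 1 and c_in == c_out)
        self.pw1 = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, groups=groups, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.dw = nn.Sequential(
            nn.Conv2d(c_mid, c_mid, kernel, stride, padding=kernel // 2,
                      groups=c_mid, bias=False),  # depthwise: one group per channel
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.pw2 = nn.Sequential(  # no ReLU after the last 1x1 conv
            nn.Conv2d(c_mid, c_out, 1, groups=groups, bias=False),
            nn.BatchNorm2d(c_out))

    def forward(self, x):
        out = self.pw1(x)
        if self.groups > 1:
            out = channel_shuffle(out, self.groups)
        out = self.dw(out)
        out = self.pw2(out)
        return out + x if self.use_skip else out

# Example: a block with expansion 6, 5x5 depthwise kernel, stride 2.
block = CandidateBlock(c_in=32, c_out=64, expansion=6, kernel=5, stride=2)
y = block(torch.randn(1, 32, 56, 56))  # -> shape (1, 64, 28, 28)
```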

2. Latency-Aware Loss Function

  • The following loss function is used: L(a, w_a) = CE(a, w_a) · α · log(LAT(a))^β.
  • The first term CE(a, w_a) denotes the cross-entropy loss of architecture a with parameters w_a.
  • The second term LAT(a) denotes the latency of the architecture on the target hardware, measured in microseconds. The coefficient α controls the overall magnitude of the loss function, and the exponent coefficient β modulates the magnitude of the latency term.
  • A latency lookup table model is used to estimate the overall latency of a network based on the runtime of each operator: LAT(a) = Σ_l LAT(b_l^(a)), where b_l^(a) denotes the block at layer l of architecture a. (A minimal sketch of this lookup-table model is given after this list.)
  • By benchmarking the latency of a few hundred operators used in the search space, we can easily estimate the actual runtime of the 10²¹ architectures in the entire search space.
  • Also, using the lookup table model makes the latency term in the loss function differentiable.
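Below is a minimal Python sketch of the lookup-table latency model and the latency-aware loss, assuming the multiplicative form L = CE(a, w_a) · α · log(LAT(a))^β given above. The table keys, latency values, and the α, β settings here are made up for illustration.

```python
import math

# Hypothetical per-operator latency lookup table (microseconds), built once by
# benchmarking each candidate block configuration on the target phone.
LATENCY_LUT = {
    ("k3_e6", 112, 2): 2900.0,  # (block type, input resolution, stride) -> latency
    ("k5_e6", 56, 1):  2100.0,
    ("k3_e1", 56, 1):   800.0,
    ("skip",  56, 1):     0.0,
}

def estimated_latency(arch):
    """LAT(a) = sum over layers of the benchmarked latency of the chosen block."""
    return sum(LATENCY_LUT[block] for block in arch)

def latency_aware_loss(cross_entropy, latency, alpha=0.2, beta=0.6):
    """Loss = CE * alpha * log(LAT)^beta (alpha, beta values are illustrative)."""
    return cross_entropy * alpha * math.log(latency) ** beta

arch = [("k3_e6", 112, 2), ("k5_e6", 56, 1), ("skip", 56, 1)]
print(estimated_latency(arch))                      # 5000.0 microseconds
print(latency_aware_loss(2.3, estimated_latency(arch)))
```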

3. Differentiable Neural Architecture Search (DNAS)

  • During the inference of the super net, only one candidate block is sampled and executed, with sampling probability P_{θ_l}(b_l = b_{l,i}) = softmax(θ_{l,i}; θ_l) = exp(θ_{l,i}) / Σ_j exp(θ_{l,j}), where θ_l contains the parameters that determine the sampling probability of each candidate block b_{l,i} at layer l.
  • Equivalently, the output of layer l can be expressed as x_{l+1} = Σ_i m_{l,i} · b_{l,i}(x_l), where m_{l,i} ∈ {0, 1} is a random mask variable that equals 1 if block b_{l,i} is sampled and 0 otherwise.
  • Letting each layer sample independently, the probability of sampling an architecture a can be described as P_θ(a) = Π_l P_{θ_l}(b_l = b_{l,i}^(a)), where θ denotes the vector consisting of all the θ_{l,i}.
  • The problem is then relaxed to optimizing the architecture distribution P_θ of the stochastic super net to achieve the minimum expected loss: min_θ min_{w_a} E_{a∼P_θ}[L(a, w_a)].
  • The discrete mask variable m_{l,i} is relaxed to a continuous random variable computed by the Gumbel Softmax function: m_{l,i} = GumbelSoftmax(θ_{l,i} | θ_l) = exp((θ_{l,i} + g_{l,i}) / τ) / Σ_j exp((θ_{l,j} + g_{l,j}) / τ), where g_{l,i} ∼ Gumbel(0, 1) is random noise and τ is a temperature parameter.
  • As τ approaches 0, m_{l,i} approximates the discrete categorical sampling following the distribution P_{θ_l}. As τ becomes larger, m_{l,i} becomes a continuous random variable.
  • The latency of architecture a then becomes LAT(a) = Σ_l Σ_i m_{l,i} · LAT(b_{l,i}), where each LAT(b_{l,i}) is a constant from the lookup table.
  • Thus, the overall latency of architecture a is differentiable with respect to m_{l,i}, and hence with respect to θ. (A PyTorch-style sketch of such a searchable layer is given after this list.)
  • The search process is now equivalent to training the stochastic super net.
  • After the super net training finishes, we can then obtain the optimal architectures by sampling from the architecture distribution P_θ.
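The following PyTorch-style sketch shows the core of one searchable layer of the stochastic super net as described above: per-layer architecture parameters θ_l, a Gumbel-Softmax relaxation of the mask m_{l,i} (via torch.nn.functional.gumbel_softmax), the soft mixture of candidate block outputs, and the differentiable layer latency Σ_i m_{l,i} · LAT(b_{l,i}). It is an illustrative reconstruction, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    """One layer of the stochastic super net: a soft mixture over candidate blocks."""
    def __init__(self, candidate_blocks, block_latencies):
        super().__init__()
        self.blocks = nn.ModuleList(candidate_blocks)
        # theta_l: one logit per candidate block, trained by gradient descent.
        self.theta = nn.Parameter(torch.zeros(len(candidate_blocks)))
        # Benchmarked block latencies from the lookup table (constants, not trained).
        self.register_buffer("latencies", torch.tensor(block_latencies))

    def forward(self, x, tau):
        # m_{l,i}: relaxed one-hot mask sampled with the Gumbel-Softmax trick.
        m = F.gumbel_softmax(self.theta, tau=tau)
        out = sum(m_i * block(x) for m_i, block in zip(m, self.blocks))
        # Differentiable layer latency: sum_i m_{l,i} * LAT(b_{l,i}).
        latency = (m * self.latencies).sum()
        return out, latency

# Example with two shape-preserving candidate "blocks" and illustrative latencies:
layer = SearchableLayer(
    [nn.Conv2d(16, 16, 3, padding=1), nn.Conv2d(16, 16, 5, padding=2)],
    block_latencies=[1300.0, 2400.0],
)
y, lat = layer(torch.randn(1, 16, 32, 32), tau=5.0)
```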

In each epoch, the operator weights w_a are trained first, and then the architecture probability parameter θ.

After the search finishes, several architectures are sampled from the trained distribution P_θ and trained from scratch.
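A minimal sketch of this alternating schedule is shown below. It assumes a super net whose forward pass returns class logits together with the summed differentiable latency (as in the layer sketch above), plus two optimizers: SGD with momentum over the operator weights w_a and Adam over θ, fed by the 80% / 20% split of the training data described in Section 4.1. The loader and optimizer names and the α, β, and temperature settings are placeholders.

```python
import math
import torch
import torch.nn.functional as F

def train_dnas(supernet, weight_loader, theta_loader, w_optimizer, theta_optimizer,
               num_epochs, tau_init=5.0, tau_decay=0.05, alpha=0.2, beta=0.6):
    """Alternate between training operator weights w_a and architecture params theta."""
    def loss_fn(logits, labels, latency):
        # Latency-aware objective: CE * alpha * log(LAT)^beta.
        return F.cross_entropy(logits, labels) * alpha * torch.log(latency) ** beta

    for epoch in range(num_epochs):
        tau = tau_init * math.exp(-tau_decay * epoch)  # anneal Gumbel temperature

        # 1) Update w_a on the weight split (80% of ImageNet train, SGD+momentum).
        for images, labels in weight_loader:
            logits, latency = supernet(images, tau)
            loss = loss_fn(logits, labels, latency)
            w_optimizer.zero_grad(); loss.backward(); w_optimizer.step()

        # 2) Update theta on the held-out split (remaining 20%, Adam).
        for images, labels in theta_loader:
            logits, latency = supernet(images, tau)
            loss = loss_fn(logits, labels, latency)
            theta_optimizer.zero_grad(); loss.backward(); theta_optimizer.step()
```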

  • The proposed DNAS algorithm is orders of magnitude faster than previous RL-based NAS methods.

4. Experimental Results

4.1. ImageNet

  • A Samsung Galaxy S8 with a Qualcomm Snapdragon 835 platform is targeted.
  • w_a is trained on 80% of the ImageNet training set using SGD with momentum, while the architecture distribution parameter θ is trained on the remaining 20% with Adam.
  • In the first group, FBNet-A achieves 73.0% accuracy, better than 1.0-MobileNetV2 (+1.0%), 1.5-ShuffleNet V2 (+0.4%), and CondenseNet (+2%), and is on par with DARTS and MnasNet-65.
  • Regarding latency, FBNet-A is 1.9 ms (relative 9.6%), 2.2 ms (relative 11%), and 8.6 ms (relative 43%) faster than its MobileNetV2, ShuffleNet V2, and CondenseNet counterparts, respectively.
  • FBNet-A’s FLOP count is only 249M, 50M smaller (relative 20%) than MobileNetV2 and ShuffleNet V2, 20M (relative 8%) smaller than MnasNet, and 2.4× smaller than DARTS.
  • In the second group, FBNet-B achieves comparable accuracy with 1.3-MobileNetV2, but the latency is 1.46× lower, and the FLOP count is 1.73× smaller, even smaller than 1.0-MobileNetV2 and 1.5-ShuffleNet V2.
  • Compared with MnasNet, FBNet-B’s accuracy is 0.1% higher, latency is 0.6ms lower, and FLOP count is 22M (relative 7%) smaller.
  • In the third group, FBNet-C achieves 74.9% accuracy, same as 2.0-ShuffleNet V2 and better than all others.
  • The latency is 28.1 ms, 1.33× and 1.19× faster than MobileNetV2 and ShuffleNet V2.
  • The FLOP count is 1.56×, 1.58×, and 1.03× smaller than MobileNetV2, ShuffleNet V2, and MnasNet-92.
  • Among all the automatically searched models, FBNet’s performance is much stronger than DARTS, PNASNet, and NASNet, and better than MnasNet, while the search cost is orders of magnitude lower.
  • The FBNet search takes 8 GPUs for only 27 hours, so the computational cost is only 216 GPU hours, or 421× faster than MnasNet, 222× faster than NASNet, 27.8× faster than PNASNet, and 1.33× faster than DARTS.

4.2. Different Resolution and Channel Size Scaling

  • The FBNet-96-0.35-1 model achieves 50.2% (+4.7%) accuracy and 2.9 ms latency (345 frames per second) on a Samsung Galaxy S8.
  • We can see that many layers are skipped, and the network is much shallower than FBNet-{A, B, C}, because with a smaller input size, the receptive field needed to parse the image also becomes smaller, so having more layers does not effectively increase the accuracy.

4.3. Different Target Devices

  • In terms of latency, the model targeted at the iPhone performs sub-optimally on the Samsung, and vice versa. This demonstrates the necessity of re-designing ConvNets for different target devices.
  • As shown above, the upper three operators are faster on the iPhone X, while the lower three operators are significantly faster on the Samsung S8, which explains this phenomenon.

