[Paper] FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (Image Classification)

Outperforms DARTS, MnasNet, PNASNet, NASNet, ShuffleNet V2, MobileNetV2 & CondenseNet

Sik-Ho Tsang
7 min read · Nov 15, 2020
Differentiable neural architecture search (DNAS) for ConvNet design

In this paper, FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (FBNet), by UC Berkeley, Princeton University, and Facebook Inc, is presented. In this paper:

  • A differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures.
  • FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS, outperform state-of-the-art approaches.

This is a paper in 2019 CVPR with over 300 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Search Space
  2. Latency-Aware Loss Function
  3. Differentiable Neural Architecture Search (DNAS)
  4. Experimental Results

1. Search Space

1.1. Fixed Macro Architecture

The fixed macro architecture of the search space
  • A fixed macro architecture is defined.
  • The first and the last three layers of the network have fixed operators.
  • For the rest of the layers (TBS, To Be Searched), their block type needs to be searched.
  • Here, f denotes the filter size, n the number of blocks, and s the stride.

1.2. Searchable Micro Architecture

The block structure of the micro-architecture search space.
  • Each searchable layer in the network can choose a different block from the layer-wise search space.
  • The block structure contains a point-wise (1×1) convolution, a K-by-K depthwise convolution where K denotes the kernel size, and another 1×1 convolution.
  • A ReLU follows the first 1×1 convolution and the depthwise convolution, but no ReLU follows the last 1×1 convolution.
  • If the output dimension stays the same as the input dimension, a skip connection adds the input to the output.
  • The expansion ratio e determines how much the output channel size of the first 1×1 convolution is expanded relative to its input channel size.
  • A kernel size of 3 or 5 can be chosen for the depthwise convolution.
  • Group convolution can optionally be used for the first and the last 1×1 convolutions, together with a channel shuffle.
Configurations of candidate blocks in the search space
  • Finally, the layer-wise search space contains 9 candidate blocks, as shown above.
  • There is a block called “skip” that performs no actual computation; this candidate block essentially allows the depth of the network to be reduced.
  • There are 1+4+4+4+4+4+1 = 22 TBS blocks, so the search space contains 9²² ≈ 10²¹ possible architectures, which makes exhaustive search infeasible (a sketch of a candidate block follows below).
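To make the block structure concrete, here is a minimal PyTorch-style sketch of one candidate block, assuming exactly the structure described above (expansion e, depthwise kernel K, optional group convolution with channel shuffle, and a skip connection when shapes match). The class and argument names are illustrative, not from the official FBNet code release; the 9 candidates then correspond to different settings of e, K, and the group flag, plus the “skip” block.

```python
# Minimal sketch of one FBNet-style candidate block (names are illustrative).
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Interleave channels across groups (as in ShuffleNet) after a grouped 1x1 conv.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class CandidateBlock(nn.Module):
    """1x1 conv (expand) -> KxK depthwise conv -> 1x1 conv (project),
    with optional group convolution + channel shuffle, and a skip connection
    when the input and output dimensions match."""
    def __init__(self, c_in, c_out, expansion, kernel_size, stride, groups=1):
        super().__init__()
        c_mid = c_in * expansion          # channel expansion ratio e
        self.groups = groups              # c_in, c_mid, c_out must be divisible by groups
        self.expand = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, groups=groups, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.depthwise = nn.Sequential(
            nn.Conv2d(c_mid, c_mid, kernel_size, stride=stride,
                      padding=kernel_size // 2, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        # No ReLU after the last 1x1 convolution.
        self.project = nn.Sequential(
            nn.Conv2d(c_mid, c_out, 1, groups=groups, bias=False),
            nn.BatchNorm2d(c_out))
        self.use_skip = (stride == 1 and c_in == c_out)

    def forward(self, x):
        out = self.expand(x)
        if self.groups > 1:
            out = channel_shuffle(out, self.groups)
        out = self.depthwise(out)
        out = self.project(out)
        return out + x if self.use_skip else out
```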

2. Latency-Aware Loss Function

  • A latency-aware loss function is used (written out after this list).
  • The first term, CE(a, wa), denotes the cross-entropy loss of architecture a with weights wa.
  • The second term, LAT(a), denotes the latency of the architecture on the target hardware, measured in microseconds. The coefficient α controls the overall magnitude of the loss function, and the exponent β modulates the magnitude of the latency term.
  • A latency lookup table model is used to estimate the overall latency of a network: LAT(a) is the sum of the runtimes of the blocks chosen at each layer-l of architecture a.
  • By benchmarking the latency of a few hundred operators used in the search space, we can easily estimate the actual runtime of the 10²¹ architectures in the entire search space.
  • Also, using the lookup table model makes the latency term in the loss function differentiable.
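The equation images did not survive here; from the term-by-term description above, the latency-aware loss and the lookup-table latency model of the paper can be written as:

```latex
% Latency-aware loss: cross-entropy weighted by a latency penalty
L(a, w_a) = \mathrm{CE}(a, w_a) \cdot \alpha \log\big(\mathrm{LAT}(a)\big)^{\beta}

% Lookup-table latency model: sum of the benchmarked runtimes of the chosen blocks,
% where b_l^(a) is the block at layer-l of architecture a
\mathrm{LAT}(a) = \sum_{l} \mathrm{LAT}\big(b_{l}^{(a)}\big)
```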

3. Differentiable Neural Architecture Search (DNAS)

  • During the inference of the super net, only one candidate block is sampled and executed at each layer. The sampling probability is a softmax over θl, which contains the parameters that determine the sampling probability of each candidate block i at layer-l (the equations for this section are reconstructed after the list).
  • Equivalently, the output of layer-l can be expressed as a sum of the candidate block outputs weighted by a one-hot mask variable ml,i.
  • Each layer samples independently, so the probability of sampling an architecture a is the product of the per-layer block probabilities, where θ denotes the vector consisting of all the θl,i.
  • The problem is relaxed to optimizing the architecture distribution Pθ of the stochastic super net to achieve the minimum expected loss.
  • The discrete mask variable ml,i is relaxed to a continuous random variable computed by the Gumbel Softmax function with temperature τ.
  • As τ approaches 0, this approximates discrete categorical sampling from the distribution; as τ becomes larger, ml,i becomes a continuous random variable.
  • With this relaxation, the latency of architecture a becomes a weighted sum of the per-block latencies from the lookup table, so the overall latency of architecture a is differentiable.
  • The search process is now equivalent to training the stochastic super net.
  • After the super net training finishes, the optimal architectures can be obtained by sampling from the architecture distribution Pθ.
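The equations referenced in the bullets above (also lost with the images) can be reconstructed from the paper's formulation as follows:

```latex
% Sampling probability of candidate block i at layer-l: softmax over theta_l
P_{\theta_l}(b_l = b_{l,i}) = \mathrm{softmax}(\theta_{l,i};\,\theta_l)
                            = \frac{\exp(\theta_{l,i})}{\sum_{i} \exp(\theta_{l,i})}

% Output of layer-l: masked sum over candidate blocks (m_{l,i} is a one-hot 0/1 mask)
x_{l+1} = \sum_{i} m_{l,i} \cdot b_{l,i}(x_l)

% Layers sample independently, so an architecture a has probability
P_{\theta}(a) = \prod_{l} P_{\theta_l}\big(b_l = b_{l,i}^{(a)}\big)

% Relaxed search problem: minimize the expected loss over the distribution P_theta
\min_{\theta}\,\min_{w_a}\; \mathbb{E}_{a \sim P_{\theta}}\, L(a, w_a)

% Gumbel Softmax relaxation of the mask, with g_{l,i} ~ Gumbel(0, 1) and temperature tau
m_{l,i} = \frac{\exp\big((\theta_{l,i} + g_{l,i}) / \tau\big)}{\sum_{i} \exp\big((\theta_{l,i} + g_{l,i}) / \tau\big)}

% Differentiable latency of architecture a via the lookup table
\mathrm{LAT}(a) = \sum_{l} \sum_{i} m_{l,i} \cdot \mathrm{LAT}(b_{l,i})
```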

In each epoch, first train the operator weights wa and then the architecture probability parameter θ.

After the search finishes, several architectures are sampled from the trained distribution Pθ and trained from scratch.
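A minimal sketch of this alternating schedule is shown below, assuming a super net whose operator weights and architecture parameters (named `theta` here) can be separated, and a hypothetical `latency_aware_loss` helper implementing the loss of Section 2. The optimizer choices mirror Section 4.1 below, but the hyper-parameter values are placeholders, not the paper's.

```python
# Sketch of the DNAS training loop: within each epoch, first update the operator
# weights w_a, then the architecture parameters theta (placeholder hyper-parameters).
import torch

def train_dnas(supernet, weight_loader, arch_loader, latency_aware_loss, epochs=90):
    # weight_loader: 80% of the training set (updates w_a)
    # arch_loader:   the remaining 20% (updates theta)
    w_params = [p for n, p in supernet.named_parameters() if "theta" not in n]
    theta_params = [p for n, p in supernet.named_parameters() if "theta" in n]
    opt_w = torch.optim.SGD(w_params, lr=0.1, momentum=0.9, weight_decay=1e-4)
    opt_theta = torch.optim.Adam(theta_params, lr=0.01, weight_decay=5e-4)

    for epoch in range(epochs):
        supernet.train()
        # 1) Train operator weights w_a (theta is not updated here).
        for x, y in weight_loader:
            opt_w.zero_grad()
            logits = supernet(x)                        # Gumbel-Softmax sampling inside
            latency_aware_loss(supernet, logits, y).backward()
            opt_w.step()
        # 2) Train architecture parameters theta (w_a is not updated here).
        for x, y in arch_loader:
            opt_theta.zero_grad()
            logits = supernet(x)
            latency_aware_loss(supernet, logits, y).backward()
            opt_theta.step()
    # Afterwards, architectures are sampled from the learned distribution P_theta
    # and retrained from scratch.
```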

  • The proposed DNAS algorithm is orders of magnitude faster than previous RL-based NAS methods.
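To tie the pieces of this section together, here is a minimal PyTorch-style sketch of one searchable layer of the stochastic super net: it mixes the candidate block outputs with Gumbel-Softmax weights and exposes a differentiable latency term built from the lookup table. Class and attribute names are mine, not the paper's, and the temperature schedule is omitted.

```python
# Sketch of one searchable (TBS) layer of the stochastic super net.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    def __init__(self, candidate_blocks, candidate_latencies, tau=5.0):
        super().__init__()
        self.blocks = nn.ModuleList(candidate_blocks)                    # the 9 candidates
        self.register_buffer("lat", torch.tensor(candidate_latencies))  # from the lookup table
        self.theta = nn.Parameter(torch.zeros(len(candidate_blocks)))   # architecture params theta_l
        self.tau = tau                                                   # Gumbel-Softmax temperature

    def forward(self, x):
        # Relaxed mask m_{l,i}: Gumbel-Softmax over theta_l with temperature tau.
        m = F.gumbel_softmax(self.theta, tau=self.tau)
        out = sum(m[i] * block(x) for i, block in enumerate(self.blocks))
        # Differentiable expected latency contribution of this layer.
        self.latency = (m * self.lat).sum()
        return out
```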

4. Experimental Results

4.1. ImageNet

ImageNet classification results
  • Samsung Galaxy S8 with a Qualcomm Snapdragon 835 platform is targeted.
  • wa is trained on 80% of the ImageNet training set using SGD with momentum. The architecture distribution parameter θ is trained on the remaining 20% of the ImageNet training set with Adam.
  • In the first group, FBNet-A achieves 73.0% accuracy, better than 1.0-MobileNetV2 (+1.0%), 1.5-ShuffleNet V2 (+0.4%), and CondenseNet (+2.0%), and is on par with DARTS and MnasNet-65.
  • Regarding latency, FBNet-A is 1.9 ms (relative 9.6%), 2.2 ms (relative 11%), and 8.6 ms (relative 43%) better than the MobileNetV2, ShuffleNet V2, and CondenseNet counterparts.
  • FBNet-A’s FLOP count is only 249M, 50M smaller (relative 20%) than MobileNetV2 and ShuffleNet V2, 20M (relative 8%) smaller than MnasNet, and 2.4× smaller than DARTS.
  • In the second group, FBNet-B achieves comparable accuracy with 1.3-MobileNetV2, but the latency is 1.46× lower, and the FLOP count is 1.73× smaller, even smaller than 1.0-MobileNetV2 and 1.5-ShuffleNet V2.
  • Compared with MnasNet, FBNet-B’s accuracy is 0.1% higher, latency is 0.6ms lower, and FLOP count is 22M (relative 7%) smaller.
  • In the third group, FBNet-C achieves 74.9% accuracy, same as 2.0-ShuffleNet V2 and better than all others.
  • The latency is 28.1 ms, 1.33× and 1.19× faster than MobileNetV2 and ShuffleNet V2.
  • The FLOP count is 1.56×, 1.58×, and 1.03× smaller than MobileNetV2, ShuffleNet V2, and MnasNet-92.
  • Among all the automatically searched models, FBNet's performance is much stronger than DARTS, PNASNet, and NASNet, and better than MnasNet, while the search cost is orders of magnitude lower.
  • The FBNet search uses 8 GPUs for only 27 hours, so the computational cost is only 216 GPU-hours: 421× faster than MnasNet, 222× faster than NASNet, 27.8× faster than PNASNet, and 1.33× faster than DARTS.

4.2. Different Resolution and Channel Size Scaling

  • The FBNet-96–0.35–1 model achieves 50.2% (+4.7%) accuracy and 2.9 ms latency (345 frames per second) on a Samsung Galaxy S8.
Visualization of some of the searched architectures
  • Many layers are skipped and the network is much shallower than FBNet-{A, B, C}: with a smaller input size, the receptive field needed to parse the image also becomes smaller, so having more layers does not effectively increase the accuracy.

4.3. Different Target Devices

Different Target Devices
  • In terms of latency, the model targeted at the iPhone (Samsung) has sub-optimal performance on the Samsung (iPhone). This demonstrates the necessity of re-designing ConvNets for different target devices.
  • As shown above, the upper three operators are faster on the iPhone X, while the lower three operators are significantly faster on the Samsung S8, which explains this phenomenon.
