# [Paper] FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (Image Classification)

## Outperforms DARTS, MnasNet, PNASNet, NASNet, ShuffleNet V2, MobileNetV2 & CondenseNet


In this story, **FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search (FBNet)**, by UC Berkeley, Princeton University, and Facebook Inc., is reviewed. In this paper:

- **A differentiable neural architecture search (DNAS)** framework is proposed that uses gradient-based methods to **optimize ConvNet architectures**.
- **FBNets (Facebook-Berkeley-Nets)**, a family of models discovered by DNAS, outperform SOTA approaches.

This is a paper in **2019 CVPR** with over **300 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Search Space**
2. **Latency-Aware Loss Function**
3. **Differentiable Neural Architecture Search (DNAS)**
4. **Experimental Results**

# 1. Search Space

## 1.1. Fixed Macro Architecture

- A fixed macro architecture is defined.
- **The first and the last three layers** of the network have **fixed operators**.
- **For the rest of the layers (TBS, To Be Searched)**, their block type needs to be searched.
- *f* is the number of filters of a block, *n* is the number of blocks, and *s* is the stride.

## 1.2. Searchable Micro Architecture

- Each searchable layer in the network can choose a different block from the layer-wise search space.
- The block structure contains a **point-wise (1×1) convolution**, **a *K*-by-*K* depthwise convolution** where *K* denotes the kernel size, and **another 1×1 convolution.**
- **ReLU** is used after the first 1×1 convolution and the depthwise convolution, but there is no ReLU after the last 1×1 convolution.
- If the output dimension stays the same as the input dimension, a **skip connection** is used to add the input to the output.
- **The expansion ratio *e*** determines how much the output channel size of the first 1×1 convolution is expanded compared with its input channel size.
- **A kernel size of 3 or 5** can be chosen for the depthwise convolution.
- **Group convolution** can optionally be used for the first and the last 1×1 convolution, together with channel shuffle.
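To make the roles of the expansion ratio, kernel size, and group convolution more concrete, here is a rough sketch that counts the convolution weights of one such block. The function name `block_params` and the channel numbers are hypothetical, biases and batch normalization are ignored, and this is a back-of-the-envelope count rather than the paper's implementation.

```python
def block_params(c_in, c_out, expansion, kernel, groups=1):
    """Count conv weights in a 1x1 -> KxK depthwise -> 1x1 block (no bias/BN)."""
    c_mid = c_in * expansion              # channels after the first 1x1 conv
    # Group convolution splits channels into `groups` independent groups,
    # so each 1x1 conv needs 1/groups of the dense weights.
    pw1 = c_in * c_mid // groups          # first point-wise (1x1) conv
    dw = kernel * kernel * c_mid          # depthwise conv: one KxK filter per channel
    pw2 = c_mid * c_out // groups         # last point-wise (1x1) conv
    return pw1 + dw + pw2


if __name__ == "__main__":
    # (expansion, kernel, groups) combinations chosen only for illustration.
    for e, k, g in [(1, 3, 1), (1, 3, 2), (6, 3, 1), (6, 5, 1)]:
        print(f"e={e} k={k} g={g}: {block_params(32, 32, e, k, g)} weights")
```

A larger expansion ratio mostly inflates the two 1×1 convolutions, a larger kernel only affects the comparatively cheap depthwise term, and grouping halves the 1×1 cost when `groups=2`.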

- Finally, the layer-wise search space contains **9 candidate blocks**, as shown above.
- There is a block called **"skip" without actual computations**. This candidate block essentially allows us to **reduce the depth** of the network.
- There are 1+4+4+4+4+4+1=22 TBS blocks, so the search space contains **9²² ≈ 10²¹ possible architectures.** Searching a space of this size is a non-trivial task.
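The size of the search space quoted above can be checked with a few lines of arithmetic. The variable names are only illustrative.

```python
tbs_blocks_per_stage = [1, 4, 4, 4, 4, 4, 1]   # TBS layer counts quoted in the text
num_candidates = 9                              # candidate blocks per searchable layer

num_tbs_layers = sum(tbs_blocks_per_stage)
search_space_size = num_candidates ** num_tbs_layers

print(num_tbs_layers)                 # 22
print(f"{search_space_size:.2e}")     # 9.85e+20, i.e. roughly 10^21
```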

# 2. Latency-Aware Loss Function

- The loss function below is used:

$$\mathcal{L}(a, w_a) = \text{CE}(a, w_a) \cdot \alpha \log(\text{LAT}(a))^{\beta}$$

- **The first term** $\text{CE}(a, w_a)$ denotes **the cross-entropy loss** of architecture $a$ with parameter $w_a$.
- **The second term** $\text{LAT}(a)$ denotes **the latency of the architecture** on the target hardware, measured in microseconds.
- The coefficient $\alpha$ controls the overall magnitude of the loss function, and the exponent $\beta$ modulates the magnitude of the latency term.
- **A latency lookup table model** is used to estimate the overall latency of a network based on the runtime of each operator:

$$\text{LAT}(a) = \sum_l \text{LAT}(b_l^{(a)})$$

- where $b_l^{(a)}$ denotes the block at layer-$l$ from architecture $a$.
- By benchmarking the latency of a few hundred operators used in the search space, we can easily estimate the actual runtime of the 10²¹ architectures in the entire search space.
- Also, using the lookup table model makes the latency term in the loss function **differentiable**.
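As a minimal sketch of the lookup-table idea, the snippet below sums per-block latencies and plugs the result into a latency-aware loss. It assumes the multiplicative form CE · α · log(LAT)^β given in the FBNet paper. The block names, the latency values, and the α and β values are placeholders, not measurements or the paper's settings.

```python
import math

# Hypothetical per-block latencies in microseconds, keyed by (layer, block).
# In practice these would come from benchmarking on the target device.
LATENCY_TABLE_US = {
    (0, "k3_e1"): 120.0, (0, "k5_e6"): 410.0, (0, "skip"): 0.0,
    (1, "k3_e1"): 95.0,  (1, "k5_e6"): 330.0, (1, "skip"): 0.0,
}


def lookup_latency(architecture):
    """Sum the tabulated latency of the block chosen at each layer."""
    return sum(LATENCY_TABLE_US[(layer, block)]
               for layer, block in enumerate(architecture))


def latency_aware_loss(cross_entropy, latency_us, alpha, beta):
    """Assumed form: CE * alpha * log(LAT)^beta (requires latency_us > 1)."""
    return cross_entropy * alpha * math.log(latency_us) ** beta


arch = ["k5_e6", "k3_e1"]                       # one block name per layer
lat = lookup_latency(arch)                      # 410 + 95 = 505 microseconds
loss = latency_aware_loss(cross_entropy=1.2, latency_us=lat, alpha=0.2, beta=0.6)
print(lat, round(loss, 4))
```

Because the network latency is just a sum of table entries, a few hundred benchmarked operators are enough to score any architecture in the search space without running it.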

# 3. Differentiable Neural Architecture Search (DNAS)

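- During the inference of the super net, **only one candidate block is sampled** and executed, with the sampling probability of:

$$P_{\theta_l}(b_l = b_{l,i}) = \text{softmax}(\theta_{l,i}; \theta_l) = \frac{\exp(\theta_{l,i})}{\sum_j \exp(\theta_{l,j})}$$

- where $\theta_l$ contains the parameters that determine **the sampling probability of each block at layer-$l$**, and $b_{l,i}$ is the $i$-th candidate block at layer-$l$.
- Equivalently, the output of layer-$l$ can be expressed as:

$$x_{l+1} = \sum_i m_{l,i} \cdot b_{l,i}(x_l)$$

- where $m_{l,i}$ is a random variable in {0, 1} that equals 1 if block $b_{l,i}$ is sampled.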

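- Letting each layer sample independently, **the probability of sampling an architecture $a$** can be described as:

$$P_\theta(a) = \prod_l P_{\theta_l}(b_l = b_{l,i}^{(a)})$$

- where $\theta$ denotes a vector consisting of all the $\theta_{l,i}$.
- The problem is relaxed to optimizing the probability $P_\theta$ of the stochastic super net to achieve the minimum expected loss:

$$\min_\theta \min_{w_a} \mathbb{E}_{a \sim P_\theta}\{\mathcal{L}(a, w_a)\}$$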

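- **The discrete mask variable $m_{l,i}$** is relaxed to be a continuous random variable computed by the Gumbel Softmax function:

$$m_{l,i} = \text{GumbelSoftmax}(\theta_{l,i} \mid \theta_l) = \frac{\exp[(\theta_{l,i} + g_{l,i})/\tau]}{\sum_j \exp[(\theta_{l,j} + g_{l,j})/\tau]}$$

- where $g_{l,i} \sim \text{Gumbel}(0, 1)$ is random noise and $\tau$ is the temperature.
- As $\tau$ approaches 0, it approximates the discrete categorical sampling following the sampling distribution of layer-$l$. As $\tau$ becomes larger, $m_{l,i}$ becomes a continuous random variable.
- And the latency of the architecture $a$ is:

$$\text{LAT}(a) = \sum_l \sum_i m_{l,i} \cdot \text{LAT}(b_{l,i})$$

- Thus, the overall latency of architecture $a$ is differentiable.
- The search process is now equivalent to training the stochastic super net.
- After the super net training finishes, we can then obtain the optimal architectures by sampling from the architecture distribution $P_\theta$.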
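The Gumbel Softmax relaxation and the resulting mask-weighted latency described above can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the standard Gumbel-Softmax form softmax((θ + g) / τ) with g drawn from Gumbel(0, 1), and the θ values and per-block latencies are invented. An actual search would run this inside an autodiff framework so that gradients reach θ.

```python
import math
import random


def gumbel_softmax(theta, tau, rng):
    """Relaxed one-hot sample: softmax((theta_i + g_i) / tau), g_i ~ Gumbel(0, 1)."""
    # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u ~ Uniform(0, 1).
    gumbel = [-math.log(-math.log(rng.uniform(1e-10, 1.0 - 1e-10))) for _ in theta]
    logits = [(t + g) / tau for t, g in zip(theta, gumbel)]
    peak = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_latency(thetas, block_latencies, tau, rng):
    """Sum over layers and blocks of mask value times block latency."""
    total = 0.0
    for theta_l, lat_l in zip(thetas, block_latencies):
        mask = gumbel_softmax(theta_l, tau, rng)
        total += sum(m * lat for m, lat in zip(mask, lat_l))
    return total


rng = random.Random(0)
thetas = [[0.5, 0.1, -0.3], [0.0, 0.2, 0.4]]           # hypothetical theta values
latencies = [[120.0, 410.0, 0.0], [95.0, 330.0, 0.0]]  # hypothetical latencies in us

for tau in (5.0, 1.0, 0.1):
    mask = gumbel_softmax(thetas[0], tau, rng)
    print(tau, [round(m, 3) for m in mask])
print(relaxed_latency(thetas, latencies, tau=1.0, rng=rng))
```

With a high temperature the mask is spread over all blocks, while a low temperature pushes it toward a one-hot choice, which is the behaviour the bullets above describe.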

- In each epoch, the operator weights $w_a$ are trained first, and then the architecture probability parameter $\theta$.
- After the search finishes, several architectures are sampled from the trained distribution $P_\theta$ and trained from scratch.
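The final sampling step can be illustrated with a small sketch: each layer picks a block independently from a softmax over its parameters, and the probability of the whole architecture is the product of the per-layer probabilities. The θ values below are hypothetical stand-ins for parameters obtained after a search, and the helper names are illustrative.

```python
import math
import random


def softmax(theta_l):
    """Turn one layer's parameters into block sampling probabilities."""
    peak = max(theta_l)
    exps = [math.exp(t - peak) for t in theta_l]
    total = sum(exps)
    return [e / total for e in exps]


def sample_architecture(thetas, rng):
    """Pick one block index per layer, each layer sampled independently."""
    return [rng.choices(range(len(theta_l)), weights=softmax(theta_l))[0]
            for theta_l in thetas]


def architecture_probability(arch, thetas):
    """Probability of an architecture as the product of per-layer probabilities."""
    prob = 1.0
    for block_index, theta_l in zip(arch, thetas):
        prob *= softmax(theta_l)[block_index]
    return prob


rng = random.Random(0)
trained_thetas = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]]   # hypothetical values after the search
for _ in range(3):
    arch = sample_architecture(trained_thetas, rng)
    print(arch, round(architecture_probability(arch, trained_thetas), 4))
```

Each sampled list of block indices would then be instantiated as a regular network and trained from scratch.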

- The proposed DNAS algorithm is **orders of magnitude faster than previous RL-based NAS.**

# 4. Experimental Results

## 4.1. ImageNet

- Samsung Galaxy S8 with a Qualcomm Snapdragon 835 platform is targeted.
- $w_a$ is trained on 80% of the ImageNet training set using SGD with momentum. The architecture distribution parameter $\theta$ is trained on the remaining 20% of the ImageNet training set with Adam.
- In the first group, **FBNet-A achieves 73.0% accuracy**, better than 1.0-MobileNetV2 (+1.0%), 1.5-ShuffleNet V2 (+0.4%), and CondenseNet (+2%), and is on par with DARTS and MnasNet-65.
- Regarding latency, **FBNet-A is 1.9 ms (relative 9.6%)**, 2.2 ms (relative 11%), and 8.6 ms (relative 43%) better than the MobileNetV2, ShuffleNet V2, and CondenseNet counterparts.
- **FBNet-A's FLOP count is only 249M**, 50M smaller (relative 20%) than MobileNetV2 and ShuffleNet V2, 20M (relative 8%) smaller than MnasNet, and 2.4× smaller than DARTS.
- In the second group, **FBNet-B achieves comparable accuracy with 1.3-MobileNetV2**, but the latency is 1.46× lower, and the FLOP count is 1.73× smaller, even smaller than 1.0-MobileNetV2 and 1.5-ShuffleNet V2.
- Compared with MnasNet, FBNet-B's accuracy is 0.1% higher, latency is 0.6 ms lower, and FLOP count is 22M (relative 7%) smaller.
- In the third group, **FBNet-C achieves 74.9% accuracy**, same as 2.0-ShuffleNet V2 and better than all others.
- **The latency is 28.1 ms**, 1.33× and 1.19× faster than MobileNetV2 and ShuffleNet V2.
- The FLOP count is 1.56×, 1.58×, and 1.03× smaller than MobileNetV2, ShuffleNet V2, and MnasNet-92.
- Among all the automatically searched models, **FBNet's performance is much stronger than DARTS, PNASNet, and NASNet, and better than MnasNet.** At the same time, **the search cost is orders of magnitude lower.**
- The FBNet search takes 8 GPUs for only 27 hours, so the computational cost is only 216 GPU hours, or 421× faster than MnasNet, 222× faster than NASNet, 27.8× faster than PNASNet, and 1.33× faster than DARTS.
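The search-cost figures above are simple arithmetic, which the short sketch below reproduces. The baseline costs it prints are only what the quoted speedup factors imply, not independently reported numbers.

```python
gpus, hours = 8, 27
fbnet_gpu_hours = gpus * hours                  # 8 * 27 = 216

# Speedup factors quoted in the text (FBNet search cost relative to each method).
speedups = {"MnasNet": 421, "NASNet": 222, "PNASNet": 27.8, "DARTS": 1.33}

# Search cost of each baseline implied by those ratios, in GPU hours.
implied_costs = {name: factor * fbnet_gpu_hours for name, factor in speedups.items()}

print(fbnet_gpu_hours)
for name, cost in implied_costs.items():
    print(f"{name}: ~{cost:,.0f} GPU hours")
```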

## 4.2. Different Resolution and Channel Size Scaling

- The **FBNet-96-0.35-1 model** achieves 50.2% (+4.7%) accuracy and 2.9 ms latency (345 frames per second) on a Samsung Galaxy S8.

- We can see that **many layers are skipped**, and the network is much shallower than FBNet-{A, B, C} because **with smaller input size, the receptive field needed to parse the image also becomes smaller**, so having more layers will not effectively increase the accuracy.

## 4.3. Different Target Devices

- In terms of latency, the model targeted at the iPhone has sub-optimal performance on the Samsung, and vice versa. This demonstrates **the necessity of re-designing ConvNets for different target devices.**

- As shown above, the upper three operators are faster on iPhone X. The lower three operators are significantly faster on Samsung S8, which explains the phenomenon.
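A toy example may help show why per-device operator speed changes which model is preferable. All names and latency values below are invented purely for illustration and are not measurements from the paper. The point is only that summing a different lookup table for each device can reverse the ranking of two models.

```python
# Hypothetical per-operator latencies in milliseconds on two devices.
# The numbers are invented for illustration and are not measurements.
latency_ms = {
    "device_a": {"op_1": 0.8, "op_2": 1.6},
    "device_b": {"op_1": 1.5, "op_2": 0.9},
}


def model_latency(ops, device):
    """Estimate model latency on a device by summing its operator latencies."""
    return sum(latency_ms[device][op] for op in ops)


model_x = ["op_1", "op_1", "op_1"]   # built from the operator that suits device_a
model_y = ["op_2", "op_2", "op_2"]   # built from the operator that suits device_b

for device in latency_ms:
    best = min(("model_x", model_x), ("model_y", model_y),
               key=lambda item: model_latency(item[1], device))
    print(device, "prefers", best[0])
```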

## Reference

[2019 CVPR] [FBNet]

FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search

## Image Classification

- **1989–1998**: [LeNet]
- **2012–2014**: [AlexNet & CaffeNet] [Maxout] [NIN] [ZFNet] [SPPNet]
- **2015**: [VGGNet] [Highway] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2]
- **2016**: [SqueezeNet] [Inception-v3] [ResNet] [Pre-Activation ResNet] [RiR] [Stochastic Depth] [WRN] [Trimps-Soushen]
- **2017**: [Inception-v4] [Xception] [MobileNetV1] [Shake-Shake] [Cutout] [FractalNet] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [IGCNet / IGCV1] [Deep Roots]
- **2018**: [RoR] [DMRNet / DFN-MR] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [ENAS] [PNASNet] [ShuffleNet V2] [BAM] [CBAM] [MorphNet] [NetAdapt]
- **2019**: [ResNet-38] [AmoebaNet] [ESPNetv2] [MnasNet] [Single-Path NAS] [DARTS] [ProxylessNAS] [MobileNetV3] [FBNet]