# [Paper] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Image Classification)

## Compound Scaling on **Depth, Width, and Resolution** outperforms AmoebaNet, PNASNet, NASNet, SENet, DenseNet, Inception-v4, Inception-v3, Inception-v2, Xception, ResNeXt, PolyNet & ResNet

In this story, **EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (EfficientNet)**, by Google Research, Brain Team, is presented. In this paper:

- **Model scaling** is systematically studied to **carefully balance network depth, width, and resolution**, which can lead to better performance.
- **An effective compound coefficient** is proposed to **uniformly scale all dimensions of depth/width/resolution.**
- With neural architecture search (NAS), **EfficientNet** is obtained.

This is a paper in **2019 ICML** with over **1100 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Single Dimension Scaling**
2. **Compound Scaling**
3. **EfficientNet Architecture**
4. **Experimental Results**

# 1. Single Dimension Scaling

## 1.1. (a) Baseline

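- A ConvNet is defined as:

$$\mathcal{N} = \bigodot_{i=1 \ldots s} \mathcal{F}_i^{L_i}\left(X_{\langle H_i, W_i, C_i \rangle}\right)$$

- where $\mathcal{F}_i^{L_i}$ denotes that layer $\mathcal{F}_i$ is repeated $L_i$ times in stage $i$, and $(H_i, W_i, C_i)$ denotes the shape of the input tensor $X$ of layer $i$.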

To expand or shrink the network for different applications/purposes, or to have fair comparison with other networks, model scaling is usually performed.

- **Model scaling tries to expand the network length ($L_i$), width ($C_i$), and/or resolution ($H_i$, $W_i$) without changing $\mathcal{F}_i$ predefined in the baseline network.**
- By fixing $\mathcal{F}_i$, model scaling simplifies the design problem for new resource constraints, but there still remains a large design space to explore: different $L_i$, $C_i$, $H_i$, $W_i$ for each layer.

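- Model scaling can then be formulated as an optimization problem:

$$\begin{aligned}
\max_{d, w, r} \quad & \text{Accuracy}\big(\mathcal{N}(d, w, r)\big) \\
\text{s.t.} \quad & \mathcal{N}(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\left(X_{\langle r \cdot \hat{H}_i,\ r \cdot \hat{W}_i,\ w \cdot \hat{C}_i \rangle}\right) \\
& \text{Memory}(\mathcal{N}) \le \text{target\_memory} \\
& \text{FLOPS}(\mathcal{N}) \le \text{target\_flops}
\end{aligned}$$

- where *w*, *d*, *r* are coefficients for scaling network width, depth, and resolution. The symbols with hats ($\hat{\mathcal{F}}_i$, $\hat{L}_i$, $\hat{H}_i$, $\hat{W}_i$, $\hat{C}_i$) are the predefined parameters in the baseline network.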
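As a rough illustration of what the coefficients *d*, *w*, *r* do to a baseline, the sketch below scales a per-stage description (number of layers, channels, height, width). The `Stage` class, the rounding rules, and the two-stage baseline are assumptions made for this example only; the source does not specify how fractional values are rounded.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage i of the baseline: layer F_i repeated L_i times."""
    layers: int    # L_i, number of repeats of F_i
    channels: int  # C_i, width of the stage
    height: int    # H_i, input height
    width: int     # W_i, input width


def scale_stage(stage: Stage, d: float, w: float, r: float) -> Stage:
    """Scale depth by d, width by w, resolution by r; the layer type F_i is untouched."""
    return Stage(
        layers=math.ceil(d * stage.layers),  # rounding up is an assumption
        channels=round(w * stage.channels),
        height=round(r * stage.height),
        width=round(r * stage.width),
    )


# Hypothetical two-stage baseline, for illustration only.
baseline = [Stage(1, 16, 112, 112), Stage(2, 24, 112, 112)]
scaled = [scale_stage(s, d=2.0, w=1.5, r=1.0) for s in baseline]
print(scaled)
```

Every stage is scaled by the same three ratios, which is what shrinks the design space from per-layer choices down to just *d*, *w*, *r*.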

## 1.2. (b)-(d) Naïve Scaling Dimensions

- **(b) Depth (*d*)**: Scaling network depth is the most common way used by many ConvNets.
- However, the empirical study of scaling a baseline model with different depth coefficients *d* suggests a **diminishing accuracy return for very deep ConvNets.**
- **(c) Width (*w*)**: **Scaling network width is commonly used for small size models.**
- Wider networks tend to be able to capture more fine-grained features and are easier to train.
- However, **extremely wide but shallow networks tend to have difficulties in capturing higher level features.**
- **(d) Resolution (*r*)**: With higher resolution input images, ConvNets can potentially capture more fine-grained patterns.
- Higher resolutions improve accuracy, but **the accuracy gain diminishes for very high resolutions.**

# 2. Compound Scaling

- Intuitively, for higher resolution images, network depth should be increased, so that the larger receptive fields can help capture similar features that include more pixels in bigger images.
- Correspondingly, we should also increase network width when resolution is higher in order to capture more fine-grained patterns.
- **(e) Compound Scaling: We need to coordinate and balance different scaling dimensions rather than conventional single-dimension scaling.**

- An example is shown above.
- If we only scale network width *w* without changing depth (*d*=1.0) and resolution (*r*=1.0), the accuracy saturates quickly.
- **With deeper (*d*=2.0) and higher resolution (*r*=2.0) networks, width scaling achieves much better accuracy under the same FLOPS cost.**

It is critical to balance all dimensions of network width, depth, and resolution.

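**A compound coefficient *Φ* is used to uniformly scale network width, depth, and resolution in a principled way:**

$$\begin{aligned}
\text{depth:} \quad & d = \alpha^{\Phi} \\
\text{width:} \quad & w = \beta^{\Phi} \\
\text{resolution:} \quad & r = \gamma^{\Phi} \\
\text{s.t.} \quad & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}$$

- where *α*, *β*, *γ* are constants that can be determined by a small grid search.
- Intuitively, *Φ* is a user-specified coefficient that controls how many more resources are available for model scaling, while *α*, *β*, *γ* specify how to assign these extra resources to network depth, width, and resolution respectively.
- Notably, the FLOPS of a regular convolution op is proportional to *d*, *w*², *r*².
- In this paper, the constraint *α*×*β²*×*γ²*≈2 is imposed. Scaling with *Φ* multiplies the total FLOPS by (*α*×*β²*×*γ²*)^*Φ*, so for any new *Φ* the total FLOPS will approximately increase by 2^*Φ*.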
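The proportionality of FLOPS to *d*, *w*², *r*² can be checked on a toy network. The sketch below is only an illustration: it assumes a stack of identical stride-1 regular convolutions with equal input and output channels, and the names `conv_flops` and `network_flops` are made up for this example.

```python
def conv_flops(height: int, width: int, c_in: int, c_out: int, kernel: int = 3) -> int:
    """Multiply-accumulates of one stride-1, same-padded regular convolution."""
    return height * width * c_in * c_out * kernel * kernel


def network_flops(depth: int, channels: int, resolution: int) -> int:
    """FLOPS of a toy network made of `depth` identical conv layers."""
    return depth * conv_flops(resolution, resolution, channels, channels)


base = network_flops(depth=4, channels=32, resolution=64)
print(network_flops(8, 32, 64) / base)   # depth x2      -> 2.0
print(network_flops(4, 64, 64) / base)   # width x2      -> 4.0
print(network_flops(4, 32, 128) / base)  # resolution x2 -> 4.0
```

Doubling depth doubles the cost, while doubling width or resolution quadruples it, which is why *β* and *γ* appear squared in the constraint.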

# 3. EfficientNet Architecture

## 3.1. **EfficientNet-B0**

- The same search space as **MnasNet** is used. Its main building block is the mobile inverted bottleneck **MBConv**, and the **Squeeze and Excitation Module (SE Module)**, originated in SENet, is also used.
- ACC(*m*)×[FLOPS(*m*)/*T*]^*w* is used as the optimization goal, where ACC(*m*) and FLOPS(*m*) denote the accuracy and FLOPS of model *m*, *T* is the target FLOPS, and *w*=-0.07 is a hyperparameter for controlling the trade-off between accuracy and FLOPS.
- But unlike MnasNet, FLOPS rather than latency is optimized.
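The optimization goal above is easy to evaluate by hand. The following is a minimal sketch of the formula only, not of the search procedure; the function name `nas_reward` and the accuracy values are hypothetical.

```python
def nas_reward(accuracy: float, flops: float, target_flops: float, w: float = -0.07) -> float:
    """ACC(m) * [FLOPS(m) / T]^w; with w < 0, models above the FLOPS target are penalised."""
    return accuracy * (flops / target_flops) ** w


T = 400e6  # the 400M FLOPS target quoted for EfficientNet-B0

# Hypothetical candidates with the same accuracy but different cost.
on_target = nas_reward(accuracy=0.76, flops=400e6, target_flops=T)
too_big = nas_reward(accuracy=0.76, flops=800e6, target_flops=T)
too_small = nas_reward(accuracy=0.76, flops=200e6, target_flops=T)
print(on_target, too_big, too_small)
```

Because *w* is small in magnitude, a model twice over the target loses only a few percent of its reward, so accuracy still dominates the trade-off.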

- The architecture found is called **EfficientNet-B0**, which is similar to the one found in MnasNet.
- EfficientNet-B0 is slightly bigger due to the larger FLOPS target (FLOPS target is 400M).

## 3.2. Compound Scaling on EfficientNet-B0

- There are two steps to perform compound scaling.
- **STEP 1**: First **fix *Φ*=1**, assuming twice as many resources are available, and **do a small grid search of *α*, *β*, *γ***.
- In particular, the best values found for EfficientNet-B0 are *α*=1.2, *β*=1.1, *γ*=1.15, under the constraint of *α*×*β²*×*γ²*≈2.
- **STEP 2**: Fix *α*, *β*, *γ* as constants and scale up the baseline network with different *Φ* using the equation in Section 2, to obtain **EfficientNet-B1 to B7**.
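STEP 2 can be sketched numerically with the coefficients found in STEP 1. This is only an illustration of the scaling rule: the source does not give the exact *Φ* used for each of B1 to B7 or how the multipliers are rounded into layer counts and image sizes, so the loop below simply uses integer *Φ* values as an assumption.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the grid search


def compound_scale(phi: float) -> tuple[float, float, float]:
    """Return the (depth, width, resolution) multipliers d, w, r for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi: float) -> float:
    """FLOPS grow with d * w^2 * r^2, i.e. (alpha * beta^2 * gamma^2)^phi."""
    d, w, r = compound_scale(phi)
    return d * w ** 2 * r ** 2


# Integer phi values are an assumption made for illustration.
for phi in range(0, 4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: d={d:.2f} w={w:.2f} r={r:.2f} flops x{flops_multiplier(phi):.2f}")
```

With these values, *α*×*β²*×*γ²* comes out at roughly 1.92, so each unit increase of *Φ* slightly less than doubles the FLOPS.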

# 4. Experimental Results

## 4.1. Scaling up EfficientNet-B0

- All scaling methods improve accuracy at the cost of more FLOPS, but the compound scaling method improves accuracy further, by up to 2.5% over the single-dimension scaling methods.

## 4.2. Scaling Up MobileNets and ResNets

**Compared to other single-dimension scaling methods, the proposed compound scaling method improves the accuracy on all these models**, suggesting the effectiveness of the proposed scaling method for general existing ConvNets: MobileNetV1, MobileNetV2 and ResNet.

## 4.3. ImageNet Results for EfficientNet

- As bigger models need more regularization, Dropout ratio is linearly increased from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.
- Swish activation, fixed AutoAugment policy, and Stochastic Depth are also used.
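Two of the training details above are simple enough to write down. The sketch below assumes the dropout ratio is interpolated evenly over the model index from B0 to B7, which the source does not state explicitly, and uses the common form of Swish, x·sigmoid(x). The function names are illustrative and the code is not numerically hardened for very large negative inputs.

```python
import math


def dropout_rate(index: int, first: float = 0.2, last: float = 0.5, n_models: int = 8) -> float:
    """Linearly interpolate the dropout ratio from B0 (index 0) to B7 (index 7)."""
    return first + (last - first) * index / (n_models - 1)


def swish(x: float) -> float:
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for i in range(8):
    print(f"EfficientNet-B{i}: dropout={dropout_rate(i):.3f}")
print(swish(0.0), swish(1.0))
```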

EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy.

**EfficientNet-B7** achieves **84.4% top-1 / 97.1% top-5 accuracy** with **66M parameters** and **37B FLOPS**, being more accurate but **8.4× smaller than the previous best GPipe.**

- The above figure shows FLOPS vs. ImageNet Accuracy.
- The figure at the top of the story shows Model Size vs. ImageNet Accuracy.

The EfficientNet models are not only small, but also computationally cheaper.

- EfficientNet-B3 achieves higher accuracy than ResNeXt-101 using 18× fewer FLOPS.

- Latency is measured with batch size 1 on a single core of an Intel Xeon CPU E5-2690.
- **EfficientNet-B1 runs 5.7× faster than the widely used ResNet-152, while EfficientNet-B7 runs about 6.1× faster than GPipe**.
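A latency measurement of this kind can be sketched with a small timing harness. The warm-up count, the number of runs, and the `dummy_inference` workload are assumptions for illustration; a real measurement would call the model on a single image (batch size 1) with the process pinned to one CPU core.

```python
import time
from typing import Callable


def measure_latency(run_once: Callable[[], object], warmup: int = 3, runs: int = 20) -> float:
    """Mean wall-clock seconds per call of `run_once`, after a few warm-up calls."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    return (time.perf_counter() - start) / runs


# Stand-in workload; replace with a single-image forward pass of the model.
def dummy_inference() -> int:
    return sum(i * i for i in range(10_000))


latency = measure_latency(dummy_inference)
print(f"{latency * 1e3:.3f} ms per call")
```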

## 4.4. Transfer Learning Results for EfficientNet

- ImageNet pretrained and finetuned on new datasets.

- Compared with NASNet-A and Inception-v4, EfficientNet models achieve better accuracy with 4.7× average (up to 21×) parameter reduction.

- EfficientNets consistently achieve better accuracy with an order of magnitude fewer parameters than existing models, including ResNet, DenseNet, Inception-v4, and NASNet.

## 4.5. Class Activation Map (CAM)

- As shown above, **the model with compound scaling tends to focus on more relevant regions with more object details**, while other models either lack object details or are unable to capture all objects in the image.

## References

[2019 ICML] [EfficientNet]

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

[Google Blog] https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html

## Image Classification

- **1989–1998**: [LeNet]
- **2012–2014**: [AlexNet & CaffeNet] [Maxout] [NIN] [ZFNet] [SPPNet]
- **2015**: [VGGNet] [Highway] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2]
- **2016**: [SqueezeNet] [Inception-v3] [ResNet] [Pre-Activation ResNet] [RiR] [Stochastic Depth] [WRN] [Trimps-Soushen]
- **2017**: [Inception-v4] [Xception] [MobileNetV1] [Shake-Shake] [Cutout] [FractalNet] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [IGCNet / IGCV1] [Deep Roots]
- **2018**: [RoR] [DMRNet / DFN-MR] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2] [CondenseNet] [IGCV2] [IGCV3] [FishNet] [SqueezeNext] [ENAS] [PNASNet] [ShuffleNet V2] [BAM] [CBAM] [MorphNet] [NetAdapt] [mixup] [DropBlock]
- **2019**: [ResNet-38] [AmoebaNet] [ESPNetv2] [MnasNet] [Single-Path NAS] [DARTS] [ProxylessNAS] [MobileNetV3] [FBNet] [ShakeDrop] [CutMix] [MixConv] [EfficientNet]
- **2020**: [Random Erasing (RE)]