[Paper] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Image Classification)

Compound Scaling on Depth, Width, and Resolution, Outperforming AmoebaNet, PNASNet, NASNet, SENet, DenseNet, Inception-v4, Inception-v3, Inception-v2, Xception, ResNeXt, PolyNet & ResNet

Sik-Ho Tsang
7 min read · Nov 29, 2020
Model Size vs. ImageNet Accuracy.

In this story, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (EfficientNet), by Google Research, Brain Team, is presented. In this paper:

  • Model scaling is systematically studied to carefully balance network depth, width, and resolution that can lead to better performance.
  • An effective compound coefficient is proposed to uniformly scale all dimensions of depth/width/resolution.
  • With neural architecture search (NAS), EfficientNet is obtained.

This is a paper in 2019 ICML with over 1100 citations. (Sik-Ho Tsang @ Medium)


  1. Single Dimension Scaling
  2. Compound Scaling
  3. EfficientNet Architecture
  4. Experimental Results

1. Single Dimension Scaling

(a) baseline (b)-(d) Single Dimension Scaling (e) Compound Scaling

1.1. (a) Baseline

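  • A ConvNet is defined as:
$$\mathcal{N} = \bigodot_{i=1 \ldots s} \mathcal{F}_i^{L_i}\left(X_{\langle H_i, W_i, C_i \rangle}\right)$$
  • where $\mathcal{F}_i^{L_i}$ denotes that layer $\mathcal{F}_i$ is repeated $L_i$ times in stage i, and $(H_i, W_i, C_i)$ denotes the shape of the input tensor X of layer i.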

To expand or shrink the network for different applications/purposes, or to have fair comparison with other networks, model scaling is usually performed.

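  • Model scaling tries to expand the network length ($L_i$), width ($C_i$), and/or resolution ($H_i$, $W_i$) without changing $\mathcal{F}_i$ predefined in the baseline network.
  • By fixing $\mathcal{F}_i$, model scaling simplifies the design problem for new resource constraints, but a large design space still remains to explore different $L_i$, $C_i$, $H_i$, $W_i$ for each layer.
  • To further reduce the design space, all layers are scaled uniformly with a constant ratio, and the goal is to maximize accuracy under the given resource constraints:
$$
\begin{aligned}
\max_{d, w, r} \quad & \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \\
\text{s.t.} \quad & \mathcal{N}(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\, d \cdot \hat{L}_i}\left(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\right) \\
& \mathrm{Memory}(\mathcal{N}) \le \text{target\_memory} \\
& \mathrm{FLOPS}(\mathcal{N}) \le \text{target\_flops}
\end{aligned}
$$
  • where w, d, r are coefficients for scaling network width, depth, and resolution. The symbols with hats are the predefined parameters in the baseline network.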
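The formulation above can be illustrated with a minimal sketch that applies d, w, r to one baseline stage. The `Stage` class, the `scale_stage` helper, and the rounding choices (ceiling for depth, nearest integer for the rest) are assumptions made for this illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of a baseline ConvNet (hypothetical container)."""
    layers: int    # baseline number of repeated layers
    height: int    # baseline input height
    width: int     # baseline input width
    channels: int  # baseline number of channels


def scale_stage(stage: Stage, d: float, w: float, r: float) -> Stage:
    """Scale depth by d, channels by w, and spatial size by r.

    The layer operator itself is left unchanged; only the layer count
    and the tensor shape are scaled.
    """
    return Stage(
        layers=math.ceil(d * stage.layers),   # assumption: round depth up
        height=round(r * stage.height),
        width=round(r * stage.width),
        channels=round(w * stage.channels),
    )


if __name__ == "__main__":
    baseline = Stage(layers=2, height=112, width=112, channels=24)
    print(scale_stage(baseline, d=2.0, w=1.5, r=1.0))
```

Setting two of the three coefficients to 1.0 gives the single-dimension scaling cases, which is why the same helper covers both.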

1.2. (b)-(d) Naïve Scaling Dimensions

  • (b) - Depth (d): Scaling network depth is the most common way used by many ConvNets.
  • However, scaling a baseline model with different depth coefficients d shows diminishing accuracy returns for very deep ConvNets.
  • (c) - Width (w): Scaling network width is commonly used for small size models.
  • Wider networks tend to be able to capture more fine-grained features and are easier to train.
  • However, extremely wide but shallow networks tend to have difficulties in capturing higher level features.
  • (d) - Resolution (r): With higher resolution input images, ConvNets can potentially capture more fine-grained patterns.
  • Higher resolutions improve accuracy, but the accuracy gain diminishes for very high resolutions.

2. Compound Scaling

  • Intuitively, for higher resolution images, increasing network depth gives larger receptive fields, which help capture similar features that span more pixels in bigger images.
  • Correspondingly, we should also increase network width when resolution is higher in order to capture more fine-grained patterns.
  • (e) - Compound Scaling: We need to coordinate and balance different scaling dimensions rather than conventional single-dimension scaling.
Scaling Network Width for Different Baseline Networks.
  • An example is as shown above.
  • If we only scale network width w without changing depth (d=1.0) and resolution (r=1.0), the accuracy saturates quickly.
  • With deeper (d=2.0) and higher resolution (r=2.0), width scaling achieves much better accuracy under the same FLOPS cost.

It is critical to balance all dimensions of network width, depth, and resolution.

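  • A compound coefficient Φ is used to uniformly scale network width, depth, and resolution in a principled way:
$$
\begin{aligned}
\text{depth: } & d = \alpha^{\Phi} \\
\text{width: } & w = \beta^{\Phi} \\
\text{resolution: } & r = \gamma^{\Phi} \\
\text{s.t. } & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1
\end{aligned}
$$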
  • where α, β, γ are constants that can be determined by a small grid search.
  • Intuitively, Φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, γ specify how to assign these extra resources to network depth, width, and resolution respectively.
  • Notably, the FLOPS of a regular convolution op is proportional to d, w², r².
  • In this paper, the constraint α×β²×γ²≈2 is imposed, so that for any new Φ the total FLOPS increase by approximately (α×β²×γ²)^Φ ≈ 2^Φ.
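As a rough check of the FLOPS argument, the sketch below computes d, w, r for a given Φ and the resulting FLOPS multiplier d·w²·r². The function name `compound_coefficients` is hypothetical, and the default α, β, γ are the EfficientNet-B0 values reported in Section 3.2.

```python
def compound_coefficients(phi: float,
                          alpha: float = 1.2,
                          beta: float = 1.1,
                          gamma: float = 1.15):
    """Return (d, w, r, flops_multiplier) for compound coefficient phi.

    alpha scales depth, beta scales width, gamma scales resolution.
    FLOPS of a regular convolution grow with d * w^2 * r^2.
    """
    d = alpha ** phi
    w = beta ** phi
    r = gamma ** phi
    flops_multiplier = d * w ** 2 * r ** 2
    return d, w, r, flops_multiplier


if __name__ == "__main__":
    for phi in range(4):
        d, w, r, m = compound_coefficients(phi)
        print(f"phi={phi}: d={d:.3f} w={w:.3f} r={r:.3f} "
              f"FLOPS x{m:.2f} (2^phi = {2 ** phi})")
```

With these values α×β²×γ² is about 1.92 rather than exactly 2, so the multiplier tracks 2^Φ only approximately, as the text states.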

3. EfficientNet Architecture

3.1. EfficientNet-B0

  • The neural architecture search (NAS) approach of MnasNet is used to find the baseline network.
  • The same search space is used. Its main building block is the mobile inverted bottleneck MBConv, and the Squeeze-and-Excitation module (SE module), which originated in SENet, is also used.
  • ACC(m)×[FLOPS(m)/T]^w is used as the optimization goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS, and w=-0.07 is a hyperparameter controlling the trade-off between accuracy and FLOPS.
  • Unlike MnasNet, FLOPS rather than latency is optimized.
EfficientNet-B0 baseline network
  • The architecture found is called EfficientNet-B0, which is similar to the one found in MnasNet.
  • EfficientNet-B0 is slightly bigger due to the larger FLOPS target (FLOPS target is 400M).
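The search objective ACC(m)×[FLOPS(m)/T]^w can be written as a small function to see how the exponent w=-0.07 penalizes models above the FLOPS target. This is a sketch only: the function name `nas_reward` and the accuracy and FLOPS numbers in the usage lines are hypothetical, while T=400M and w=-0.07 come from the text.

```python
def nas_reward(accuracy: float,
               flops: float,
               target_flops: float = 400e6,
               w: float = -0.07) -> float:
    """Multi-objective reward: accuracy weighted by a FLOPS penalty.

    With w < 0, a model above the FLOPS target is penalized and a model
    below the target gets a small bonus.
    """
    return accuracy * (flops / target_flops) ** w


if __name__ == "__main__":
    # Hypothetical candidates with equal accuracy and different cost.
    for flops in (200e6, 400e6, 800e6):
        print(f"{flops / 1e6:.0f}M FLOPS -> reward {nas_reward(0.77, flops):.4f}")
```

Because the exponent is small in magnitude, doubling FLOPS lowers the reward by only about 5%, so accuracy remains the dominant term.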

3.2. Compound Scaling on EfficientNet-B0

  • Two steps to perform compound scaling.
  • STEP 1: First fix Φ=1, assuming twice as many resources are available, and do a small grid search of α, β, γ based on the two equations in Section 2.
  • In particular, the best values found for EfficientNet-B0 are α=1.2, β=1.1, γ=1.15, under the constraint α×β²×γ²≈2.
  • STEP 2: Fix α, β, γ as constants and scale up the baseline network with different Φ using the equation in Section 2, to obtain EfficientNet-B1 to B7.
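STEP 1 can be sketched as enumerating a small grid and keeping only the triples that satisfy the FLOPS constraint. The grid step, upper bound, and tolerance below are assumptions chosen for illustration, and the function name `candidate_coefficients` is hypothetical. The actual search would also train a scaled model for each candidate and keep the most accurate one, which is not shown here.

```python
import itertools


def candidate_coefficients(step: float = 0.05,
                           upper: float = 1.5,
                           target: float = 2.0,
                           tol: float = 0.1):
    """Yield (alpha, beta, gamma) with alpha * beta^2 * gamma^2 close to target.

    Every value is >= 1, matching the constraint on the coefficients.
    """
    n = int(round((upper - 1.0) / step)) + 1
    grid = [round(1.0 + i * step, 2) for i in range(n)]
    for alpha, beta, gamma in itertools.product(grid, repeat=3):
        if abs(alpha * beta ** 2 * gamma ** 2 - target) <= tol:
            yield alpha, beta, gamma


if __name__ == "__main__":
    candidates = list(candidate_coefficients())
    print(len(candidates), "candidates satisfy the constraint")
    print("reported values included:", (1.2, 1.1, 1.15) in candidates)
```

The reported α=1.2, β=1.1, γ=1.15 give a product of about 1.92, which falls inside the tolerance assumed here.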

4. Experimental Results

4.1. Scaling Up EfficientNet-B0

Scaling Up EfficientNet-B0 with Different Methods
Scaled Models
  • All scaling methods improve accuracy at the cost of more FLOPS, but the compound scaling method improves accuracy further, by up to 2.5% over the single-dimension scaling methods.

4.2. Scaling Up MobileNets and ResNets

Scaling Up MobileNets and ResNet
  • Compared to other single-dimension scaling methods, the proposed compound scaling method improves the accuracy on all these models, suggesting the effectiveness of the proposed scaling method for general existing ConvNets: MobileNetV1, MobileNetV2 and ResNet.

4.3. ImageNet Results for EfficientNet

EfficientNet Performance Results on ImageNet
  • As bigger models need more regularization, Dropout ratio is linearly increased from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.
  • Swish activation, fixed AutoAugment policy, and Stochastic Depth are also used.
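The linear dropout increase can be sketched as a simple interpolation between the two endpoints given above. This assumes evenly spaced steps across the eight variants B0 to B7, which the text does not spell out, so the intermediate values are illustrative only. The function name `dropout_rate` is hypothetical.

```python
def dropout_rate(variant: int,
                 first: float = 0.2,
                 last: float = 0.5,
                 num_variants: int = 8) -> float:
    """Linearly interpolate the dropout ratio for EfficientNet-B<variant>.

    variant=0 maps to `first` and variant=num_variants-1 maps to `last`.
    """
    if not 0 <= variant < num_variants:
        raise ValueError(f"variant must be in [0, {num_variants - 1}]")
    return first + (last - first) * variant / (num_variants - 1)


if __name__ == "__main__":
    for b in range(8):
        print(f"EfficientNet-B{b}: dropout ~ {dropout_rate(b):.3f}")
```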

EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy.

  • EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate but 8.4× smaller than the previous best, GPipe.
FLOPS vs. ImageNet Accuracy
  • The above figure shows FLOPS vs. ImageNet Accuracy.
  • The figure at the top of the story shows Model Size vs. ImageNet Accuracy.

The EfficientNet models are not only small, but also computationally cheaper.

  • EfficientNet-B3 achieves higher accuracy than ResNeXt-101 using 18× fewer FLOPS.
Inference Latency Comparison
  • Latency is measured with batch size 1 on a single core of an Intel Xeon CPU E5-2690.
  • EfficientNet-B1 runs 5.7× faster than the widely used ResNet-152, while EfficientNet-B7 runs about 6.1× faster than GPipe.

4.4. Transfer Learning Results for EfficientNet

Transfer Learning Datasets
  • The models are pretrained on ImageNet and fine-tuned on new datasets.
EfficientNet Performance Results on Transfer Learning Datasets
  • Compared with NASNet-A and Inception-v4, EfficientNet models achieve better accuracy with 4.7× average (up to 21×) parameter reduction.
Model Parameters vs. Transfer Learning Accuracy
  • EfficientNets consistently achieve better accuracy with an order of magnitude fewer parameters than existing models, including ResNet, DenseNet, Inception-v4, and NASNet.

4.5. Class Activation Map (CAM)

Class Activation Map (CAM) for Models with different scaling methods
  • As shown above, the model with compound scaling tends to focus on more relevant regions with more object details, while the other models either lack object details or are unable to capture all objects in the image.


