[Paper] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Image Classification)

[Figure: Model Size vs. ImageNet Accuracy]

In this story, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (EfficientNet), by Google Research, Brain Team, is presented. In this paper:

  • Model scaling is systematically studied, showing that carefully balancing network depth, width, and resolution can lead to better performance.
  • An effective compound coefficient is proposed to uniformly scale all dimensions of depth/width/resolution.
  • A baseline network is found with neural architecture search (NAS) and then scaled up to obtain the EfficientNet family.

This is a paper in 2019 ICML with over 1100 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Single Dimension Scaling
  2. Compound Scaling
  3. EfficientNet Architecture
  4. Experimental Results

1. Single Dimension Scaling

[Figure: Model scaling. (a) Baseline network; (b)-(d) single-dimension scaling; (e) compound scaling]

1.1. (a) Baseline

  • A ConvNet is defined as:
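$$\mathcal{N} = \bigodot_{i=1 \dots s} \mathcal{F}_i^{L_i}\left(X_{\langle H_i, W_i, C_i \rangle}\right)$$
  • where N is the ConvNet, Fi^Li denotes that layer Fi is repeated Li times in stage i, and (Hi, Wi, Ci) denotes the shape of the input tensor X of layer i.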

To expand or shrink the network for different applications/purposes, or to have fair comparison with other networks, model scaling is usually performed.

  • Model scaling tries to expand the network length (Li), width (Ci), and/or resolution (Hi, Wi) without changing Fi predefined in the baseline network.
  • By fixing Fi, model scaling simplifies the design problem for new resource constraints, but it still remains a large design space to explore different Li, Ci, Hi, Wi for each layer.
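  • To shrink this design space further, all layers are scaled uniformly with constant ratios, and the goal is to maximize accuracy under given resource constraints:
$$\begin{aligned}
\max_{d, w, r} \quad & \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \\
\text{s.t.} \quad & \mathcal{N}(d, w, r) = \bigodot_{i=1 \dots s} \hat{\mathcal{F}}_i^{\, d \cdot \hat{L}_i}\left(X_{\langle r \cdot \hat{H}_i,\ r \cdot \hat{W}_i,\ w \cdot \hat{C}_i \rangle}\right) \\
& \mathrm{Memory}(\mathcal{N}) \le \mathrm{target\_memory} \\
& \mathrm{FLOPS}(\mathcal{N}) \le \mathrm{target\_flops}
\end{aligned}$$
  • where w, d, r are coefficients for scaling network width, depth, and resolution. The symbols with hats are the predefined parameters of the baseline network.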
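
As a rough illustration of how the coefficients d, w, r act on the baseline parameters, here is a minimal Python sketch. The Stage class, the scale_stage helper, the rounding rule, and the two-stage baseline are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One stage of the baseline network (the 'hatted' values)."""
    layers: int      # L_i: how many times the layer F_i is repeated
    channels: int    # C_i: width of the stage
    resolution: int  # H_i = W_i: spatial size of the input tensor


def scale_stage(stage: Stage, d: float, w: float, r: float) -> Stage:
    """Apply depth (d), width (w), and resolution (r) coefficients to one stage."""
    return Stage(
        layers=math.ceil(d * stage.layers),      # rounding up is an assumption
        channels=round(w * stage.channels),      # plain rounding is an assumption
        resolution=round(r * stage.resolution),
    )


# Hypothetical two-stage baseline, for illustration only.
baseline = [
    Stage(layers=2, channels=24, resolution=112),
    Stage(layers=3, channels=80, resolution=28),
]

# The layer type F_i is untouched; only L_i, C_i, H_i, W_i change.
scaled = [scale_stage(s, d=2.0, w=1.5, r=1.25) for s in baseline]
print(scaled)
```

With these example coefficients, the first stage grows from 2 layers, 24 channels, and 112×112 input to 4 layers, 36 channels, and 140×140 input.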

1.2. (b)-(d) Naïve Scaling Dimensions

  • (b) - Depth (d): Scaling network depth is the most common way used by many ConvNets.
  • However, scaling a baseline model with different depth coefficients d shows a diminishing accuracy return for very deep ConvNets.
  • (c) - Width (w): Scaling network width is commonly used for small-size models.
  • Wider networks tend to be able to capture more fine-grained features and are easier to train.
  • However, extremely wide but shallow networks tend to have difficulties in capturing higher level features.
  • (d) - Resolution (r): With higher resolution input images, ConvNets can potentially capture more fine-grained patterns.
  • Higher resolutions improve accuracy, but the accuracy gain diminishes for very high resolutions.

2. Compound Scaling

  • Intuitively, for higher-resolution images, network depth should be increased so that the larger receptive fields can capture similar features that span more pixels in bigger images.
  • Correspondingly, we should also increase network width when resolution is higher in order to capture more fine-grained patterns.
  • (e) - Compound Scaling: We need to coordinate and balance different scaling dimensions rather than conventional single-dimension scaling.
[Figure: Accuracy vs. FLOPS when scaling network width w under different depth (d) and resolution (r) settings]
  • An example is shown above.
  • If we only scale network width w without changing depth (d=1.0) and resolution (r=1.0), the accuracy saturates quickly.
  • With deeper (d=2.0) and higher resolution (r=2.0), width scaling achieves much better accuracy under the same FLOPS cost.

It is critical to balance all dimensions of network width, depth, and resolution.

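  • A compound coefficient Φ is used to uniformly scale network width, depth, and resolution in a principled way:
$$\begin{aligned}
\text{depth:} \quad & d = \alpha^{\Phi} \\
\text{width:} \quad & w = \beta^{\Phi} \\
\text{resolution:} \quad & r = \gamma^{\Phi} \\
\text{s.t.} \quad & \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}$$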
  • where α, β, γ are constants that can be determined by a small grid search.
  • Intuitively, Φ is a user-specified coefficient that controls how many more resources are available for model scaling, while α, β, γ specify how to assign these extra resources to network depth, width, and resolution respectively.
  • Notably, the FLOPS of a regular convolution op is proportional to d, w², r²: doubling the depth doubles the FLOPS, while doubling the width or resolution quadruples them.
  • In this paper, the constraint α×β²×γ²≈2 is imposed, so that for any new Φ the total FLOPS increase by approximately (α×β²×γ²)^Φ ≈ 2^Φ.
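
The d, w², r² proportionality can be checked on a toy model. The sketch below is only an illustration: conv_flops and network_flops are hypothetical helpers, and the "network" is assumed to be a stack of identical stride-1 convolutions with equal input and output channels, which is far simpler than a real ConvNet.

```python
def conv_flops(height: int, width: int, c_in: int, c_out: int, kernel: int = 3) -> int:
    """Multiply-accumulate count of one stride-1, same-padded k x k convolution."""
    return height * width * c_in * c_out * kernel * kernel


def network_flops(layers: int, channels: int, resolution: int) -> int:
    """FLOPS of a toy network: `layers` identical convs with equal in/out channels."""
    return layers * conv_flops(resolution, resolution, channels, channels)


base = network_flops(layers=4, channels=32, resolution=64)

print(network_flops(8, 32, 64) / base)   # depth x2      -> 2.0
print(network_flops(4, 64, 64) / base)   # width x2      -> 4.0
print(network_flops(4, 32, 128) / base)  # resolution x2 -> 4.0
```

Width enters twice (input and output channels) and resolution enters twice (height and width), which is why both appear squared in the constraint.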

3. EfficientNet Architecture

3.1. EfficientNet-B0

  • The neural architecture search (NAS) approach of MnasNet is used to find the baseline network.
  • The same search space is used. Its main building block is the mobile inverted bottleneck MBConv, to which the Squeeze-and-Excitation module (SE module), originating from SENet, is added.
  • ACC(m)×[FLOPS(m)/T]^w is used as the optimization goal, where ACC(m) and FLOPS(m) denote the accuracy and FLOPS of model m, T is the target FLOPS, and w=-0.07 is a hyperparameter controlling the trade-off between accuracy and FLOPS.
  • But unlike MnasNet, FLOPS rather than latency is optimized.
  • The architecture found is called EfficientNet-B0, which is similar to the one found by MnasNet.
  • EfficientNet-B0 is slightly bigger due to the larger FLOPS target (FLOPS target is 400M).
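
To get a feel for the optimization goal ACC(m)×[FLOPS(m)/T]^w used in the search, here is a small sketch. The function name nas_objective and the two candidate models are hypothetical; only the formula, T=400M, and w=-0.07 come from the text above.

```python
def nas_objective(accuracy: float, flops: float,
                  target_flops: float = 400e6, w: float = -0.07) -> float:
    """ACC(m) * [FLOPS(m) / T]^w, the accuracy/FLOPS trade-off described above."""
    return accuracy * (flops / target_flops) ** w


# Hypothetical candidates with the same accuracy but different cost.
on_target = nas_objective(accuracy=0.76, flops=400e6)
too_big = nas_objective(accuracy=0.76, flops=800e6)

print(on_target)            # equals the accuracy, since (T / T)^w = 1
print(too_big < on_target)  # True: the same accuracy at 2x FLOPS scores lower
```

Because w is small and negative, exceeding the FLOPS target is penalized only mildly, so a candidate can still win if its accuracy gain is large enough.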

3.2. Compound Scaling on EfficientNet-B0

  • Compound scaling is performed in two steps.
  • STEP 1: First fix Φ=1, assuming twice as many resources are available, and do a small grid search of α, β, γ based on the two equations above (the scaling objective in Section 1.1 and the compound scaling rule in Section 2).
  • In particular, the best values found for EfficientNet-B0 are α=1.2, β=1.1, γ=1.15, under the constraint α×β²×γ²≈2.
  • STEP 2: Fix α, β, γ as constants and scale up the baseline network with different Φ using the equation in Section 2, to obtain EfficientNet-B1 to B7.

4. Experimental Results

4.1. Scaling up EfficientNet-B0

  • All scaling methods improve accuracy at the cost of more FLOPS, but the compound scaling method improves accuracy further, by up to 2.5%, compared with single-dimension scaling methods.

4.2. Scaling Up MobileNets and ResNets

  • Compared to other single-dimension scaling methods, the proposed compound scaling method improves the accuracy on all these models, suggesting the effectiveness of the proposed scaling method for general existing ConvNets: MobileNetV1, MobileNetV2 and ResNet.

4.3. ImageNet Results for EfficientNet

  • As bigger models need more regularization, Dropout ratio is linearly increased from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.
  • Swish activation, fixed AutoAugment policy, and Stochastic Depth are also used.
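
Two of the training details above are easy to write down. The sketch below shows the Swish activation, x·sigmoid(x), and a literal linear interpolation of the dropout ratio from B0 to B7. The helper names are hypothetical, and the per-model dropout values used in released implementations may be rounded differently from this straight-line reading.

```python
import math


def swish(x: float) -> float:
    """Swish activation: x * sigmoid(x). Fine for moderate x; exp overflows below about -709."""
    return x / (1.0 + math.exp(-x))


def dropout_rate(model_index: int, first: float = 0.2, last: float = 0.5,
                 num_models: int = 8) -> float:
    """Linearly interpolate the dropout ratio from B0 (`first`) to B7 (`last`)."""
    return first + (last - first) * model_index / (num_models - 1)


for i in range(8):
    print(f"EfficientNet-B{i}: dropout = {dropout_rate(i):.3f}")

print(swish(0.0), swish(1.0), swish(-1.0))
```

Unlike ReLU, Swish is smooth and lets small negative values through, so swish(-1.0) is slightly below zero rather than exactly zero.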

EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy.

  • EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate but 8.4× smaller than the previous best, GPipe.
[Figure: FLOPS vs. ImageNet Accuracy]
  • The above figure shows FLOPS vs. ImageNet Accuracy.
  • The figure at the top of the story shows Model Size vs. ImageNet Accuracy.

The EfficientNet models are not only small, but also computationally cheaper.

  • EfficientNet-B3 achieves higher accuracy than ResNeXt-101 using 18× fewer FLOPS.
  • Latency is measured with batch size 1 on a single core of Intel Xeon CPU E5–2690.
  • EfficientNet-B1 runs 5.7× faster than the widely used ResNet-152, while EfficientNet-B7 runs about 6.1× faster than GPipe.

4.4. Transfer Learning Results for EfficientNet

  • Models are pretrained on ImageNet and fine-tuned on new datasets.
  • Compared with NASNet-A and Inception-v4, EfficientNet models achieve better accuracy with 4.7× average (up to 21×) parameter reduction.
  • EfficientNets consistently achieve better accuracy with an order of magnitude fewer parameters than existing models, including ResNet, DenseNet, Inception-v4, and NASNet.

4.5. Class Activation Map (CAM)

[Figure: Class activation maps (CAM) for models with different scaling methods]
  • As shown above, the model with compound scaling tends to focus on more relevant regions with more object details, while other models either lack object details or are unable to capture all objects in the image.

Written by

PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG
