Review — ResNet-RS: Re-Scaling ResNet

With Better Rescaling for ResNet, Outperforms EfficientNet

Improving ResNet as ResNet-RS, outperforms EfficientNet on the speed-accuracy Pareto curve

Revisiting ResNets: Improved Training and Scaling Strategies
ResNet-RS, by Google Brain, and UC Berkeley
2021 NeurIPS, Over 50 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Residual Network, ResNet

  • Two new scaling strategies are offered:
  1. Scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise).
  2. Increase image resolution more slowly than previously recommended.


  1. Improved Training Methods
  2. Improved Scaling Strategies
  3. Experimental Results

1. Improved Training Methods

  • Since the original ResNet uses old training recipes, before introducing new scaling strategies, training methods are improved first.
(Left) Additive study of training , regularization and architecture improvements; (Right) Decreasing weight decay improves performance when combining regularization methods such as Dropout (DO), Stochastic Depth (SD), label smoothing (LS) in Inception-v3 and RandAugment (RA).

1.1. Training, Regularization and Architecture Improvements (Left)

  • The baseline ResNet-200 gets 79.0% top-1 accuracy.
  • Its performance is improved to 82.2% (+3.2%) through improved training methods alone without any architectural changes.
  • Adding two common and simple architectural changes (Squeeze-and-Excitation in SENet, and ResNet-D in Bags of Tricks) further boosts the performance to 83.4%.

Training methods alone cause 3/4 of the total improvement.

1.2. Importance of Decreasing Weight Decay When Combining Regularization Methods (Right)

  • The amount of weight decay is decreased with the use of other regularization methods, in order for better performance.

The intuition is that since weight decay acts as a regularizer, its value must be decreased in order to not overly regularize the model when combining many techniques.

2. Improved Scaling Strategies

  • An extensive search is performed on ImageNet over width multipliers in [0.25,0.5,1.0,1.5,2.0], depths of [26,50,101,200,300,350,400] and resolutions of [128,160,224,320,448], using 350 epochs.

2.1. Strategy #1 — Depth Scaling in Regimes Where Overfitting Can Occur

Scaling of ResNets across depth, width, image resolution and training epochs

2.1.1. Right: Depth scaling outperforms width scaling for longer epoch regimes

  • Scaling the width is subject to overfitting and sometimes hurts performance even with increased regularization. This is due to the larger increase in parameters when scaling the width.

2.1.2. Left & Middle: Width scaling outperforms depth scaling for shorter epoch regimes

  • In contrast, width scaling is better when only training for 10 epochs (Left). For 100 epochs (Middle), the best performing scaling strategy varies between depth scaling and width scaling, depending

2.2 Strategy #2 — Slow Image Resolution Scaling

Scaling properties of ResNets across varying model scales.
  • For the smaller models, we observe an overall power law trend between error and FLOPs. However, the trend breaks for larger model sizes.
  • Larger image resolutions yield diminishing returns.
  • Therefore authors propose to increase the image resolution more gradually than previous works. (600 for EfficientNet-B7, 800 for EfficientNet-L2, 400+ for ResNeSt and TResNet.)

3. Experimental Results

3.1. ResNet-RS on a Speed-Accuracy Basis

Details of ResNet-RS models in Pareto curve
Speed-Accuracy Pareto curve comparing ResNets-RS to EfficientNet

ResNet-RS match EfficientNets’ performance while being 1.7×-2.7× faster on TPUs (2.1×-3.3× faster on GPUs).

  • These speed-ups are superior to those obtained by TResNet and ResNeSt.

3.2. Semi-Supervised Learning with ResNet-RS

ResNet-RS are efficient semi-supervised learners
  • ResNets-RS is trained on the combination of 1.3M labeled ImageNet images and 130M pseudo-labeled images, in a similar fashion to Noisy Student.
  • The pseudo labels are generated from an EfficientNet-L2.

ResNet-RS models are very strong in the semi-supervised learning setup as well, achieving a strong 86.2% top-1 ImageNet accuracy while being 4.7× faster on TPU (5.5× on GPU) than the corresponding EfficientNet.

3.3. Transfer Learning to Downstream Tasks with ResNet-RS

Representations from supervised learning with improved training strategies rival or outperform representations from state-of-the-art self-supervised learning algorithms
  • The improved training strategies (RS) greatly outperforms the baseline supervised training, which highlights the importance of using improved supervised training techniques when comparing to self-supervised learning algorithms.

3.4 Revised 3D ResNet for Video Classification

Additive study of regularization, training and architecture improvements with 3D-ResNet on video classification
  • The training strategies extend to video classification, yielding a combined improvement from 73.4% to 77.4% (+4.0%).
  • The ResNet-D and Squeeze-and-Excitation architectural changes further improve the performance to 78.2% (+0.8%).

Most of the improvement can be obtained without architectural changes.

  • (There are also many results in the appendix.)



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store