Review — ResNet Strikes Back: An Improved Training Procedure in timm

ResNet Beats Vision Transformer (ViT) With Better Training Strategies

Sik-Ho Tsang
4 min read · Feb 10, 2022
The ResNet Empire Strikes Back

ResNet Strikes Back: An Improved Training Procedure in timm
ResNet Strikes Back, by Independent researcher, Facebook AI, and Sorbonne University, 2021 NeurIPS, Over 10 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Residual Network, ResNet, PyTorch Image Models (timm)

  • When comparing architectures, most papers compare against the original ResNet results, which were reported in fairly old publications; ResNet was therefore trained with potentially weaker recipes.
  • This paper aims to find the best performance of ResNet-50 under competitive training settings, which can serve as a better baseline.


  1. Model Optimization
  2. Training Procedures & Strategies
  3. Experimental Results

1. Model Optimization

  • Schematically, the accuracy of a model is a function of the following form:

$$\text{accuracy}(\text{model}) = f(A, T, N)$$

  • where A is the architecture design, T is the training setting along with its hyperparameters, and N is the measurement noise.
  • Ideally, i.e., without resource and time constraints, one would adopt the best possible training procedure for each architecture:

$$T^{\star}(A) = \arg\max_{T} f(A, T, N)$$

  • When optimizing jointly over (A, T), there is no guarantee that the optimal choice $T_1$ for a given architecture $A_1$ remains the best for another model design $A_2$.
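The last point can be made concrete with a toy lookup table. This is only a sketch: the architecture names, recipe names, and accuracy values below are made up for illustration and are not results from the paper.

```python
# Made-up accuracies f(A, T); illustrative only, NOT results from the paper.
accuracy = {
    ("arch_1", "recipe_x"): 0.78,
    ("arch_1", "recipe_y"): 0.80,
    ("arch_2", "recipe_x"): 0.81,
    ("arch_2", "recipe_y"): 0.79,
}


def best_recipe(arch):
    """Return the recipe T that maximizes f(A, T) for a fixed architecture A."""
    candidates = {t: acc for (a, t), acc in accuracy.items() if a == arch}
    return max(candidates, key=candidates.get)


best = {arch: best_recipe(arch) for arch in ("arch_1", "arch_2")}
print(best)
```

In this toy table, comparing both architectures under recipe_y alone would rank arch_1 first, while giving each architecture its own best recipe would rank arch_2 first. That is the kind of bias a fixed, possibly outdated baseline recipe can introduce.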

In this paper, the training is optimized so as to maximize the performance of ResNet for the original test resolution of 224×224.

2. Training Procedures & Strategies

2.1. Procedures A1 to A3

  • Procedure A1 aims to provide the best performance for ResNet-50. It is therefore the longest in terms of epochs (600) and training time (4.6 days on one node with 4 V100 32GB GPUs).
  • Procedure A2 is a 300-epoch schedule that is comparable to several modern procedures like DeiT, except with a larger batch size of 2048 and other choices introduced for all of the paper's recipes.
  • Procedure A3 aims to outperform the original ResNet-50 procedure with a short schedule of 100 epochs and a batch size of 2048. It can be trained in 15 hours on 4 V100 16GB GPUs and could be a good setting for exploratory research or studies.
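To get a feel for the cost gap between the procedures, the figures quoted above can be turned into GPU-hours. This is a rough sketch: the `Procedure` class is a hypothetical helper, the A2 wall-clock time is left as `None` because it is not quoted in this review, and 4 GPUs for A2 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Procedure:
    name: str
    epochs: int
    wall_clock_hours: Optional[float]  # None where this review quotes no figure
    num_gpus: int = 4  # quoted for A1 and A3; assumed unchanged for A2

    def gpu_hours(self) -> Optional[float]:
        """Wall-clock hours multiplied by the number of GPUs."""
        if self.wall_clock_hours is None:
            return None
        return self.wall_clock_hours * self.num_gpus


procedures = [
    Procedure("A1", epochs=600, wall_clock_hours=4.6 * 24),  # 4.6 days
    Procedure("A2", epochs=300, wall_clock_hours=None),
    Procedure("A3", epochs=100, wall_clock_hours=15.0),
]

for p in procedures:
    print(p.name, p.epochs, p.gpu_hours())
```

With these numbers A1 costs roughly 442 GPU-hours against 60 for A3, i.e. about 7 times more compute for 6 times more epochs. Note that the two runs use different V100 memory sizes, so the comparison is only indicative.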

2.2. Ingredients included in A1 to A3

  • Loss: multi-label classification objective. mixup and CutMix augmentation synthesize an image from several images that in most cases have different labels, so classification is treated as a multi-label problem: the binary cross-entropy (BCE) loss is used instead of the typical cross-entropy (CE). Label smoothing can be added.

BCE slightly outperforms CE.
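A minimal sketch of the multi-label idea in plain Python follows. It assumes that every class present in the mixed image gets a target of 1 (or 1 minus a smoothing value) and every other class gets 0; the helper names and the logits are hypothetical, and a real training loop would use a framework loss rather than this hand-written one.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy: each class is an independent yes/no problem."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(logits)


def multi_label_target(num_classes, present_classes, smoothing=0.0):
    """Assumed target: 1 - smoothing for each class in the mixed image, else 0."""
    target = [0.0] * num_classes
    for c in present_classes:
        target[c] = 1.0 - smoothing
    return target


# An image mixed from classes 0 and 1, out of 4 classes.
target = multi_label_target(num_classes=4, present_classes=[0, 1])
good_logits = [3.0, 2.0, -3.0, -2.0]  # scores both mixed classes highly
bad_logits = [3.0, -2.0, -3.0, -2.0]  # misses class 1
loss_good = bce_with_logits(good_logits, target)
loss_bad = bce_with_logits(bad_logits, target)
print(loss_good, loss_bad)
```

Unlike softmax CE, the targets here do not have to sum to 1, so the model is not forced to trade off one mixed class against the other.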

  • Data Augmentation: On top of the standard Random Resized Crop (RRC) and horizontal flip (commonly used since GoogLeNet), the timm variants of RandAugment, mixup, and CutMix are used.

This combination was used for instance in DeiT.
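For readers unfamiliar with the two mixing augmentations, here is a generic sketch in plain Python. It follows the commonly published formulations of mixup and CutMix, not timm's exact implementation; the function names are hypothetical, and the Beta parameter of 1.0 is a placeholder rather than the paper's setting.

```python
import math
import random


def mixup(pixels_a, pixels_b, lam):
    """Blend two images (flattened pixel lists) with weight lam on the first."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(pixels_a, pixels_b)]


def cutmix_box(width, height, lam, rng):
    """Pick the rectangle pasted from the second image, plus the adjusted lam.

    The box covers about (1 - lam) of the image before clipping at the borders;
    the returned lam is recomputed from the clipped box area.
    """
    cut_ratio = math.sqrt(1.0 - lam)
    cut_w = int(width * cut_ratio)
    cut_h = int(height * cut_ratio)
    cx = rng.randrange(width)
    cy = rng.randrange(height)
    x0 = max(cx - cut_w // 2, 0)
    x1 = min(cx + cut_w // 2, width)
    y0 = max(cy - cut_h // 2, 0)
    y1 = min(cy + cut_h // 2, height)
    adjusted_lam = 1.0 - (x1 - x0) * (y1 - y0) / (width * height)
    return (x0, y0, x1, y1), adjusted_lam


rng = random.Random(0)
lam = rng.betavariate(1.0, 1.0)  # placeholder Beta parameters
box, adjusted_lam = cutmix_box(224, 224, lam, rng)
print(box, adjusted_lam)
print(mixup([0.0, 1.0], [1.0, 0.0], 0.25))
```

mixup blends every pixel, while CutMix pastes a rectangle; in both cases the resulting image carries content from two source images and therefore usually two labels.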

  • Regularization: In addition to adapting the weight decay, label smoothing (Inception-v3), Repeated-Augmentation [3, 17] (RA) and Stochastic Depth [21] are used.

More regularization is used for longer training schedules.
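As a reminder of what Stochastic Depth does inside a residual block, here is a small plain-Python sketch. It uses the variant that rescales the surviving branch during training (rather than rescaling at test time), which is an assumption about the implementation; `double` is a hypothetical stand-in for a real residual branch, and the drop probability of 0.1 is a placeholder.

```python
import random


def residual_block_output(x, branch, drop_prob, training, rng):
    """Compute x + branch(x), randomly skipping the branch during training."""
    if not training or drop_prob == 0.0:
        return [xi + bi for xi, bi in zip(x, branch(x))]
    if rng.random() < drop_prob:
        return list(x)  # branch dropped: the block acts as the identity
    keep_prob = 1.0 - drop_prob
    # Rescale the kept branch so its expected contribution matches eval mode.
    return [xi + bi / keep_prob for xi, bi in zip(x, branch(x))]


def double(v):
    """Hypothetical stand-in for the convolutional branch of a block."""
    return [2.0 * vi for vi in v]


rng = random.Random(0)
x = [1.0, 2.0]
print(residual_block_output(x, double, drop_prob=0.1, training=True, rng=rng))
print(residual_block_output(x, double, drop_prob=0.1, training=False, rng=rng))
```

At evaluation time the branch is always kept, so the second call is deterministic.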

  • Optimization: Larger batches, e.g., 2048, are used. When combined with repeated augmentation and the BCE loss, the authors found that LAMB makes it easier to consistently achieve good results.
  • They also found it difficult to achieve convergence when using both SGD and BCE.

LAMB with a cosine schedule is used as the default optimizer for training ResNet-50.
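The cosine schedule itself is simple to write down. Below is a sketch of the standard cosine decay without warmup; the base learning rate and step count are placeholders, not the paper's values, and the LAMB update rule is not shown.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Standard cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # placeholder schedule length
base_lr = 1.0      # placeholder; chosen only to make the shape easy to read
for step in (0, 25, 50, 75, 100):
    print(step, round(cosine_lr(step, total_steps, base_lr), 4))
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end of training.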

3. Experimental Results

Ingredients and hyper-parameters used for ResNet-50 training in different papers.

Procedure A1 surpasses the prior state-of-the-art training procedures on ImageNet, such as FixRes and DeiT, for a vanilla ResNet-50 architecture at resolution 224×224.

Comparison on ImageNet classification between other architectures trained with the ResNet-50 optimized training procedure

The paper contains many more details about the experimental setups, as well as other results, so please feel free to read it.


