Review — FixRes: Fixing the Train-Test Resolution Discrepancy

Larger Crop During Testing, Plus Fine-Tuning, Improves Accuracy

Train-Test Resolution Discrepancy

Fixing the Train-Test Resolution Discrepancy
FixRes, by Facebook AI Research
2019 NeurIPS, Over 200 Citations (Sik-Ho Tsang @ Medium)
Image Classification

  • Existing augmentations induce a significant discrepancy between the size of the objects seen by the classifier at train and test time.
  • Two techniques are proposed to solve this problem.
  1. Different train and test resolutions are employed.
  2. A computationally cheap fine-tuning is used.


  1. Train-Test Resolution Discrepancy
  2. Proposed techniques
  3. Experimental Results

1. Significant Discrepancy Between the Size of the Objects

  • The training practice is to extract a rectangle with random coordinates from the image, which artificially increases the amount of training data. This region is called the Region of Classification (RoC).
  • At test time, the RoC is instead set to a square covering the central part of the image, which results in the extraction of a so-called “center crop”.

While the crops extracted at training and test time have the same size, they arise from different RoCs, which skews the distribution of data seen by the CNN.

2. Proposed techniques

2.1. Larger Test Crops

  • One way is to increase the test crop size, which reduces the train-test mismatch in apparent object size.
  • However, this skews the activation statistics seen by the final layers. Thus, fine-tuning is needed.

2.2. Resolution Adaptation by Fine-Tuning

  • Fine-tuning is restricted to the very last layers of the network, using the test-time crop size on the same training set.
  • The batch normalization that precedes the global pooling is also fine-tuned.

3. Results

3.1. Single Resolution Testing

Top-1 accuracy of the ResNet-50 according to the test time resolution. Left: without adaptation, right: after resolution adaptation
  • Ktrain: the crop size during training; Ktest: crop size during testing.
  • Right: With fine-tuning, the best result (79.0%) is obtained with the classic ResNet-50 trained at Ktrain = 224.
  • Compared to no fine-tuning, the Ktest at which the maximal accuracy is obtained increases from Ktest = 288 to Ktest = 384.
  • If we prefer to reduce the training resolution, Ktrain = 128 with testing at Ktest = 224 yields 77.1% accuracy, which is above the baseline trained at full test resolution without fine-tuning.

3.2. Multiple Resolution Testing

  • To improve the accuracy, the image is classified at several resolutions and the classification scores are averaged.
  • With Ktrain = 128 and Ktest = [256, 192], the accuracy is 78.0%.
  • With Ktrain = 224 and Ktest = [384, 352], the results are improved from the single-crop result of 79.0% to 79.5%.
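The multi-resolution averaging can be sketched as follows. The paper averages classification scores; averaging softmax probabilities, as done here, is one common reading of that:

```python
import torch

def multi_res_predict(model, image_batches):
    # image_batches: one tensor of shape (N, 3, K, K) per test resolution
    # (e.g. Ktest in [384, 352]); softmax scores are averaged over them.
    model.eval()
    with torch.no_grad():
        probs = [model(x).softmax(dim=1) for x in image_batches]
    return torch.stack(probs).mean(dim=0)
```

The model must accept variable input resolutions, which holds for fully convolutional backbones ending in global average pooling (as in ResNet-50).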

3.3. Larger Networks

Application to larger networks: Resulting top-1 accuracy
  • For PNASNet-5-Large, it is beneficial to fine-tune more than just the batch normalization and the classifier: the last three cells are also fine-tuned. Increasing the resolution to Ktest = 480 raises the accuracy by 1 percentage point.
  • By combining this with an ensemble of 10 crops at test time, 83.9% accuracy is obtained.
  • With ResNeXt-101 32x48d, increasing the resolution to Ktest = 320 raises the accuracy by 1.0 percentage point, reaching 86.4% top-1 accuracy.

3.4. Speed-Accuracy Trade-Off

  • In the low-resolution training regime (Ktrain = 128), the additional fine-tuning required by the method increases the training time from 111.8 h to 124.1 h (+11%). This yields an accuracy of 77.1%, which outperforms the network trained at the native resolution of 224 in 133.9 h.
  • Fine-tuning at Ktest = 384 produces a network with higher accuracy than one trained natively at that resolution, and the training is 2.3× faster: 151.5 h instead of 348.5 h.

3.5. SOTA Comparison

State of the art on ImageNet with ResNet-50 architectures and with all types of architecture (Single Crop evaluation)

With 86.4% Top-1 accuracy and 98.0% Top-5 accuracy, it is the first model to exceed 86.0% Top-1 accuracy and reach 98.0% Top-5 accuracy on the ImageNet-2012 benchmark.

  • It exceeds the previous state of the art by 1.0% absolute in Top-1 accuracy and 0.4% absolute in Top-5 accuracy.

3.6. Transfer Learning Tasks

Transfer learning task with our method and comparison with the state of the art
  1. The network is initialized with the weights learned on ImageNet.
  2. It is then trained entirely for several epochs at a certain resolution.
  3. Finally, the last batch norm and the fully connected layer are fine-tuned at a higher resolution.

In all cases, the proposed method improves the performance of the baseline.

  • The method is called FixRes because the authors later published FixEfficientNet, which refers to this paper as FixRes.
