Review: ImageNet-V2

A New Test Set for ImageNet

Sik-Ho Tsang
Feb 7, 2022
(a) Original ImageNet Validation Set, (b) New Test Set

Do ImageNet Classifiers Generalize to ImageNet?
ImageNet-V2, by UC Berkeley
2019 ICML, Over 400 Citations (Sik-Ho Tsang @ Medium)
Image Classification, ImageNet Dataset

  • Test sets that are re-used excessively raise the concern that models overfit to them.
  • By closely following the original dataset creation processes, new test sets are built for the CIFAR-10 and ImageNet datasets.
  • A broad range of models is evaluated, and accuracy drops by 3% to 15% on CIFAR-10 and by 11% to 14% on ImageNet.


  1. The Reason to Introduce Extra Test Set
  2. Potential Causes of Accuracy Drops
  3. Dataset Construction
  4. Experimental Results

1. The Reason to Introduce Extra Test Set

  • The overarching goal of machine learning is to produce models that generalize. The generalization is quantified by measuring the performance of a model on a held-out test set.
  • Conventional wisdom suggests that accuracy drops on a new test set arise because the models have been adapted to the specific images in the original test sets, e.g., via extensive hyperparameter tuning.

So, Do ImageNet Classifiers Generalize to ImageNet?

  • In this paper, a new test set is introduced to answer this question.

2. Potential Causes of Accuracy Drops

2.1. Test Set as Population

  • The standard classification setup is adopted and the existence of a “true” underlying data distribution D over labeled examples (x, y) is posited.
  • The overall goal in classification is to find a model $\hat{f}$ that minimizes the population loss:

    $$L_D(\hat{f}) = \mathbb{E}_{(x,y)\sim D}\big[\mathbb{I}[\hat{f}(x) \neq y]\big]$$

  • Since we usually do not know the distribution D, we instead measure the performance of a trained classifier via a test set S drawn from the distribution D:

    $$L_S(\hat{f}) = \frac{1}{|S|}\sum_{(x,y)\in S}\mathbb{I}[\hat{f}(x) \neq y]$$

This test error $L_S(\hat{f})$ is treated as a proxy for the population loss $L_D(\hat{f})$.

If a model $\hat{f}$ achieves a low test error, we assume that it will perform similarly well on future examples from the distribution D.
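
To make the test error concrete, here is a minimal sketch in Python of the misclassification rate on a test set. The names `zero_one_error`, `sign_classifier` and `toy_test_set` are made up for illustration and do not come from the paper.

```python
from typing import Callable, Sequence, Tuple, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def zero_one_error(model: Callable[[X], Y], test_set: Sequence[Tuple[X, Y]]) -> float:
    """Fraction of (x, y) pairs in the test set that the model misclassifies."""
    if not test_set:
        raise ValueError("test set must be non-empty")
    mistakes = sum(1 for x, y in test_set if model(x) != y)
    return mistakes / len(test_set)


# Toy stand-in for a trained classifier (hypothetical): predicts the sign of a number.
def sign_classifier(x: float) -> str:
    return "pos" if x >= 0 else "neg"


# Toy labeled examples (hypothetical); the third one is misclassified.
toy_test_set = [(1.5, "pos"), (-2.0, "neg"), (0.3, "neg"), (-0.1, "neg")]
print(zero_one_error(sign_classifier, toy_test_set))  # 1 mistake out of 4 -> 0.25
```

The test accuracy is simply one minus this value.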

  • (I think people who know data science will know the above stuff.)

2.2. Decomposition of Loss Difference

  • In this paper, a new test set S’ is collected from a data distribution D’ that is carefully controlled to resemble the original distribution D.
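  • Ideally, the original test error $L_S(\hat{f})$ and the new test error $L_{S'}(\hat{f})$ would then match up to the random sampling error.
  • However, accuracy drops on the new test set (results are in Section 4).
  • To understand this accuracy drop in more detail, the difference between $L_S(\hat{f})$ and $L_{S'}(\hat{f})$ is decomposed into three parts (dropping the dependence on $\hat{f}$):

    $$L_S - L_{S'} = \underbrace{(L_S - L_D)}_{\text{Adaptivity gap}} + \underbrace{(L_D - L_{D'})}_{\text{Distribution gap}} + \underbrace{(L_{D'} - L_{S'})}_{\text{Generalization gap}}$$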

2.3. Generalization Gap

  • The third term is the standard generalization gap commonly studied in machine learning. It is determined solely by the random sampling error.
  • With 10,000 data points (as in the proposed new ImageNet test set), a Clopper-Pearson 95% confidence interval for the test accuracy has a size of at most ±1%. Increasing the confidence level to 99.99% yields a confidence interval of size at most ±2%.
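
The ±1% figure can be checked numerically. Below is a rough sketch of the Clopper-Pearson interval using only the Python standard library: the binomial tail probabilities are inverted by bisection. The function names `binom_cdf` and `clopper_pearson` are my own, and the 50% accuracy case is used because it gives the widest interval. This is not the authors' code.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    log_n_fact = math.lgamma(n + 1)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (log_n_fact - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def clopper_pearson(k, n, confidence=0.95, iters=40):
    """Exact (Clopper-Pearson) interval for k correct predictions out of n."""
    alpha = 1.0 - confidence

    def solve(tail, target):
        # tail(p) is increasing in p; bisect for the p where tail(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            if tail(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2.0)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: 1.0 - binom_cdf(k, n, p), 1.0 - alpha / 2.0)
    return lower, upper


# Widest case: 50% accuracy on a 10,000-image test set.
low, high = clopper_pearson(5000, 10_000, confidence=0.95)
print(f"95% interval: [{low:.4f}, {high:.4f}], half-width about {(high - low) / 2:.4f}")
```

Passing `confidence=0.9999` to the same function gives the wider interval mentioned above.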

2.4. Adaptivity Gap

  • The first term measures how much adapting the model $\hat{f}$ to the test set S causes the test error $L_S$ to underestimate the population loss $L_D$.

2.5. Distribution Gap

  • The second term quantifies how much the change from the original distribution D to the new distribution D’ affects the model $\hat{f}$.
  • This term is not influenced by random effects; it quantifies the systematic difference between sampling the original and new test sets.
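
As a small sanity check of how the three gaps fit together, the sketch below plugs in made-up error rates and confirms that the gaps sum to the observed difference between the two test errors. All four numbers are hypothetical and chosen only for illustration; in practice the two population losses cannot be observed directly, which is why the individual gaps are hard to measure.

```python
import math


def decompose(err_s, err_d, err_d_new, err_s_new):
    """Split err_s - err_s_new into the adaptivity, distribution and generalization gaps."""
    return {
        "adaptivity_gap": err_s - err_d,              # first term
        "distribution_gap": err_d - err_d_new,        # second term
        "generalization_gap": err_d_new - err_s_new,  # third term
    }


# Hypothetical error rates, NOT taken from the paper.
err_s = 0.24       # error on the original test set S
err_d = 0.25       # population error under D (unobservable in practice)
err_d_new = 0.35   # population error under D' (unobservable in practice)
err_s_new = 0.36   # error on the new test set S'

gaps = decompose(err_s, err_d, err_d_new, err_s_new)
for name, value in gaps.items():
    print(f"{name}: {value:+.2f}")

# The three gaps telescope to the observed difference between the test errors.
print(math.isclose(sum(gaps.values()), err_s - err_s_new))
```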

One way to test generalization would be to evaluate existing models on new i.i.d. data from the original test distribution.

3. Dataset Construction

The pipeline for the new ImageNet test set
  • Each dataset comes with specific biases. For instance, CIFAR-10 and ImageNet were assembled in the late 2000s, and some classes such as car or cell_phone have changed significantly over the past decade.
  • Such biases are avoided by drawing new images from the same source as CIFAR-10 and ImageNet.

3.1. Gathering Data

Randomly selected images from the original and new CIFAR-10 test sets.
  • For CIFAR-10, the source was the larger Tiny Images dataset [55].
  • For ImageNet, the original process of using the Flickr image hosting service is followed, and only images uploaded in a time frame similar to that of the original ImageNet are considered.

3.2. Cleaning Data

  • Similar to the original approaches, two graduate student authors of the paper took on the role of the CIFAR-10 labelers, and MTurk workers are employed for the new ImageNet test set.
  • After collecting a set of correctly labeled images, the final test sets are sampled from the filtered candidate pool. Test set sizes of 2,000 for CIFAR-10 and 10,000 for ImageNet are chosen.
  • While these are smaller than the original test sets, the sample sizes are still large enough to obtain 95% confidence intervals of about ±1%.
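
The last sampling step can be sketched as below. This assumes a simple class-balanced uniform draw from the cleaned candidate pool, which is a simplification on my part and not the paper's exact selection procedure. The function name and the `(image_id, label)` pool format are also hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_balanced_test_set(candidates, per_class, seed=0):
    """Draw `per_class` images uniformly at random for each label in the candidate pool.

    `candidates` is an iterable of (image_id, label) pairs that already passed cleaning.
    """
    by_label = defaultdict(list)
    for image_id, label in candidates:
        by_label[label].append(image_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    test_set = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"label {label!r} has only {len(pool)} candidates")
        for image_id in rng.sample(pool, per_class):
            test_set.append((image_id, label))
    return test_set


# Toy candidate pool (hypothetical): 30 images spread over 3 classes.
pool = [(f"img{i}", i % 3) for i in range(30)]
subset = sample_balanced_test_set(pool, per_class=4)
print(len(subset), Counter(label for _, label in subset))
```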

4. Experimental Results

Model accuracy on the original test sets vs. the new test sets
  • The plots reveal two main phenomena:
  1. There is a significant drop in accuracy from the original to the new test sets.
  2. The model accuracies closely follow a linear function with slope greater than 1 (1.7 for CIFAR-10 and 1.1 for ImageNet). This means that every percentage point of progress on the original test set translates into more than one percentage point on the new test set.
  • On CIFAR-10:
  • On ImageNet:
  • In contrast to a scenario with strong adaptive overfitting, neither dataset sees diminishing returns in accuracy scores when going from the original to the new test sets.
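
To illustrate what a slope greater than 1 means, here is a small least-squares sketch. The accuracy numbers are made up so that the fit comes out with slope 1.1; they are not results from the paper, and `fit_line` is just a helper name for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical top-1 accuracies (%) of three models, NOT taken from the paper.
original_acc = [60.0, 70.0, 80.0]
new_acc = [46.0, 57.0, 68.0]

slope, intercept = fit_line(original_acc, new_acc)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")

# With a slope above 1, the drop shrinks as the original accuracy rises.
drops = [o - n for o, n in zip(original_acc, new_acc)]
print(drops)  # [14.0, 13.0, 12.0]
```

In this toy example, each extra point of original accuracy buys 1.1 points on the new test set, so the better models lose less.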

The experiments show that the relative order of models is almost exactly preserved on the new test sets: The models with highest accuracy on the original test sets are still the models with highest accuracy on the new test sets. Moreover, there are no diminishing returns in accuracy.
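
One common way to put a number on "the relative order is preserved" is a rank correlation such as Kendall's tau. The paper may report this differently, so treat the sketch below, with its made-up accuracies, as an illustration only.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs divided by all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical accuracies (%) of three models on the original and new test sets.
tau_same_order = kendall_tau([60.0, 70.0, 80.0], [46.0, 57.0, 68.0])
# Same models, but the two best ones swap places on the new test set.
tau_one_swap = kendall_tau([60.0, 70.0, 80.0], [46.0, 68.0, 57.0])
print(tau_same_order, tau_one_swap)
```

A value of 1.0 means the ranking is identical on both test sets; every swapped pair pulls the value down.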

In fact, every percentage point of accuracy improvement on the original test set translates to a larger improvement on the new test sets. So although later models could have been adapted more to the test set, they see smaller drops in accuracy.

(The paper runs to 76 pages, so I only read it roughly and review it briefly here. If interested, please feel free to read the paper directly.)


