Review: ImageNet-V2

A New Test Set for ImageNet

(a) Original ImageNet Validation Set, (b) New Test Set
  • The concern of adaptive overfitting arises because the standard test sets have been re-used excessively over the years.
  • By closely following the original dataset creation processes, new test sets for the CIFAR-10 and ImageNet datasets are built.
  • A broad range of models is evaluated, showing accuracy drops of 3%–15% on CIFAR-10 and 11%–14% on ImageNet.


  1. The Reason to Introduce an Extra Test Set
  2. Potential Causes of Accuracy Drops
  3. Dataset Construction
  4. Experimental Results

1. The Reason to Introduce an Extra Test Set

  • The overarching goal of machine learning is to produce models that generalize. Generalization is quantified by measuring the performance of a model on a held-out test set.
  • However, the same test sets have been re-used for years of model development, which raises the question of whether accuracy would drop on genuinely unseen data. Conventional wisdom suggests that such drops arise because the models have been adapted to the specific images in the original test sets, e.g., via extensive hyperparameter tuning.
  • In this paper, new test sets are introduced to answer this question.

2. Potential Causes of Accuracy Drops

2.1. Test Set as Population

  • The standard classification setup is adopted, and the existence of a “true” underlying data distribution D over labeled examples (x, y) is posited.
  • The overall goal in classification is to find a model ^f that minimizes the population loss, i.e., the expected misclassification rate under D:
    LD(^f) = E_{(x,y)~D}[ 1{^f(x) ≠ y} ]
  • Since we usually do not know the distribution D, we instead measure the performance of a trained classifier via a test set S drawn from the distribution D:
    LS(^f) = (1/|S|) Σ_{(x,y)∈S} 1{^f(x) ≠ y}
  • (I think readers who know data science will already be familiar with this setup.)
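In code, the empirical 0–1 loss LS(^f) is simply the misclassification rate on the test set. A minimal sketch (the toy model and test set below are hypothetical stand-ins, not from the paper):

```python
def empirical_loss(model, test_set):
    """Empirical 0-1 loss L_S(f): fraction of test examples the model misclassifies."""
    errors = sum(1 for x, y in test_set if model(x) != y)
    return errors / len(test_set)

# Toy example: a parity "classifier" evaluated on four labeled points.
model = lambda x: x % 2
test_set = [(1, 1), (2, 0), (3, 0), (4, 0)]
print(empirical_loss(model, test_set))  # one error out of four examples: 0.25
```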

2.2. Decomposition of Loss Difference

  • In this paper, a new test set S’ is collected from a data distribution D’ that is carefully controlled to resemble the original distribution D.
  • Ideally, the original test accuracy LS(^f) and new test accuracy LS’(^f) would then match up to the random sampling error.
  • However, a significant accuracy drop is observed on the new test sets (results in the following sections).
  • To understand this accuracy drop in more detail, the difference between LS(^f) and LS’(^f) is decomposed into three parts (dropping ^f for brevity):
    LS − LS’ = (LS − LD) + (LD − LD’) + (LD’ − LS’)
    These three terms are the adaptivity gap, the distribution gap, and the generalization gap, respectively.

2.3. Generalization Gap

  • The third term is the standard generalization gap commonly studied in machine learning. It is determined solely by the random sampling error.
  • With 10,000 data points (as in the proposed new ImageNet test set), a Clopper-Pearson 95% confidence interval for the test accuracy has size of at most ±1%. Increasing the confidence level to 99.99% yields a confidence interval of size at most ±2%.
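The ±1% figure can be checked numerically. Below is a stdlib-only sketch of the exact (Clopper-Pearson) binomial confidence interval, computed by bisection on the binomial tail probabilities; at n = 10,000 and roughly 70% accuracy, the 95% interval is indeed about ±1% wide:

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for numerical stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    def bisect(f, target, iters=50):
        # f is monotonically increasing in p; solve f(p) = target on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2.0
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: P(X >= k | p) grows with p; upper bound: P(X <= k | p) shrinks with p.
    lower = 0.0 if k == 0 else bisect(lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2.0)
    upper = 1.0 if k == n else bisect(lambda p: -binom_cdf(k, n, p), -alpha / 2.0)
    return lower, upper

lo, hi = clopper_pearson(7000, 10000)  # e.g., 70% accuracy on 10,000 test examples
print(f"95% CI: [{lo:.4f}, {hi:.4f}], half-width ~{(hi - lo) / 2:.4f}")
```

With scipy available, the equivalent Beta-quantile form via `scipy.stats.beta.ppf` avoids the explicit bisection.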

2.4. Adaptivity Gap

  • The first term measures how much adapting the model ^f to the test set S causes the test error LS to underestimate the population loss LD.

2.5. Distribution Gap

  • The second term quantifies how much the change from the original distribution D to the new distribution D’ affects the model ^f.
  • This term should not be influenced by random effects; instead, it quantifies the systematic difference between sampling the original and the new test sets.

3. Dataset Construction

The pipeline for the new ImageNet test set
  • Each dataset comes with specific biases. For instance, CIFAR-10 and ImageNet were assembled in the late 2000s, and some classes such as car or cell_phone have changed significantly over the past decade.
  • Such biases are avoided by drawing new images from the same source as CIFAR-10 and ImageNet.

3.1. Gathering Data

Randomly selected images from the original and new CIFAR-10 test sets.
  • For CIFAR-10, this was the larger Tiny Image dataset [55].
  • For ImageNet, the original process of sourcing images from the Flickr image hosting service is followed, and only images uploaded in a time frame similar to that of the original ImageNet collection are considered.

3.2. Cleaning Data

  • Similar to the original approaches, two graduate student authors of the paper acted as labelers for the new CIFAR-10 test set, and MTurk workers were employed for the new ImageNet test set.
  • After collecting a set of correctly labeled images, the final test sets are sampled from the filtered candidate pool. Test set sizes of 2,000 for CIFAR-10 and 10,000 for ImageNet were chosen.
  • While these are smaller than the original test sets, the sample sizes are still large enough to obtain 95% confidence intervals of about ±1%.
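The final sampling step can be sketched as drawing a class-balanced subset from the filtered candidate pool. The function and data names below are hypothetical illustrations, not the paper's actual code, and this sketch only shows the class-balancing part:

```python
import random

def sample_test_set(candidates_by_class, per_class, seed=0):
    """Draw a class-balanced test set from a pool of correctly labeled candidates.

    candidates_by_class: dict mapping class label -> list of candidate image ids.
    """
    rng = random.Random(seed)
    test_set = []
    for label in sorted(candidates_by_class):
        chosen = rng.sample(candidates_by_class[label], per_class)
        test_set.extend((image_id, label) for image_id in chosen)
    rng.shuffle(test_set)
    return test_set

# E.g., 10 CIFAR-10 classes x 200 images each = a 2,000-image test set.
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(300)] for c in range(10)}
test_set = sample_test_set(pool, per_class=200)
print(len(test_set))  # 2000
```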

4. Experimental Results

Model accuracy on the original test sets vs. the new test sets
  • The plots reveal two main phenomena:
  1. There is a significant drop in accuracy from the original to the new test sets.
  2. The model accuracies closely follow a linear function with slope greater than 1 (1.7 for CIFAR-10 and 1.1 for ImageNet). This means that every percentage point of progress on the original test set translates into more than one percentage point on the new test set.
  • On CIFAR-10: accuracies drop by roughly 3% to 15%.
  • On ImageNet: accuracies drop by roughly 11% to 14%.
  • In contrast to a scenario with strong adaptive overfitting, neither dataset sees diminishing returns in accuracy scores when going from the original to the new test sets.
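The reported slope is just an ordinary least-squares fit of new-test accuracy against original-test accuracy. A minimal sketch with synthetic, illustrative accuracy pairs (not the paper's actual numbers) lying on a slope-1.7 line:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic CIFAR-10-like accuracy pairs constructed on a slope-1.7 line.
orig_acc = [83.0, 88.0, 93.0, 98.0]
new_acc = [1.7 * a - 73.0 for a in orig_acc]
slope, intercept = linear_fit(orig_acc, new_acc)
print(round(slope, 2))  # 1.7
```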


