Review — ciFAIR: Do We Train on Test Data? Purging CIFAR of Near-Duplicates

A Duplicate-Free Variant of the CIFAR Test Set

3 min read · Mar 8, 2022

Do We Train on Test Data? Purging CIFAR of Near-Duplicates,
ciFAIR, by Friedrich Schiller University Jena
2020 MDPI J. Imaging (Sik-Ho Tsang @ Medium)
Image Classification, Image Dataset, CIFAR

About 3.3% of the CIFAR-10 test images and 10% of the CIFAR-100 test images are found to have duplicates in the corresponding training set. This raises the question of whether models score well on these images by memorization rather than generalization.
The “fair CIFAR” (ciFAIR) dataset is constructed, where all duplicates in the test sets are replaced with new images sampled from the same domain.

Outline

1. CIFAR Duplicates
2. Duplicate Statistics
3. The Duplicate-Free ciFAIR Test Dataset
4. Experimental Results

1. CIFAR Duplicates

**Examples for different types of duplicates between the CIFAR-100 test and training set**

The above figure shows examples of test images that have duplicates in the training set.
To find the duplicates, a GUI is developed that shows each test image next to its candidate duplicate.

Using the above GUI, the annotator can inspect the test image and its duplicate, their distance in the feature space, and a pixel-wise difference image.
Exact Duplicate: Almost all pixels in the two images are approximately identical.
Near-Duplicate: The content of the images is exactly the same, i.e., both originated from the same camera shot. However, different post-processing might have been applied to this original scene, e.g., color shifts, translations, scaling, etc.
Very Similar: The contents of the two images are different but highly similar, so that the difference can only be spotted at second glance.
Different: The pair does not belong to any other category.
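The candidate pairs shown to the annotator can be found with a nearest-neighbor search in feature space. The following is only a minimal sketch of that idea, not the authors' implementation: it assumes feature vectors have already been extracted by some CNN, and it assumes cosine distance, which the text above does not specify. The function names `nearest_training_neighbors` and `difference_image` are hypothetical.

```python
import numpy as np


def nearest_training_neighbors(test_feats, train_feats):
    """For each test feature vector, return the index of its nearest
    training vector and the cosine distance to it.

    Assumption: cosine distance is used; the source only says
    "distance in the feature space".
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    test_n = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    train_n = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = test_n @ train_n.T                      # shape: (n_test, n_train)
    idx = sims.argmax(axis=1)                      # most similar training image
    dist = 1.0 - sims[np.arange(len(idx)), idx]    # cosine distance
    return idx, dist


def difference_image(img_a, img_b):
    """Absolute pixel-wise difference of two uint8 images of equal shape."""
    # Cast to a signed type first so the subtraction cannot wrap around.
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    return diff.astype(np.uint8)
```

A pair with a small distance and a nearly black difference image would be a likely exact duplicate, while the final category is still assigned by the human annotator.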

2. Duplicate Statistics

**Duplicates per type between test and training set (blue) and within the test set (orange)**

It is worth noting that there are no exact duplicates in CIFAR-10 at all, as opposed to CIFAR-100.
There are 891 duplicates from the CIFAR-100 test set in the training set and 104 duplicates within the test set itself. In total, 10% of the test images have duplicates.
The situation is better for CIFAR-10, where there are 286 duplicates in the training set and 39 within the test set, amounting to 3.25% of the test set.
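As a quick check, the quoted percentages can be reproduced from the duplicate counts. This small sketch assumes that the two counts are simply added and that each test set holds 10,000 images, which is the standard CIFAR test set size; the helper name `duplicate_share` is hypothetical.

```python
CIFAR_TEST_SET_SIZE = 10_000  # CIFAR-10 and CIFAR-100 each have 10,000 test images


def duplicate_share(dups_in_train, dups_in_test, test_size=CIFAR_TEST_SET_SIZE):
    """Percentage of test images that have a duplicate, assuming the
    test-vs-train and within-test counts can simply be added."""
    return 100.0 * (dups_in_train + dups_in_test) / test_size


cifar100_share = duplicate_share(891, 104)  # (891 + 104) / 10,000
cifar10_share = duplicate_share(286, 39)    # (286 + 39) / 10,000
```

Under these assumptions, CIFAR-100 comes out at 9.95%, which the text rounds to 10%, and CIFAR-10 at 3.25%.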

**The classes with the most duplicates**

3. The Duplicate-Free ciFAIR Test Dataset

**GUI for replacement candidate selection**

Each replacement candidate was inspected manually in a graphical user interface, which displayed the candidate and the three nearest neighbors in the feature space from the existing training and test sets.
A candidate is approved for inclusion in the new test set only when it is not considered a duplicate of any of its three nearest neighbors.
The modified datasets are named ciFAIR-10 and ciFAIR-100 (“fair CIFAR”).
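The retrieval step behind this selection GUI can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses Euclidean distance on precomputed feature vectors, and the callback `is_duplicate` is a hypothetical stand-in for the annotator's manual judgment.

```python
import heapq
import math


def three_nearest_neighbors(candidate, existing, k=3):
    """Return (distance, index) pairs for the k feature vectors in
    `existing` (training and test sets combined) closest to `candidate`.

    Assumption: Euclidean distance; the source does not name the metric.
    """
    distances = ((math.dist(candidate, feat), i) for i, feat in enumerate(existing))
    return heapq.nsmallest(k, distances)


def approve(candidate, existing, is_duplicate):
    """Approve a replacement candidate only if none of its three nearest
    neighbors is judged to be a duplicate of it.

    `is_duplicate(a, b)` stands in for the human annotator's decision.
    """
    neighbors = three_nearest_neighbors(candidate, existing)
    return not any(is_duplicate(candidate, existing[i]) for _, i in neighbors)
```

In the actual procedure the decision is made by a person looking at the images, so the callback only marks where that judgment enters the loop.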

4. Experimental Results

**Classification error rate of various CNN architectures on the original CIFAR test sets and the modified ciFAIR test sets**

On the duplicate-free ciFAIR test sets, classification accuracy drops significantly, by between 9% and 14% relative to the performance on the original test sets. This holds across different kinds of networks, such as ResNet, WRN, DenseNet, ResNeXt, and PyramidNet.

It is surprising that the training and test sets of the CIFAR datasets share duplicates.