Review — ciFAIR: Do We Train on Test Data? Purging CIFAR of Near-Duplicates
A Duplicate-Free Variant of the CIFAR Test Set
Do We Train on Test Data? Purging CIFAR of Near-Duplicates,
ciFAIR, by Friedrich Schiller University Jena
2020 MDPI J. Imaging (Sik-Ho Tsang @ Medium)
Image Classification, Image Dataset, CIFAR
- It is found that 3.3% of the CIFAR-10 test images and 10% of the CIFAR-100 test images have duplicates in the training set. So, is part of the test performance obtained by memorization rather than generalization?
- The “fair CIFAR” (ciFAIR) dataset is constructed, where all duplicates in the test sets are replaced with new images sampled from the same domain.
Outline
- CIFAR Duplicates
- Duplicate Statistics
- The Duplicate-Free ciFAIR Test Dataset
- Experimental Results
1. CIFAR Duplicates
- The above figures show examples of the duplicates found in the test sets.
- To find the duplicates, a GUI is developed to show the differences between image pairs.
- Using this GUI, the annotator can inspect a test image and its potential duplicate, their distance in feature space, and a pixel-wise difference image (see the sketch after this list).
- Exact Duplicate: Almost all pixels in the two images are approximately identical.
- Near-Duplicate: The content of the images is exactly the same, i.e., both originated from the same camera shot. However, different post-processing might have been applied to this original scene, e.g., color shifts, translations, scaling, etc.
- Very Similar: The contents of the two images are different but highly similar, so that the difference can only be spotted at second glance.
- Different: The pair does not belong to any other category.
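A minimal sketch of how such duplicate candidates can be mined before manual annotation, assuming CNN features have already been extracted for all images (the function names, array shapes, and similarity measure are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def mine_duplicate_candidates(test_feats, train_feats):
    """For each test image, find its nearest training image in a CNN feature
    space and rank the pairs by similarity; the top pairs are then shown to
    a human annotator in the GUI."""
    # Assume features are L2-normalized, so the dot product is cosine similarity.
    sims = test_feats @ train_feats.T                     # (n_test, n_train)
    nn_idx = np.argmax(sims, axis=1)                      # nearest training image per test image
    nn_sim = sims[np.arange(len(test_feats)), nn_idx]
    order = np.argsort(-nn_sim)                           # most similar pairs first
    return order, nn_idx[order], nn_sim[order]

def pixelwise_difference(img_a, img_b):
    """Absolute per-pixel difference image shown next to each pair in the GUI."""
    return np.abs(img_a.astype(np.int16) - img_b.astype(np.int16)).astype(np.uint8)
```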
2. Duplicate Statistics
- It is worth noting that there are no exact duplicates in CIFAR-10 at all, as opposed to CIFAR-100.
- There are 891 duplicates from the CIFAR-100 test set in the training set and 104 duplicates within the test set itself. In total, 10% of the test images have duplicates.
- The situation is slightly better for CIFAR-10, where there are 286 duplicates in the training set and 39 within the test set itself, amounting to 3.25% of the test set (a quick check of these numbers follows below).
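As a quick sanity check of these percentages (each CIFAR test set contains 10,000 images):

```python
# Duplicates of test images found in the training set plus duplicates within the test set.
cifar100_dups = 891 + 104   # 995 of 10,000 test images
cifar10_dups = 286 + 39     # 325 of 10,000 test images
print(cifar100_dups / 10000)  # 0.0995 -> roughly 10%
print(cifar10_dups / 10000)   # 0.0325 -> 3.25%
```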
3. The Duplicate-Free ciFAIR Test Dataset
- Each replacement candidate was inspected manually in a graphical user interface, which displayed the candidate and the three nearest neighbors in the feature space from the existing training and test sets.
- A candidate is approved for inclusion in the new test set only if it is not considered a duplicate of any of its three nearest neighbors (see the sketch after this list).
- The modified datasets are called ciFAIR-10 and ciFAIR-100 (“fair CIFAR”).
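A minimal sketch of this vetting step, assuming features for the replacement candidate and for all existing CIFAR images have already been extracted (names are illustrative):

```python
import numpy as np

def nearest_neighbors(candidate_feat, existing_feats, k=3):
    """Indices of the k nearest existing CIFAR images (training + test) to a
    replacement candidate in feature space. The candidate is accepted only if
    the annotator judges none of these neighbors to be a duplicate of it."""
    dists = np.linalg.norm(existing_feats - candidate_feat, axis=1)
    return np.argsort(dists)[:k]
```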
4. Experimental Results
- Re-evaluating different kinds of networks such as ResNet, WRN, DenseNet, ResNeXt, and PyramidNet on the duplicate-free test sets reveals a significant drop in classification accuracy of between 9% and 14% relative to the original performance (a re-evaluation sketch follows below).
- It is surprising that there are duplicates in the training and test sets of the CIFAR datasets.
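A hedged sketch of such a re-evaluation in PyTorch; the pretrained model and the two test loaders (original CIFAR and ciFAIR) are assumed to exist and are not part of the paper's released code:

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    """Top-1 accuracy of a pretrained classifier on a given test loader."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Evaluate the same pretrained model on both test sets and compare:
# acc_orig = top1_accuracy(model, cifar_test_loader)   # original CIFAR test set
# acc_fair = top1_accuracy(model, cifair_test_loader)  # duplicate-free ciFAIR test set
```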
Reference
[2020 MDPI J. Imaging] [ciFAIR]
Do We Train on Test Data? Purging CIFAR of Near-Duplicates
Image Classification
1989–2019 … 2020: [Random Erasing (RE)] [SAOL] [AdderNet] [FixEfficientNet] [BiT] [RandAugment] [ImageNet-ReaL] [ciFAIR]
2021: [Learned Resizer] [Vision Transformer, ViT] [ResNet Strikes Back] [DeiT] [EfficientNetV2] [MLP-Mixer] [T2T-ViT] [Swin Transformer]