Review — ciFAIR: Do We Train on Test Data? Purging CIFAR of Near-Duplicates
Do We Train on Test Data? Purging CIFAR of Near-Duplicates,
ciFAIR, by Friedrich Schiller University Jena
2020 MDPI J. Imaging (Sik-Ho Tsang @ Medium)
Image Classification, Image Dataset, CIFAR
- About 3.3% of the CIFAR-10 test images and 10% of the CIFAR-100 test images are found to have duplicates in the training set. This raises the question of whether models score well on these images partly by memorization rather than generalization.
- The “fair CIFAR” (ciFAIR) dataset is constructed, where all duplicates in the test sets are replaced with new images sampled from the same domain.
- CIFAR Duplicates
- Duplicate Statistics
- The Duplicate-Free ciFAIR Test Dataset
- Experimental Results
1. CIFAR Duplicates
- The figures above show duplicates found in the test set.
- To find the duplicates, a GUI is developed that displays each candidate pair side by side.
- Using this GUI, the annotator can inspect the test image and its candidate duplicate, their distance in the feature space, and a pixel-wise difference image, and then assign the pair to one of the following categories:
- Exact Duplicate: Almost all pixels in the two images are approximately identical.
- Near-Duplicate: The content of the images is exactly the same, i.e., both originated from the same camera shot. However, different post-processing might have been applied to this original scene, e.g., color shifts, translations, scaling, etc.
- Very Similar: The contents of the two images are different but highly similar, so that the difference can only be spotted at second glance.
- Different: The pair does not belong to any other category.
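The inspection step above relies on two quantities: the distance between a test image and its closest training image in a feature space, and a pixel-wise difference image. Below is a minimal sketch of how both could be computed with NumPy. It is not the authors' tool: the function names are hypothetical, the feature vectors are assumed to come from some image feature extractor, and cosine distance is my own choice of metric.

```python
import numpy as np

def nearest_training_neighbors(test_feats, train_feats):
    """For each test feature vector, return the index of and the cosine
    distance to the closest training feature vector (assumed metric)."""
    # L2-normalize so that a dot product equals cosine similarity.
    test_n = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    train_n = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = test_n @ train_n.T                      # shape: (n_test, n_train)
    nn_idx = sims.argmax(axis=1)                   # most similar training image
    nn_dist = 1.0 - sims[np.arange(len(test_n)), nn_idx]
    return nn_idx, nn_dist

def difference_image(img_a, img_b):
    """Pixel-wise absolute difference of two uint8 images of equal shape."""
    # Cast to a signed type first so the subtraction cannot wrap around.
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    return diff.astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_feats = rng.normal(size=(5, 8))          # stand-in features
    test_feats = np.vstack([train_feats[3] * 2.0,  # scaled copy of a training vector
                            rng.normal(size=8)])   # unrelated vector
    idx, dist = nearest_training_neighbors(test_feats, train_feats)
    # Pairs with the smallest distance would be shown to the annotator first.
    for t in np.argsort(dist):
        print(f"test {t} -> train {idx[t]}, cosine distance {dist[t]:.4f}")
```

The final category (exact duplicate, near-duplicate, very similar, different) is still a human decision; the code only ranks the candidate pairs and produces the visual aid.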
2. Duplicate Statistics
- It is worth noting that there are no exact duplicates in CIFAR-10 at all, as opposed to CIFAR-100.
- There are 891 duplicates from the CIFAR-100 test set in the training set and 104 duplicates within the test set itself. In total, about 10% of the 10,000 test images have duplicates.
- The situation is slightly better for CIFAR-10, where there are 286 duplicates in the training set and 39 in the test set, amounting to 3.25% of the test set.
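As a quick check of the percentages above, the counts can be summed and divided by the test set size. This is only arithmetic on the numbers quoted in this section; the dictionary layout and function name are my own, and summing the two counts assumes they refer to distinct test images, which matches the totals stated above.

```python
TEST_SET_SIZE = 10_000  # CIFAR-10 and CIFAR-100 each have 10,000 test images

# Duplicate counts quoted above: test images with a duplicate in the
# training set ("train") and duplicates within the test set itself ("test").
duplicates = {
    "CIFAR-10": {"train": 286, "test": 39},
    "CIFAR-100": {"train": 891, "test": 104},
}

def duplicate_share(counts, test_set_size=TEST_SET_SIZE):
    """Fraction of the test set affected by duplicates."""
    return (counts["train"] + counts["test"]) / test_set_size

for name, counts in duplicates.items():
    print(f"{name}: {duplicate_share(counts):.2%}")
```

This gives 3.25% for CIFAR-10 and 9.95% for CIFAR-100, which is the "10%" figure after rounding.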
3. The Duplicate-Free ciFAIR Test Dataset
- Each replacement candidate was inspected manually in a graphical user interface, which displayed the candidate and the three nearest neighbors in the feature space from the existing training and test sets.
- A candidate is approved for inclusion in the new test set only when it is not considered a duplicate of any of its three nearest neighbors.
- The modified datasets are named ciFAIR-10 and ciFAIR-100 (“fair CIFAR”).
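The approval rule for replacement images can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: `three_nearest` and `approve` are hypothetical names, Euclidean distance is assumed as the feature-space metric, and the duplicate verdicts are supplied by a human annotator rather than computed.

```python
import numpy as np

def three_nearest(candidate_feat, pool_feats, k=3):
    """Indices of the k pool vectors closest to the candidate, nearest first.
    The pool stands for features of the existing training and test images."""
    dists = np.linalg.norm(pool_feats - candidate_feat, axis=1)
    return np.argsort(dists)[:k]

def approve(neighbor_verdicts):
    """Approve a candidate only if the annotator flagged none of its
    nearest neighbors as a duplicate (True means 'is a duplicate')."""
    return not any(neighbor_verdicts)

if __name__ == "__main__":
    pool = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
    candidate = np.array([0.1, 0.0])
    neighbors = three_nearest(candidate, pool)
    print("neighbors to show the annotator:", neighbors.tolist())
    # Verdicts would come from manual inspection in the GUI.
    print("approved:", approve([False, False, False]))
```

A single positive verdict among the three neighbors is enough to reject the candidate, so the rule errs on the side of keeping the new test set clean.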
It is surprising that duplicates exist between the training and test sets of the CIFAR datasets.
[2020 MDPI J. Imaging] [ciFAIR]
Do We Train on Test Data? Purging CIFAR of Near-Duplicates