Review — ImageNet-21K Pretraining for the Masses

Semantic Softmax Training Scheme Enhances ImageNet-21K Pretraining

Sik-Ho Tsang
6 min read · Oct 24, 2022

ImageNet-21K Pretraining for the Masses,
ImageNet-21K Pretraining, by DAMO Academy, Alibaba Group,
2021 NeurIPS, Over 100 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Pretraining, ImageNet

  • Efficient pretraining on ImageNet-21K is proposed with a dedicated preprocessing stage, utilization of WordNet hierarchical structure, and a novel training scheme called semantic softmax.


  1. Problems of Current ImageNet-21K Pretraining
  2. Proposed Pretraining
  3. Experimental Results

1. Problems of Current ImageNet-21K Pretraining

  • ImageNet-21K pretraining is far less common than ImageNet-1K pretraining, despite the dataset being much larger. A main reason for this discrepancy is that ImageNet-21K labels are not mutually exclusive — the labels are taken from WordNet.
  • A picture, with an actual chair, can sometimes be labeled as “chair”, but sometimes be labeled as the semantic parent of “chair”, “furniture”. This kind of tagging methodology complicates the training process.
  • However, previous works have not methodologically studied and optimized a pretraining process specifically for ImageNet-21K.
  • Another challenge of the ImageNet-21K dataset is the lack of an official train-validation split.
  • The raw dataset is large — 1.3TB.

2. Proposed Pretraining

Left: Preprocessing, Middle: WordNet Hierarchical Structure, Right: Semantic Softmax
  • The complete end-to-end pretraining pipeline appears in the above figure.

2.1. Preprocessing ImageNet-21K

  • Step 1 — Cleaning invalid classes: The dataset has no official train-validation split, and the classes are not well-balanced.
  • Classes with few samples cannot be learned efficiently, and hurt the performance.

Thus, infrequent classes, with fewer than 500 samples, are removed. The resulting dataset contains 12,358,688 images from 11,221 classes. Notice that the cleaning process reduced the number of total classes by half, but removed only 13% of the original pictures.

  • Step 2 — validation split: 50 images are allocated per class for a standardized validation split, that can be used for future benchmarks and comparisons.
  • Step 3 — image resizing: The original images are at full resolution, so previous works resized them on-the-fly during training. Instead, all images are resized to 224 resolution once, during the preprocessing stage. This significantly reduces the dataset’s memory footprint, from 1.3TB to 250GB, and makes loading the data during training faster.

This processed dataset is called ImageNet-21K-P (P for Processed).
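The first two preprocessing steps can be sketched roughly as follows (the function and constant names are my own, not from the paper’s released code; step 3 — resizing to 224 — is omitted and would use an image library such as PIL):

```python
import random
from collections import defaultdict

MIN_SAMPLES = 500    # step 1: classes with fewer samples are removed
VAL_PER_CLASS = 50   # step 2: images reserved per class for validation

def preprocess_split(samples, min_samples=MIN_SAMPLES,
                     val_per_class=VAL_PER_CLASS, seed=0):
    """samples: list of (image_path, class_name) pairs.
    Returns (train, val) lists after dropping infrequent classes."""
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)
    rng = random.Random(seed)
    train, val = [], []
    for cls, paths in sorted(by_class.items()):
        if len(paths) < min_samples:
            continue  # step 1: drop the infrequent class entirely
        rng.shuffle(paths)
        val += [(p, cls) for p in paths[:val_per_class]]    # step 2
        train += [(p, cls) for p in paths[val_per_class:]]
    return train, val
```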

2.2. Hierarchical Label Tree

Examples of classes from different ImageNet-21K-P hierarchies.
  • For a ‘swan’ image, by utilizing the semantic tree, we can produce a list of semantic labels for the image — ’animal, vertebrate, bird, aquatic bird, swan’. Notice that the labels are sorted by hierarchy: the ’animal’ label belongs to hierarchy 0, while the ’swan’ label belongs to hierarchy 4. A label from hierarchy k has k ancestors.
Example of inconsistent tagging in ImageNet-21K dataset.
  • e.g.: Two pictures above, that contain the animal cow, were labeled differently — one with the label ‘animal’, the other with the label ’cow’.
  • This kind of incomplete tagging methodology, which is common in large datasets [32, 42], hinders and complicates the training process.

By using WordNet synsets, ImageNet-21K-P has 11 possible hierarchies. The number of classes per hierarchy is presented in the paper and used for pretraining.
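The swan example above can be reproduced with a toy parent map standing in for the real WordNet hypernym structure (the map below is illustrative, not the dataset’s actual tree):

```python
# Toy parent map: each class points to its semantic parent.
PARENT = {
    "swan": "aquatic bird",
    "aquatic bird": "bird",
    "bird": "vertebrate",
    "vertebrate": "animal",
    "animal": None,  # hierarchy 0 (root)
}

def semantic_labels(label, parent=PARENT):
    """Return the label chain sorted from hierarchy 0 down to the label."""
    chain = []
    while label is not None:
        chain.append(label)
        label = parent[label]
    return chain[::-1]

def hierarchy_of(label, parent=PARENT):
    """A label from hierarchy k has k ancestors."""
    return len(semantic_labels(label, parent)) - 1
```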

2.3. Semantic Softmax Pretraining

  • Two baseline schemes and the proposed pretraining scheme are described here.

2.3.1. Single-Label Training Scheme

  • The straightforward way to pretrain on ImageNet-21K-P is to use the original (single) labels, apply softmax on the output logits, and use cross-entropy loss.
  • However, we are not guaranteed that an image was labeled at the highest possible hierarchy.
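As a minimal sketch (pure Python, one sample, no batching — a real implementation would use a framework’s batched cross-entropy), the single-label baseline is just a softmax over all classes followed by cross-entropy:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def single_label_ce(logits, target):
    """Plain cross-entropy on one softmax covering all classes."""
    return -math.log(softmax(logits)[target])
```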

2.3.2. Multi-Label Training Scheme

  • Given N labels, the base network outputs one logit per label, zn, and each logit is independently activated by a sigmoid function σ(zn). Let’s denote yn as the ground-truth for class n. The total classification loss, Ltot, is obtained by aggregating a binary loss L over the N labels: Ltot = Σn L(σ(zn), yn).
  • Each class is learned separately, which amounts to extreme multi-task learning. This makes the optimization process harder and less efficient, and may cause convergence to a local minimum.
  • In addition, there is a large positive-negative imbalance, where on average, classes from a lower hierarchy appear far more frequently than classes from a higher hierarchy.
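A minimal sketch of the multi-label loss on one sample (pure Python; a real implementation would use a batched framework loss such as BCE-with-logits), with binary cross-entropy as the per-label binary loss:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multi_label_loss(logits, targets):
    """Ltot: sum over all N labels of binary cross-entropy on sigma(z_n)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total
```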

2.3.3. Proposed Semantic Softmax Training Scheme

Gradient propagation logic of semantic softmax training.
  • To deal with the partial tagging of ImageNet-21K-P, not all softmax layers propagate gradients from each sample. Instead, only the softmax layers of the relevant hierarchies are activated.
  • Due to the semantic structure, a sample labeled at hierarchy h activates the softmax layers of hierarchies 0 to h. Hence the relative number of occurrences of hierarchy k in the loss function will be: Ok = (number of samples labeled at hierarchy k or deeper) / (total number of samples).
  • A normalization factor Wk = 1/Ok can then be used for each hierarchy k, and a balanced aggregated loss can be obtained: Ltot = Σk Wk·Lk, where Lk is the cross-entropy loss of the hierarchy-k softmax.
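A sketch of the scheme for one sample (pure Python; the function names are my own): each hierarchy has its own softmax, hierarchies beyond the sample’s labeled hierarchy are skipped, and each active hierarchy is weighted by Wk = 1/Ok. The `occurrence_weights` helper computes Ok as the fraction of samples whose label chain reaches hierarchy k — my reading of the occurrence counts described above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def occurrence_weights(samples_per_hierarchy):
    """samples_per_hierarchy[k]: number of samples labeled at hierarchy k.
    Ok: fraction of samples activating hierarchy k (labeled at k or deeper).
    Returns Wk = 1 / Ok for each hierarchy."""
    total = sum(samples_per_hierarchy)
    return [total / sum(samples_per_hierarchy[k:])
            for k in range(len(samples_per_hierarchy))]

def semantic_softmax_loss(hier_logits, hier_targets, weights):
    """hier_logits[k]: logits over the classes of hierarchy k.
    hier_targets[k]: target index in hierarchy k, or None when the sample's
    label chain stops before k (that softmax gets no gradient)."""
    total = 0.0
    for logits, target, w in zip(hier_logits, hier_targets, weights):
        if target is None:
            continue  # softmax of this hierarchy is not activated
        total += -w * math.log(softmax(logits)[target])
    return total
```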

2.4. Semantic Knowledge Distillation (KD)

  • Knowledge distillation (KD) is a known method to generate soft labels. In this case, it can predict the missing tags that arise from the inconsistent tagging. For the above ‘cow’ example, the teacher model can predict the missing labels — ‘cow, placental, mammal, vertebrate’.
  • To implement the semantic KD loss, for each hierarchy, both the teacher and the student compute the corresponding probability distributions {Ti}, {Si}, where i ranges from 0 to K−1.
  • The KD loss of hierarchy i will be: KDLossi = MSE(Ti, Si), where MSE is the mean squared error.
  • For each hierarchy, the teacher confidence level, Pi, is calculated. A confidence-weighted KD loss will be: LKD = Σi Pi·KDLossi.
  • Pi is calculated as follows: if the ground-truth highest hierarchy is higher than i, set Pi to 1. Else, set Pi to the sum of the probabilities of the top 5% of classes in the teacher’s prediction for hierarchy i.
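A sketch of the confidence-weighted semantic KD loss (the function name is my own; Ti and Si are per-hierarchy probability vectors from the teacher and student):

```python
def semantic_kd_loss(teacher_probs, student_probs, confidences):
    """Sum over hierarchies i of Pi * MSE(Ti, Si)."""
    total = 0.0
    for T, S, p in zip(teacher_probs, student_probs, confidences):
        mse = sum((t - s) ** 2 for t, s in zip(T, S)) / len(T)
        total += p * mse
    return total
```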

3. Experimental Results

Comparing downstream results for different pretraining schemes.

On 6 out of 7 datasets tested, semantic softmax pretraining outperforms both single-label and multi-label pretraining.

  • In addition, single-label pretraining performs better than multi-label pretraining (scores are higher on 5 out of 7 datasets tested).
Comparing downstream results with the official ImageNet-21K pretrained weights of ViT and Mixer.

The proposed pretraining scheme significantly outperforms the official ImageNet-21K pretrained weights by ViT and Mixer, on all downstream tasks.

  • Using semantic softmax pretraining, the transfer learning training was more stable and robust, and reached higher accuracy.
Comparing downstream results for ImageNet-1K standard pretraining, and the proposed ImageNet-21K-P pretraining scheme.

The proposed pretraining scheme significantly outperforms standard ImageNet-1K pretraining on all datasets, for all models tested.

  • For example, on the iNaturalist dataset, the average top-1 accuracy improves by 2.9%.
  • Notice that some previous works stated that pretraining on a large dataset benefits only large models.

Now, even small mobile-oriented models, like MobileNetV3 and OFA-595, can benefit from pretraining on a large (publicly available) dataset like ImageNet-21K-P.

A large dataset is good. With data cleansing and a proper loss function, it becomes even better.

