Review — Fair Visual Recognition in Limited Data Regime using Self-Supervision and Self-Distillation

LDBM, Self-Supervised Learning (SSL) + Self-Distillation (SD)

Sik-Ho Tsang
6 min read · May 23, 2023

Fair Visual Recognition in Limited Data Regime using Self-Supervision and Self-Distillation,
LDBM, by IIT Kanpur, IIT Roorkee, and University of Bath,
2022 WACV (@ Medium)


  • Bias mitigation techniques typically assume that a sufficiently large number of training examples is available. When the training data is limited, their effectiveness degrades severely.
  • Self-supervision and self-distillation are proposed to reduce the impact of biases on the model, reportedly for the first time.


  1. CIFAR-10 Dataset with Bias
  2. Limited Data Bias Mitigation approach (LDBM)
  3. Results

1. CIFAR-10 Dataset with Bias

1.1. CIFAR-10S (CIFAR-10 Skewed) [27]

  • CIFAR-10S is proposed by [27], which retains the 10 classes and 50,000 training images of the CIFAR-10 [16] dataset but modifies the training images to introduce a bias using two domains, i.e., color and grayscale.
  • The color domain has five classes, each containing 95% color and 5% grayscale images (95–5% skew).
  • The remaining five classes belong to the grayscale domain and contain 95% grayscale and 5% color images each (95–5% skew).
  • CIFAR-10S has two separate test datasets that basically are the color and grayscale copies of the entire CIFAR-10 test set, respectively.

A model has to be de-biased to perform well on both test sets.
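The skewed construction described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `assign_domains` and `to_grayscale`, the choice of classes 0–4 as the color domain, and the luma weights used for grayscale conversion are all hypothetical and not taken from the paper.

```python
import random

def assign_domains(labels, color_classes, skew=0.95, seed=0):
    """Return one domain tag ("color" or "gray") per training label.

    For each class, a fraction `skew` of its images gets the class's
    dominant domain and the remainder gets the other domain.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    domains = [None] * len(labels)
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        if y in color_classes:
            dominant, minority = "color", "gray"
        else:
            dominant, minority = "gray", "color"
        n_dominant = round(skew * len(idxs))
        for pos, idx in enumerate(idxs):
            domains[idx] = dominant if pos < n_dominant else minority
    return domains

def to_grayscale(pixel):
    """Convert one (r, g, b) pixel to gray (v, v, v); BT.601 luma weights are an assumption."""
    r, g, b = pixel
    v = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (v, v, v)

# Toy label list: 10 classes with 100 images each (CIFAR-10 has 5,000 per class).
labels = [c for c in range(10) for _ in range(100)]
# Assumption: classes 0-4 form the color domain, classes 5-9 the grayscale domain.
domains = assign_domains(labels, color_classes={0, 1, 2, 3, 4})
```

Images tagged "gray" would then be passed through `to_grayscale` pixel by pixel, while images tagged "color" stay unchanged.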

1.2. Proposed L-CIFAR-10S (Limited CIFAR-10 Skewed)

  • A limited data version (L-CIFAR-10S) of the CIFAR-10S dataset is proposed, obtained by reducing the number of images per class to 5%.

This dataset represents the limited (low) data regime, which makes bias mitigation even more difficult.
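A minimal sketch of the per-class reduction to 5% follows. The function name `subsample_per_class` is hypothetical, and uniform random sampling within each class is an assumption; the paper may select the retained images differently (for example, so that the 95–5% skew is preserved exactly).

```python
import random

def subsample_per_class(labels, fraction=0.05, seed=0):
    """Return sorted indices keeping `fraction` of the images of every class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    kept = []
    for y in sorted(by_class):
        idxs = by_class[y]
        n_keep = max(1, round(fraction * len(idxs)))
        # Assumption: images are drawn uniformly at random within the class.
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)

# CIFAR-10 training labels: 10 classes with 5,000 images each.
labels = [c for c in range(10) for _ in range(5000)]
kept = subsample_per_class(labels)
print(len(kept))  # 10 classes x 250 images = 2500
```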

1.3. Effect of Biases with Limited vs. Full Data

ResNet-18 model trained on (a) CIFAR-10S (b) L-CIFAR-10S.
  • When the baseline ResNet-18 network is trained on the L-CIFAR-10S dataset, the resulting model achieves a classification accuracy of 41.04% on the color test set. This is significantly lower than the 89.0% achieved by training on the CIFAR-10S dataset.
  • If all the training images in L-CIFAR-10S are converted to grayscale, an accuracy of 65.57% is obtained.
  • If the ResNet-18 model is instead trained on an all-color version, an accuracy of 66.53% is obtained.

The bias in L-CIFAR-10S plays a major role in the significantly lower performance of the baseline (65.57/66.53%→41.04%).

2. Limited Data Bias Mitigation approach (LDBM)

  • Limited Data Bias Mitigation approach (LDBM) is proposed, which uses (2.1) self-supervision and (2.2) self-distillation for bias mitigation.

These techniques help the model learn useful/meaningful and discriminative features from the training data, thereby minimizing the impact of irrelevant features on the model.

  • Consequently, if the training data suffers from biases due to any unwanted correlations, these techniques will also prevent the model from learning such biases, which is the objective of bias mitigation.
  • In the limited data setting, this problem becomes even more severe.

2.1. SimSiam: Self-Supervised Learning

  • (If you know SimSiam well, please skip this part.)
  • The two views of the same image are processed by an encoder (ESIM) consisting of a backbone network and a multi-layer perceptron projection head.
  • The encoder shares the same weights between the two views. Let x1i and x2i denote two randomly augmented views of the same image xi. The embedding of view 2 is z2i = ESIM(x2i).
  • A multi-layer perceptron prediction head (PSIM) transforms the encoder output of the first view to match the second (target) view: p1i = PSIM(ESIM(x1i)).
  • The negative cosine similarity loss is used: D(p1i, z2i) = −(p1i / ‖p1i‖2) · (z2i / ‖z2i‖2).
  • Transforming the second view to match the first view as well gives the final, symmetrized SimSiam loss: LSIM = ½ D(p1i, z2i) + ½ D(p2i, z1i).
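The symmetrized negative cosine similarity loss described above can be sketched in plain Python. This only shows the forward computation on hypothetical 3-dimensional vectors standing in for the encoder outputs (z) and predictor outputs (p). In SimSiam the target embedding is additionally wrapped in a stop-gradient, which has no counterpart here because this sketch has no automatic differentiation.

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity D(p, z); z plays the role of the fixed target."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized loss: 0.5 * D(p1, z2) + 0.5 * D(p2, z1)."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Hypothetical embeddings (z) and predictions (p) for the two views.
z1, z2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
p1, p2 = [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]

# Each prediction points in the same direction as the other view's embedding,
# so both cosine similarities are 1 and the loss reaches its minimum of -1.
loss = simsiam_loss(p1, z1, p2, z2)
```

Because the loss depends only on the angle between vectors, scaling `p1` or `p2` leaves it unchanged.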

2.2. Self-Distillation from a Trained Teacher

  • The Kullback–Leibler (KL) divergence between the logits/soft probability outputs of the teacher and student networks is minimized (as shown in Fig. 3). This transfers knowledge from the teacher to the student network and also improves the generalization capacity of the student network.
  • The loss function LSD is this KL divergence, computed after applying σ(.) to the teacher and student logits, where σ(.) denotes the softmax activation that transforms logits into probabilities.
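A minimal sketch of a KL-based self-distillation loss follows. Two points are assumptions rather than details confirmed by the text: the direction KL(teacher ‖ student), which is the usual choice in knowledge distillation, and the absence of a temperature parameter. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def self_distillation_loss(teacher_logits, student_logits):
    """Assumed form: KL(softmax(teacher) || softmax(student)) for one sample."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# A student that agrees with the teacher incurs (near) zero loss,
# while a student that disagrees is penalized.
agree = self_distillation_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
disagree = self_distillation_loss([2.0, 0.0], [0.0, 2.0])
```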

2.3. LDBM

  • A teacher network ΘT is trained using the cross-entropy loss for the standard classification task and the SimSiam auxiliary self-supervised task loss.
  • The total loss LT for training the teacher network combines these two losses over the N training samples.
  • A student network ΘS, which has the same architecture as the teacher network ΘT, is then trained using the cross-entropy loss for the standard classification task, the SimSiam auxiliary self-supervised task loss, and the self-distillation loss from the trained teacher network ΘT.
  • The total loss LS for training the student network combines these three losses.
  • After training is completed, the resulting student network is the final model.
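One plausible way to write the two objectives is given below. This is a sketch under assumptions, not the paper's exact equations: the terms are summed without weighting coefficients and averaged over the N training samples, and the symbols L_CE (cross-entropy) and y_i (label of sample x_i) are introduced here for illustration. The original formulation may weight or normalize the terms differently.

```latex
\begin{aligned}
L_T &= \frac{1}{N}\sum_{i=1}^{N}\Big[\, L_{CE}\big(\Theta_T(x_i),\, y_i\big) + L_{SIM}(x_i) \,\Big] \\
L_S &= \frac{1}{N}\sum_{i=1}^{N}\Big[\, L_{CE}\big(\Theta_S(x_i),\, y_i\big) + L_{SIM}(x_i) + L_{SD}(x_i) \,\Big]
\end{aligned}
```

The only difference between the two stages is the extra self-distillation term in the student objective, which requires the teacher to be trained first.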

3. Results

3.1. Dataset Variants

  • CIFAR-10S-i: images of the same classes taken from ImageNet [21] and downsampled to 32×32. (I am not entirely clear on this part. Please feel free to tell me if anyone knows.)
  • CIFAR-10S-c28: images cropped to 28×28 and upsampled to 32×32.
  • CIFAR-10S-d16: images downsampled to 16×16 and upsampled to 32×32.
  • CIFAR-10S-d8: images downsampled to 8×8 and upsampled to 32×32.
  • L-CIFAR-10S-i, L-CIFAR-10S-c28, L-CIFAR-10S-d16, and L-CIFAR-10S-d8: limited data versions of the above datasets.
  • L-CelebA: CelebA with the training set reduced to its first 5% of images.
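The downsample-then-upsample variants (d16, d8) can be illustrated on a toy single-channel image. The interpolation methods are assumptions: average pooling for downsampling and nearest-neighbor for upsampling are used here only because they are easy to write in plain Python, and the paper may use different resampling. All function names are hypothetical.

```python
def downsample(img, factor):
    """Average-pool a square 2D image (list of rows) by an integer factor."""
    n = len(img) // factor
    return [
        [
            sum(img[r * factor + dr][c * factor + dc]
                for dr in range(factor) for dc in range(factor)) / factor ** 2
            for c in range(n)
        ]
        for r in range(n)
    ]

def upsample(img, factor):
    """Nearest-neighbor upsampling of a 2D image by an integer factor."""
    return [
        [img[r // factor][c // factor] for c in range(len(img[0]) * factor)]
        for r in range(len(img) * factor)
    ]

def degrade(img, low_res):
    """Downsample to low_res x low_res, then upsample back to the original size."""
    factor = len(img) // low_res
    return upsample(downsample(img, factor), factor)

# Toy 32x32 single-channel image with a simple intensity pattern.
img = [[(r * 32 + c) % 256 for c in range(32)] for r in range(32)]
d16 = degrade(img, 16)  # analogous to the "-d16" variant
d8 = degrade(img, 8)    # analogous to the "-d8" variant
```

The output keeps the 32×32 size but loses detail: every 2×2 block of `d16` and every 4×4 block of `d8` holds a single value.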

3.2. Performance Comparison

  • L-CIFAR-10S dataset variants (Tables 1–5): ResNet-18 is used.

The proposed LDBM approach significantly improves the performance of the baseline model for all L-CIFAR-10S dataset variants.

  • L-CelebA (Table 6): ResNet-50 architecture pretrained on ImageNet is used.

The LDBM approach significantly improves on both the baseline approach and the domain-independent training approach in terms of average mAP across attributes and the bias amplification score.

3.3. Ablation Studies


Figure 4: The baseline with LDBM reaches a lower test loss than the baseline alone.

Figure 5: The bias amplification score decreases as the model is trained.


Table 7: Both the auxiliary self-supervision and the self-distillation components are essential.

Table 8: SimSiam is better than RotNet and SimCLR.

Figure 6: As the level of skew increases, the performance of the baseline decreases significantly due to an increase in the bias in the data, as expected. Also, LDBM can complement other bias mitigation approaches.


