Brief Review — SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation

SimReg, Distillation for SSL

Sik-Ho Tsang
3 min read · Sep 13


SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation
by University of Maryland and University of California
2021 BMVC (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM]
==== My Other Paper Readings Are Also Over Here ====

  • When distilling an SSL teacher model, it is found that adding a multi-layer perceptron (MLP) head to the student backbone is beneficial.
  • A deeper MLP can be used to mimic the teacher more accurately without changing the inference architecture or runtime.
  • Independent projection heads can be used to simultaneously distill multiple teacher networks.


  1. SimReg
  2. Results

1. SimReg

SimReg Pipeline
  • We have a trained teacher model T. We would like to distill it into a student model, usually a smaller one, by minimizing a regression loss Lreg:

Lreg = d(ft(x), fs(x))

  • where ft is the feature from the teacher and fs is the feature from the student. Minimizing the loss pulls fs closer to ft.
  • d(·, ·) here is the distance metric: the squared Euclidean distance between l2-normalized features.
  • A prediction MLP g(·) is added after the student backbone, so the distilled student feature becomes g(fs(x)). This is found to improve performance.
  • During inference, the prediction head g(·) is removed, so the inference architecture and cost are unchanged.
  • For the multi-teacher scenario, the loss is the average of the regression losses over all teachers:

L = (1/K) · Σ_{k=1..K} Lreg^(k)

  • where K is the number of teachers.

2. Results

2.1. Ablation Studies

MLP Heads

Deeper MLP head improves the performance.
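The prediction head in this ablation is a plain stack of linear layers with ReLU in between. A minimal NumPy forward-pass sketch of a 4-layer head is below; the layer widths are illustrative, not the paper's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(dims):
    # One (W, b) pair per linear layer; e.g. dims=[8, 16, 16, 16, 32]
    # yields a 4-layer MLP mapping 8-d inputs to 32-d outputs.
    return [(rng.standard_normal((d_in, d_out)) * 0.02, np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    # ReLU between layers, no activation after the final layer.
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x
```

Since this head is dropped at inference time, making it deeper costs nothing at deployment, which is why the ablation can freely scale its depth.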

Augmentation Strategies

Using weak augmentation at both teacher and student models helps for distillation.

2.2. ImageNet

Different Teachers

The proposed simple regression (SimReg) performs comparably to, or even outperforms, state-of-the-art approaches such as CompRess on all settings and metrics.

2.3. Downstream Tasks

Downstream Classification
  • CompRess-2q-MLP is generally better on ImageNet classification (Table 3) but transfers poorly (Table 5) compared to CompRess-1q-MLP.

However, the same SimReg model performs comparably to or outperforms both of them, on ImageNet as well as on the transfer tasks.

Downstream PASCAL VOC

The use of deeper MLP heads during distillation does not aid detection performance. The performance of different distillation architectures is nearly identical on the detection task.

2.4. Multi-Teacher Distillation

Multi-Teacher Distillation
  • A single student model can be trained from multiple teacher networks trained with different SSL methods.

Regression with a 4 layer MLP head (right) significantly outperforms one with linear prediction (left).
