Brief Review — SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation

SimReg, Distillation for SSL

Sik-Ho Tsang
3 min read · Sep 13


SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation
by University of Maryland and University of California
2021 BMVC (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM]
==== My Other Paper Readings Are Also Over Here ====

  • When distilling an SSL teacher model, it is found that adding a multi-layer perceptron (MLP) head to the student backbone is beneficial.
  • A deeper MLP can be used to mimic the teacher more accurately without changing the inference architecture or runtime.
  • Independent projection heads can be used to simultaneously distill multiple teacher networks.


  1. SimReg
  2. Results

1. SimReg

SimReg Pipeline
  • We have a trained teacher model T. We would like to distill it into a student model, usually a smaller one, by minimizing a regression loss Lreg:

Lreg = d(ft(x), fs(x))

  • where ft is the feature from the teacher and fs is the feature from the student. Minimizing the loss pulls fs closer to ft.
  • d(·, ·) here is the distance metric: the squared Euclidean distance between l2-normalized features.
  • A prediction MLP g(·) is added after the student backbone, so the distilled student feature becomes g(fs(x)). This is found to improve performance.
  • During inference, the prediction head g(·) is removed, so the inference architecture and cost are unchanged.
  • For the multi-teacher scenario, the loss is the average of the regression losses over all teachers:

L = (1/K) · Σ_{k=1..K} Lreg^(k)

  • where K is the number of teachers.

2. Results

2.1. Ablation Studies

MLP Heads

Deeper MLP head improves the performance.
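The prediction head in this ablation is a plain stack of linear layers with ReLU in between. A minimal NumPy forward-pass sketch of a 4-layer head is below; the layer widths are illustrative, not the paper's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(dims):
    # One (W, b) pair per linear layer; e.g. dims=[8, 16, 16, 16, 32]
    # yields a 4-layer MLP mapping 8-d inputs to 32-d outputs.
    return [(rng.standard_normal((d_in, d_out)) * 0.02, np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    # ReLU between layers, no activation after the final layer.
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x
```

Since this head is dropped at inference time, making it deeper costs nothing at deployment, which is why the ablation can freely scale its depth.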

Augmentation Strategies

Using weak augmentation at both teacher and student models helps for distillation.

2.2. ImageNet

Different Teachers

The proposed simple regression (SimReg) performs comparably to, or even outperforms, state-of-the-art approaches such as CompRess on all settings and metrics.

2.3. Downstream Tasks

Downstream Classification
  • CompRess-2q-MLP is generally better on ImageNet classification (Table 3) but transfers poorly (Table 5) compared to CompRess-1q-MLP.

However, the same SimReg model performs comparably to or outperforms both of them, on ImageNet as well as on the transfer tasks.

Downstream PASCAL VOC

The use of deeper MLP heads during distillation does not aid detection performance. The performance of different distillation architectures is nearly identical on the detection task.

2.4. Multi-Teacher Distillation

Multi-Teacher Distillation
  • A single student model can be trained from multiple teacher networks trained with different SSL methods.

Regression with a 4 layer MLP head (right) significantly outperforms one with linear prediction (left).
