Brief Review — SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation
SimReg, by University of Maryland and University of California
2021 BMVC (Sik-Ho Tsang @ Medium)
- During distillation of an SSL teacher model, it is found that adding a multi-layer perceptron (MLP) head to the student backbone is beneficial.
- A deeper MLP can be used to more accurately mimic the teacher without changing the inference architecture or inference time.
- Independent projection heads can be used to simultaneously distill multiple teacher networks.
- We have the trained teacher model T. We would like to distill a student model, usually a smaller model, by minimizing a regression loss L_reg = d(f_t, f'_s),
- where f_t is the feature from the teacher and f'_s is the feature from the student. By minimizing the loss, we pull f'_s closer to f_t.
- d(·, ·) here is the distance metric: the squared Euclidean distance between l2-normalized features.
- A prediction MLP g(·) is added after the student backbone. It is found that this brings a performance improvement.
- During inference, the prediction head g(.) is removed.
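The loss above is just a squared Euclidean distance between l2-normalized features. A minimal NumPy sketch, assuming `f_t` holds teacher features and `f_s_pred` holds the output of the student's prediction head g(·) (function names here are illustrative, not from the paper's code):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each feature vector (last axis) to unit l2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def simreg_loss(f_t, f_s_pred):
    """Squared Euclidean distance between l2-normalized teacher
    features f_t and student predictions f_s_pred = g(f'_s),
    averaged over the batch."""
    diff = l2_normalize(f_t) - l2_normalize(f_s_pred)
    return np.mean(np.sum(diff ** 2, axis=-1))
```

For unit vectors this distance equals 2 − 2·(cosine similarity), so it is 0 when student predictions align with teacher features and at most 4 when they point in opposite directions.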
- For the multi-teacher scenario, the loss is the average of the losses over all teachers: L = (1/K) Σ_k L_reg^k,
- where K is the number of teachers.
2.1. Ablation Studies
- A deeper MLP head improves performance.
- Using weak augmentation for both the teacher and student models helps distillation.
- The proposed simple regression (SimReg) performs comparably to or even outperforms state-of-the-art approaches, e.g. CompRess, on all settings and metrics.
2.3. Downstream Tasks
- CompRess-2q-MLP is generally better on ImageNet classification (Table 3) but transfers poorly (Table 5) compared to CompRess-1q-MLP.
- However, the same SimReg model performs comparably to or outperforms both, on ImageNet as well as on transfer tasks.
- The use of deeper MLP heads during distillation does not aid detection performance; different distillation architectures perform nearly identically on the detection task.
2.4. Multi-Teacher Distillation
- A single student model can be trained from multiple teacher networks trained with different SSL methods.
- Regression with a 4-layer MLP head (right) significantly outperforms regression with a linear prediction head (left).
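Putting the pieces together, multi-teacher distillation attaches an independent prediction head g_k to the student for each teacher k and averages the K regression losses. A toy NumPy sketch under assumed shapes (student dim 8, two teachers with dims 16 and 32; the ReLU-MLP head and all names are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def mlp_head(x, weights):
    """Toy prediction head g_k: linear layers with ReLU between them."""
    for i, w in enumerate(weights):
        x = x @ w
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # ReLU on all but the last layer
    return x

def multi_teacher_loss(f_s, teacher_feats, heads):
    """L = (1/K) sum_k d(f_t^k, g_k(f_s)): average the regression loss
    over K teachers, each with its own independent prediction head."""
    losses = []
    for f_t, weights in zip(teacher_feats, heads):
        diff = l2_normalize(f_t) - l2_normalize(mlp_head(f_s, weights))
        losses.append(np.mean(np.sum(diff ** 2, axis=-1)))
    return sum(losses) / len(losses)

# Hypothetical example: batch of 4 student features, two teachers.
f_s = rng.normal(size=(4, 8))
teachers = [rng.normal(size=(4, 16)), rng.normal(size=(4, 32))]
heads = [[rng.normal(size=(8, 8)), rng.normal(size=(8, 16))],
         [rng.normal(size=(8, 8)), rng.normal(size=(8, 32))]]
loss = multi_teacher_loss(f_s, teachers, heads)
```

Because each head maps the shared student feature into that teacher's embedding space, teachers with different feature dimensions (or different SSL methods) can be distilled into one student simultaneously.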