Brief Review — Distilling Visual Priors from Self-Supervised Learning
Using MoCo v2 as Teacher, Knowledge Distillation for Student, in VIPriors Challenge
Distilling Visual Priors from Self-Supervised Learning
MoCo v2 + Distillation, by Tongji University and Megvii Research Nanjing
2020 ECCV Workshop VIPriors Challenge (Sik-Ho Tsang @ Medium)
- This is a paper from the “Visual Inductive Priors for Data-Efficient Computer Vision” (VIPriors) Challenge at the 2020 ECCV Workshop, which targets the data-deficient scenario.
- The first phase is to learn a teacher model using MoCo v2.
- The second phase distills the representations into a student model in a self-distillation manner.
1. Proposed Framework
- There are 2 phases. Phase-1 for teacher and Phase-2 for student.
1.1. Phase-1: Teacher
- MoCo v2 is used to train the backbone in a self-supervised manner for 800 epochs.
- The original loss used by MoCo v2 is the InfoNCE contrastive loss:

  L_q = −log[ exp(q·k+/τ) / Σ_(i=0)^K exp(q·k_i/τ) ]

- where q is the encoded query, k+ is the positive key, {k_i} are the keys in the queue, and τ is a temperature hyperparameter.
- MoCo v2 uses a momentum encoder θk to encode all the keys k and puts them in a queue for negative sampling. The momentum encoder is a momentum average of the query encoder θq:

  θk ← m·θk + (1 − m)·θq

- where m ∈ [0, 1) is the momentum coefficient.
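The momentum (exponential moving average) update of the key encoder can be sketched in a few lines of NumPy; the momentum coefficient value and the toy parameter shapes below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.999):
    """EMA update of the key encoder from the query encoder:
    theta_k <- m * theta_k + (1 - m) * theta_q."""
    return m * theta_k + (1 - m) * theta_q

# Toy parameter vectors (illustrative only):
theta_q = np.ones(4)    # query encoder weights
theta_k = np.zeros(4)   # key (momentum) encoder weights
theta_k = momentum_update(theta_k, theta_q)  # each entry becomes 0.001
```

With a large m, the key encoder evolves slowly, which keeps the keys in the queue consistent across iterations.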
- In a data-deficient dataset, the maximum useful size of the queue is limited, so the authors propose adding a margin to the original loss function. The margin enlarges the separation between data samples, helping the model reach similar results with fewer negative examples.
- The queue size is 4096.
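A minimal NumPy sketch of the margin-modified contrastive loss follows. The exact placement of the margin (subtracted from the positive logit, in the style of margin softmax losses) is my reading of the proposal; the temperature, margin value, and embedding dimensions are illustrative assumptions:

```python
import numpy as np

def margin_info_nce(q, k_pos, queue, tau=0.2, margin=0.4):
    """InfoNCE loss with a margin subtracted from the positive logit:
    L = -log( exp((q.k+ - margin)/tau) /
              (exp((q.k+ - margin)/tau) + sum_i exp(q.k_i/tau)) )
    q, k_pos: (d,) unit vectors; queue: (K, d) unit negative keys."""
    pos = (q @ k_pos - margin) / tau
    negs = (queue @ q) / tau
    logits = np.concatenate(([pos], negs))
    logits -= logits.max()                     # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

rng = np.random.default_rng(0)
q = rng.normal(size=128)
q /= np.linalg.norm(q)
queue = rng.normal(size=(4096, 128))           # queue size 4096, as in the paper
queue /= np.linalg.norm(queue, axis=1, keepdims=True)

plain = margin_info_nce(q, q, queue, margin=0.0)  # standard InfoNCE
hard = margin_info_nce(q, q, queue, margin=0.4)   # margin makes the positive harder
```

Subtracting the margin shrinks the positive logit, so the model must push the positive pair further above the negatives to reach the same loss value.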
1.2. Phase-2: Self-Distillation on Labeled Dataset
- The distillation process can be seen as a regularizer that prevents the student from overfitting the small training set and gives the student a more diverse representation for classification.
- Following OFD, the distillation loss penalizes the distance d_p between the teacher's and the student's feature maps.
- The distance metric d_p is the l2-distance in this paper.
- Along with a cross-entropy loss L_ce for classification, the final loss function for the student model is:

  L = L_ce + λ·L_distill
- λ=10^(−4). 100 epochs are used for fine-tuning.
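Putting phase-2 together, here is a hedged sketch of the student objective on dummy features. Note that OFD's full distillation loss involves a feature connector and a margin-ReLU on the teacher features; this sketch uses a plain l2 distance on matched features, as the text describes d_p, and the feature dimensions are illustrative assumptions:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    logits = logits - logits.max()
    return -(logits[label] - np.log(np.exp(logits).sum()))

def distill_l2(f_teacher, f_student):
    """Squared l2 distance between teacher and student features."""
    return float(np.sum((f_teacher - f_student) ** 2))

def phase2_loss(logits, label, f_t, f_s, lam=1e-4):
    """Final student objective: L = L_ce + lambda * L_distill,
    with lambda = 1e-4 as stated in the text."""
    return cross_entropy(logits, label) + lam * distill_l2(f_t, f_s)

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)   # 1,000-way classification head
f_t = rng.normal(size=256)       # teacher feature (hypothetical dimension)
f_s = rng.normal(size=256)       # student feature
total = phase2_loss(logits, 0, f_t, f_s)
```

The small λ keeps the distillation term from dominating the classification loss while still pulling the student's features toward the self-supervised teacher's.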
2. Experimental Results
- A small subset of ImageNet is used as the VIPriors challenge dataset.
- There are still 1,000 classes, but only 50 images per class in each of the train/val/test splits, resulting in 150,000 images in total.
- ResNet-50 is used as backbone.
- Finally, by combining phase-1 and phase-2, the proposed pipeline achieves a 16.7% gain in top-1 accuracy over the supervised baseline.
- The proposed margin loss is less sensitive to the number of negatives and can therefore be used in a data-deficient setting.