Brief Review — Distilling Visual Priors from Self-Supervised Learning
Using MoCo v2 as Teacher, Knowledge Distillation for Student, in VIPriors Challenge
Distilling Visual Priors from Self-Supervised Learning
MoCo v2 + Distillation, by Tongji University and Megvii Research Nanjing
2020 ECCV Workshop VIPriors Challenge (Sik-Ho Tsang @ Medium)
Image Classification: 1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM] [LDBM] [data2vec]
==== My Other Paper Readings Are Also Over Here ====
- This paper participated in the “Visual Inductive Priors for Data-Efficient Computer Vision” (VIPriors) challenge at the 2020 ECCV Workshop, which targets the data-deficient scenario.
- The first phase is to learn a teacher model using MoCo v2.
- The second phase is to distill the representations into a student model in a self-distillation manner.
Outline
- Proposed Framework
- Results
1. Proposed Framework
- There are two phases: Phase-1 trains the teacher and Phase-2 trains the student.
1.1. Phase-1: Teacher
- MoCo v2 is used to train the backbone in a self-supervised manner for 800 epochs.
- The original loss used by MoCo v2 is the InfoNCE contrastive loss:
  L_q = −log( exp(q·k⁺ / τ) / Σ_{i=0}^{K} exp(q·k_i / τ) )
  where q is the query, k⁺ is the positive key, the k_i are the keys in the queue, and τ is a temperature.
- MoCo v2 uses a momentum encoder θ_k to encode all the keys k and puts them in a queue for negative sampling. The momentum encoder is a momentum average of the query encoder θ_q:
  θ_k ← m·θ_k + (1 − m)·θ_q
- In a data-deficient dataset, the maximum size of the queue is limited. The authors therefore add a margin to the original loss so that the model learns a larger margin between data samples and reaches a similar result with fewer negative examples (see the sketch after this list).
- The queue size is 4096.
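To make Phase-1 concrete, below is a minimal PyTorch-style sketch of the momentum update and a margin-modified InfoNCE loss. The momentum update is the standard MoCo rule; the placement of the margin (subtracted from the positive logit, in the spirit of additive-margin softmax) and the values τ = 0.2 and margin = 0.4 are assumptions, since the post does not reproduce the paper's formula, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q  (standard MoCo momentum update)
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def margin_info_nce(q, k_pos, queue, tau=0.2, margin=0.4):
    """InfoNCE loss with a margin subtracted from the positive logit (assumed form).

    q:      (N, C) L2-normalized query features
    k_pos:  (N, C) L2-normalized positive key features
    queue:  (C, K) L2-normalized negative keys; K = 4096 in the paper
    """
    l_pos = torch.einsum('nc,nc->n', q, k_pos).unsqueeze(-1) - margin  # (N, 1), margin value is illustrative
    l_neg = torch.einsum('nc,ck->nk', q, queue)                        # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau                    # (N, 1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)  # the positive key is always index 0
```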
1.2. Phase-2: Self-Distillation on Labeled Dataset
- The distillation process can be seen as a regularization that prevents the student from overfitting the small training dataset and gives the student a more diverse representation for classification.
- Following OFD [9], a feature-distillation loss L_distill measures the distance d_p between the teacher's and the student's feature maps.
- The distance metric d_p is the l2-distance in this paper.
- A cross-entropy loss L_CE on the ground-truth labels is used for classification.
- The final loss function for the student model combines both terms: L = L_CE + λ·L_distill.
- λ = 10^(−4), and 100 epochs are used for fine-tuning (a sketch of this objective follows).
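A minimal sketch of the Phase-2 objective, assuming the distillation term is a plain l2 (MSE) distance between the student's features and the frozen teacher's features; the connector and margin-ReLU details of OFD [9] are omitted for brevity, and the class and argument names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class StudentLoss(nn.Module):
    """L = L_CE + lambda * L_distill, with lambda = 1e-4 as stated in the post."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, logits, labels, feat_student, feat_teacher):
        ce = F.cross_entropy(logits, labels)                       # classification term
        distill = F.mse_loss(feat_student, feat_teacher.detach())  # l2 feature distance, teacher frozen
        return ce + self.lam * distill
```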
2. Results
2.1. Dataset
- A small subset of ImageNet is used as the VIPriors challenge dataset.
- There are still 1,000 classes but only 50 images per class in each of the train/val/test splits, for a total of 150,000 images (an illustrative subsampling sketch follows).
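For illustration only (the official challenge provides its own split files), a 50-images-per-class subset could be drawn from an ImageFolder-style directory as sketched below; the path and random seed are placeholders.

```python
import random
from collections import defaultdict

from torch.utils.data import Subset
from torchvision.datasets import ImageFolder

def subsample_per_class(root, images_per_class=50, seed=0):
    """Keep `images_per_class` randomly chosen images per class."""
    dataset = ImageFolder(root)
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset.samples):  # samples: list of (path, class_index)
        by_class[label].append(idx)
    rng = random.Random(seed)
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, images_per_class))
    return Subset(dataset, keep)
```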
2.2. Performance
- ResNet-50 is used as backbone.
- Finally, by combining Phase-1 and Phase-2, the proposed pipeline achieves a 16.7% gain in top-1 accuracy over the supervised baseline.
- The proposed margin loss is less sensitive to the number of negatives and can thus be used in a data-deficient setting.
- Several other tricks and stronger backbone models are used for better performance: larger resolution, AutoAugment, ResNeXt-101, label smoothing (Inception-v3), 10-crop, and a 2-model ensemble (a 10-crop/ensemble evaluation sketch is shown below).
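As a rough illustration of the last two test-time tricks, here is a 10-crop evaluation combined with a 2-model ensemble that averages softmax outputs; the resize/crop sizes and the averaging scheme are assumptions, not details given in the post.

```python
import torch
from torchvision import transforms

# 10-crop preprocessing: 4 corners + center, each with its horizontal flip.
ten_crop = transforms.Compose([
    transforms.Resize(256),                           # sizes are illustrative
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),  # (10, 3, 224, 224); normalization omitted
])

@torch.no_grad()
def ensemble_predict(models, image):
    crops = ten_crop(image)
    probs = 0.0
    for model in models:
        model.eval()
        logits = model(crops)                         # (10, num_classes)
        probs = probs + logits.softmax(dim=1).mean(dim=0)
    return (probs / len(models)).argmax().item()
```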