Brief Review — Distilling Visual Priors from Self-Supervised Learning

Using MoCo v2 as Teacher, Knowledge Distillation for Student, in VIPriors Challenge

4 min read · Jun 30, 2023

**VIPriors Challenge** (Image from 2020 ECCV Workshop VIPriors Challenge)

Distilling Visual Priors from Self-Supervised Learning
MoCo v2+Distillation, by Tongji University, and Megvii Research Nanjing
2020 ECCV Workshop VIPriors Challenge (Sik-Ho Tsang @ Medium)
Image Classification
1993 … 2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM] [LDBM] [data2vec]

This paper is an entry in the “Visual Inductive Priors for Data-Efficient Computer Vision” (VIPriors) Challenge at the 2020 ECCV Workshop, which targets the data-deficient scenario.
The first phase learns a teacher model using MoCo v2.
The second phase distills the teacher's representations into a student model in a self-distillation manner.

Outline

Proposed Framework
Results

1. Proposed Framework

There are two phases: Phase-1 trains the teacher and Phase-2 trains the student.

1.1. Phase-1: Teacher

MoCo v2 is used to train the backbone in a self-supervised manner for 800 epochs.
The original loss used by MoCo v2 is:
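In the usual MoCo notation, which is assumed here (query q, its positive key k+, negative keys k− drawn from the queue, temperature τ), this is the InfoNCE contrastive loss:

```latex
\mathcal{L}_{\text{moco}}
  = -\log \frac{\exp(q \cdot k^{+} / \tau)}
               {\exp(q \cdot k^{+} / \tau) + \sum_{k^{-}} \exp(q \cdot k^{-} / \tau)}
```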

MoCo v2 uses a momentum encoder θk to encode all the keys k and puts them in a queue for negative sampling. The momentum encoder is a momentum average of the encoder θq:
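In MoCo the update takes the form θk ← m·θk + (1 − m)·θq, where m is the momentum coefficient. The snippet below is a minimal pure-Python sketch of that update and of a fixed-size key queue. The helper names, the toy parameter vectors, and the value m = 0.9 are illustrative assumptions, not values taken from the paper.

```python
from collections import deque


def momentum_update(theta_k, theta_q, m=0.999):
    """Return the momentum average m * theta_k + (1 - m) * theta_q, element-wise.

    theta_k and theta_q are flat lists of floats standing in for the
    parameters of the key (momentum) encoder and the query encoder.
    """
    return [m * wk + (1.0 - m) * wq for wk, wq in zip(theta_k, theta_q)]


def enqueue_keys(queue, new_keys):
    """Add freshly encoded keys; a deque with maxlen drops the oldest ones."""
    queue.extend(new_keys)
    return queue


# Toy example (hypothetical numbers): two-parameter "encoders".
theta_q = [1.0, 2.0]
theta_k = [0.0, 0.0]
theta_k = momentum_update(theta_k, theta_q, m=0.9)

# A queue of at most 4 keys; the reviewed setup uses a much larger queue.
queue = deque(maxlen=4)
enqueue_keys(queue, [[0.1], [0.2], [0.3]])
enqueue_keys(queue, [[0.4], [0.5]])  # the oldest key [0.1] is evicted
```

Because the key encoder moves slowly, keys that were encoded several steps ago remain roughly consistent with the current encoder, which is what makes reusing them as negatives reasonable.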

In a data-deficient dataset, the maximum size of the queue is limited. The authors therefore propose adding a margin to the original loss function, which enlarges the separation between data samples and helps the model reach similar results with fewer negative examples.

The queue size is 4096.
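One way to read the margin idea is that a margin m is subtracted from the positive similarity q·k+ before temperature scaling, so the positive pair has to beat the negatives by at least that margin. The sketch below implements this reading for a single query in pure Python. The function name, the temperature τ = 0.2, the margin 0.4, and the similarity values are assumptions chosen for illustration.

```python
import math


def contrastive_loss(pos_sim, neg_sims, tau=0.2, margin=0.0):
    """InfoNCE loss for one query, with an optional margin on the positive.

    pos_sim  : similarity q . k+ between the query and its positive key
    neg_sims : similarities q . k- between the query and queued negatives
    margin   : value subtracted from pos_sim; margin=0 gives the plain loss
    """
    pos_logit = (pos_sim - margin) / tau
    logits = [pos_logit] + [s / tau for s in neg_sims]
    # Numerically stable log-sum-exp over the positive and all negatives.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    return log_denominator - pos_logit


# Toy similarities (hypothetical): one positive and three negatives.
pos_sim = 0.8
neg_sims = [0.1, -0.2, 0.3]

plain = contrastive_loss(pos_sim, neg_sims)
with_margin = contrastive_loss(pos_sim, neg_sims, margin=0.4)
```

For the same similarities, `with_margin` is larger than `plain`, so the model keeps receiving a training signal until the positive pair is separated from the negatives by the margin.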

1.2. Phase-2: Self-Distillation on Labeled Dataset

The distillation process can be seen as a regularization that prevents the student from overfitting the small training dataset and gives the student a more diverse representation for classification.

Following OFD [9], the distillation loss is:

where the distance metric dp is the l2 distance in this paper.
Along with a cross-entropy loss for classification:

The final loss function for the student model is:

λ=10^(−4). 100 epochs are used for fine-tuning.
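Read together, the description above suggests a student objective of the form L = L_ce + λ·L_distill. The following pure-Python sketch computes such an objective for one sample. It assumes a squared l2 distance between teacher and student feature vectors and omits any connector layer between them; the function names and toy numbers are hypothetical.

```python
import math


def l2_distance(teacher_feat, student_feat):
    """Squared l2 distance between two feature vectors (an assumed form)."""
    return sum((t - s) ** 2 for t, s in zip(teacher_feat, student_feat))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, via a stable log-sum-exp."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[label]


def student_loss(logits, label, teacher_feat, student_feat, lam=1e-4):
    """Classification loss plus lam times the feature distillation loss."""
    distill = l2_distance(teacher_feat, student_feat)
    return cross_entropy(logits, label) + lam * distill


# Toy sample (hypothetical numbers): 3 classes and 3-dimensional features.
logits = [2.0, 0.5, -1.0]
label = 0
teacher_feat = [1.0, 0.0, 2.0]  # from the frozen Phase-1 teacher
student_feat = [0.5, 0.0, 1.0]  # from the student being fine-tuned

loss = student_loss(logits, label, teacher_feat, student_feat)
```

With a small λ such as 10^(−4), the distillation term acts as a mild regularizer while cross-entropy remains the dominant part of the objective.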

2. Results

2.1. Dataset

A small subset of ImageNet is used as the VIPriors challenge dataset.

There are still 1,000 classes, but only 50 images per class in each of the train/val/test splits, resulting in a total of 150,000 images.

2.2. Performance

ResNet-50 is used as the backbone.

Finally, by combining Phase-1 and Phase-2, the proposed pipeline achieves a gain of 16.7 points in top-1 accuracy over the supervised baseline.

The proposed margin loss is less sensitive to the number of negatives and can be used in a data-deficient setting.

Several other tricks and stronger backbone models are used for better performance: larger resolution, AutoAugment, ResNeXt-101, label smoothing (Inception-v3), 10-crop testing, and a 2-model ensemble.

Written by Sik-Ho Tsang
