Brief Review — Distilling Visual Priors from Self-Supervised Learning

Using MoCo v2 as Teacher, Knowledge Distillation for Student, in VIPriors Challenge

Sik-Ho Tsang
4 min read · Jun 30


VIPriors Challenge (Image from 2020 ECCV Workshop VIPriors Challenge)

Distilling Visual Priors from Self-Supervised Learning
MoCo v2+Distillation, by Tongji University, and Megvii Research Nanjing
2020 ECCV Workshop VIPriors Challenge (Sik-Ho Tsang @ Medium)



  1. Proposed Framework
  2. Results

1. Proposed Framework

Proposed Framework
  • There are 2 phases. Phase-1 for teacher and Phase-2 for student.

1.1. Phase-1: Teacher

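  • MoCo v2 is used to train the backbone in a self-supervised manner for 800 epochs.
  • The original loss used by MoCo v2 is the InfoNCE contrastive loss, L_moco = −log[ exp(q·k+/τ) / (exp(q·k+/τ) + Σ_{k−} exp(q·k−/τ)) ], where q is the query, k+ is its positive key, k− ranges over the negative keys, and τ is the temperature.
  • MoCo v2 uses a momentum encoder θk to encode all the keys k and put them in a queue for negative sampling. The momentum encoder is a moving average of the query encoder θq: θk ← η·θk + (1 − η)·θq, where η is the momentum coefficient.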
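
The contrastive loss and the momentum update described above can be sketched in plain Python for a single query. This follows the standard MoCo formulation, not the authors' code: the function names, the temperature `tau=0.2`, and the momentum value `0.999` are illustrative assumptions, since this review does not state them.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors (lists of floats)."""
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(q, k_pos, k_negs, tau=0.2):
    """InfoNCE loss for one query: -log softmax probability of the positive key.

    tau is an assumed temperature; k_negs plays the role of the queue.
    """
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]

def momentum_update(theta_k, theta_q, momentum=0.999):
    """Element-wise theta_k <- momentum * theta_k + (1 - momentum) * theta_q."""
    return [momentum * pk + (1.0 - momentum) * pq
            for pk, pq in zip(theta_k, theta_q)]

if __name__ == "__main__":
    q = [1.0, 0.0]
    print(info_nce_loss(q, k_pos=[1.0, 0.0], k_negs=[[0.0, 1.0]]))
    print(momentum_update([1.0, 2.0], [0.0, 0.0], momentum=0.9))
```

In a real setup the vectors would be normalized encoder outputs and the parameters would be full network weights; the lists here only stand in for them.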

In a data-deficient dataset, the maximum size of the queue is limited. The authors therefore propose adding a margin to the original loss function. The margin pushes data samples further apart, which helps the model reach similar results with fewer negative examples.

  • The queue size is 4096.
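
A minimal sketch of the margin idea follows, assuming the margin is subtracted from the positive similarity before temperature scaling. That placement, the values `margin=0.4` and `tau=0.2`, and the function names are assumptions made for illustration; this review does not give the exact formula or hyperparameters.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors (lists of floats)."""
    return sum(x * y for x, y in zip(a, b))

def margin_info_nce_loss(q, k_pos, k_negs, margin=0.4, tau=0.2):
    """Contrastive loss with a margin on the positive pair (assumed form).

    The positive logit is (q.k_pos - margin) / tau, so the positive pair must
    beat the negatives by at least `margin` to reach the same loss value.
    """
    pos = (dot(q, k_pos) - margin) / tau
    logits = [pos] + [dot(q, k) / tau for k in k_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - pos

if __name__ == "__main__":
    q, k_pos, k_negs = [1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]
    print(margin_info_nce_loss(q, k_pos, k_negs, margin=0.0))
    print(margin_info_nce_loss(q, k_pos, k_negs, margin=0.4))
```

With `margin=0.0` this reduces to the plain contrastive loss. A positive margin raises the loss for the same inputs, so the encoder has to separate positives from negatives more strongly. In the setting above, `k_negs` would hold up to 4096 queued keys.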

1.2. Phase-2: Self-Distillation on Labeled Dataset

The distillation process can be seen as a form of regularization that prevents the student from overfitting the small training set and gives the student a more diverse representation for classification.

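  • The distillation loss is a distance d_p between the teacher's and the student's features, where d_p is the l2 distance in this paper.
  • A cross-entropy loss is used alongside it for classification.
  • The final loss function for the student model combines the two terms with a weighting factor λ.
  • λ=10^(−4). 100 epochs are used for fine-tuning.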
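
The student objective can be sketched as below. It assumes that λ scales the distillation term and that d_p is a plain, unsquared Euclidean distance between feature vectors; neither detail is spelled out in this review, and the function names are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample with an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def l2_distance(f_teacher, f_student):
    """Euclidean distance between teacher and student feature vectors."""
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(f_teacher, f_student)))

def student_loss(logits, label, f_teacher, f_student, lam=1e-4):
    """Assumed combination: classification loss + lam * distillation loss."""
    return cross_entropy(logits, label) + lam * l2_distance(f_teacher, f_student)

if __name__ == "__main__":
    print(student_loss(logits=[2.0, 0.5, -1.0], label=0,
                       f_teacher=[0.3, 0.7], f_student=[0.1, 0.9]))
```

The teacher features would come from the frozen Phase-1 backbone, so only the student receives gradients from both terms.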

2. Results

2.1. Dataset

The challenge dataset is a subset of ImageNet. There are still 1,000 classes, but only 50 images per class in each of the train/val/test splits, for a total of 150,000 images.

2.2. Performance

Proposed Framework
  • ResNet-50 is used as the backbone.

Finally, by combining phase-1 and phase-2, the proposed pipeline achieves a 16.7-point gain in top-1 accuracy over the supervised baseline.

Linear Classifier

The proposed margin loss is less sensitive to the number of negatives and can be used in a data-deficient setting.

Bag of Tricks

Several other tricks and stronger backbone models are used for better performance: larger resolution, AutoAugment, ResNeXt-101, label smoothing (Inception-v3), 10-crop testing, and a 2-model ensemble.


