Review — SEED: Self-supervised Distillation For Visual Representation
SEED, by Arizona State University, and Microsoft Corporation,
2021 ICLR, Over 100 Citations (Sik-Ho Tsang @ Medium)
- SElf-SupErvised Distillation (SEED) is proposed, in which a larger network (Teacher) transfers its representational knowledge to a smaller architecture (Student) in a self-supervised fashion.
- Instead of directly learning from unlabeled data, a student encoder is trained to mimic the similarity score distribution inferred by a teacher over a set of instances.
1. Motivations & Preliminaries
Smaller models with fewer parameters cannot effectively learn instance-level discriminative representations from large amounts of data.
- Knowledge distillation (KD) is introduced to address this problem.
- When labels are available, KD can be used with supervised learning:
- where xi is an image, yi is the corresponding annotation, θS is the parameter set for the student network, and θT is the set for the teacher network. The loss Lsup is the alignment error between the network prediction and the annotation, which is usually cross-entropy loss for classification.
- The loss Ldistill is the mimic error of the student network with respect to a pre-trained teacher network.
- In Model Distillation, the teacher signal comes from the Softmax prediction of multiple large-scale networks and the loss is measured by the Kullback–Leibler (KL) divergence. Other approaches align intermediate feature map values by minimizing the squared l2 distance.
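The supervised KD objective referred to in the list above is not reproduced in this text. A reconstruction consistent with the symbol definitions given there is sketched below; the sum over N training samples and the unweighted addition of the two terms are assumptions.

```latex
\hat{\theta}_S = \arg\min_{\theta_S} \sum_{i=1}^{N}
  \Big[ \mathcal{L}_{sup}(x_i, \theta_S, y_i)
      + \mathcal{L}_{distill}(x_i, \theta_S, \theta_T) \Big]
```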
In this paper, KD is instead applied in the self-supervised setting, where no labels are available.
- Inspired by contrastive SSL, a simple approach is formulated for the distillation on the basis of instance similarity distribution over a contrastive instance queue.
- An instance queue is maintained for storing data samples’ encoding output from the teacher.
- Given a new sample, its similarity scores are computed with all the samples in the queue using both the teacher and the student models.
The student is then optimized by minimizing the cross entropy between the student’s and the teacher’s similarity score distributions.
- Specifically, a randomly augmented view xi of an image is first mapped and normalized into feature vector representations:
- where fTθ and fSθ denote the teacher and student encoders, respectively.
- Let D=[d1, …, dK] denote the instance queue where K is the queue length and dj is the feature vector from the teacher encoder. D is progressively updated under the “first-in first-out” strategy as distillation proceeds.
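The normalization and the first-in first-out queue described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: `l2_normalize` and `InstanceQueue` are hypothetical names, and initializing the queue with random unit vectors is an assumption.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale each row of v to unit L2 norm."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)


class InstanceQueue:
    """Fixed-length FIFO buffer of teacher features (hypothetical helper)."""

    def __init__(self, length, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: start from random unit vectors; they are overwritten
        # by real teacher features as distillation proceeds.
        self.features = l2_normalize(rng.standard_normal((length, dim)))
        self.ptr = 0  # index of the oldest entry

    def enqueue(self, teacher_feats):
        """Overwrite the oldest entries with a batch of teacher features."""
        teacher_feats = np.atleast_2d(teacher_feats)
        n = teacher_feats.shape[0]
        length = self.features.shape[0]
        idx = (self.ptr + np.arange(n)) % length  # wrap around at the end
        self.features[idx] = teacher_feats
        self.ptr = (self.ptr + n) % length
```

With the queue length from the text one would construct `InstanceQueue(length=65536, dim=128)` and call `enqueue` with each batch of normalized teacher features. Only the teacher writes into the queue; the student reads from it when similarity scores are computed.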
The maintained samples in queue D are mostly random and irrelevant to the target instance xi.
Minimizing the cross entropy between the similarity score distribution computed by the student and teacher based on D softly contrasts xi with randomly selected samples, without directly aligning with the teacher encoder.
- To address this, the teacher’s embedding (zTi) is added to the queue, forming D+=[d1, …, dK, dK+1] with dK+1=zTi.
- The queue size K is 65,536.
- Let pT(xi; θT, D+) denote the similarity score between the extracted teacher feature zTi and dj’s (j=1, …, K+1) computed by the teacher model. pT(xi; θT, D+) is defined as:
- where τT=0.01 is a temperature parameter.
- Similarly, for pS(xi; θS, D+) at student model:
- where τS=0.2.
- The self-supervised distillation can be formulated as minimizing the cross entropy between the similarity scores of the teacher, pT(xi; θT, D+), and the student, pS(xi; θS, D+), over all the instances xi:
- Since the teacher network is pre-trained and frozen, the queued features are consistent during training w.r.t. the student network.
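The formulas referred to in this section did not survive in the text. The block below is a reconstruction from the definitions given above (normalized features, the extended queue D+, temperature-scaled softmax, and cross entropy); the exact typesetting, the superscript placement on the temperatures, and the sum over N samples are assumptions.

```latex
% Normalized teacher and student features
z_i^T = \frac{f_\theta^T(x_i)}{\lVert f_\theta^T(x_i) \rVert_2}, \qquad
z_i^S = \frac{f_\theta^S(x_i)}{\lVert f_\theta^S(x_i) \rVert_2}

% Queue extended with the teacher embedding
D^+ = [d_1, \dots, d_K, d_{K+1}], \qquad d_{K+1} = z_i^T

% Teacher and student similarity distributions, j = 1, \dots, K+1
p_j^T(x_i; \theta_T, D^+) =
  \frac{\exp(z_i^T \cdot d_j / \tau^T)}{\sum_{d \in D^+} \exp(z_i^T \cdot d / \tau^T)}
\qquad
p_j^S(x_i; \theta_S, D^+) =
  \frac{\exp(z_i^S \cdot d_j / \tau^S)}{\sum_{d \in D^+} \exp(z_i^S \cdot d / \tau^S)}

% Distillation objective: cross entropy between the two distributions
\hat{\theta}_S = \arg\min_{\theta_S} \sum_{i=1}^{N}
  - \sum_{j=1}^{K+1} p_j^T(x_i; \theta_T, D^+) \, \log p_j^S(x_i; \theta_S, D^+)
```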
2.6. Relation with InfoNCE Loss
- When τT→0, the softmax output pT approaches a one-hot vector, where pTK+1 equals 1 and all other entries are 0. In this extreme case, the loss becomes:
- which is similar to the widely used InfoNCE loss.
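The distillation loss and its τT→0 limit can be checked numerically with a small NumPy sketch. This is an illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, a single sample is processed at a time, and all features are assumed to be unit-norm.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def seed_loss(z_s, z_t, queue, tau_t=0.01, tau_s=0.2):
    """Cross entropy between teacher and student similarity distributions.

    z_s, z_t: unit-norm student / teacher features of one sample, shape (dim,).
    queue:    unit-norm teacher features d_1..d_K, shape (K, dim).
    """
    d_plus = np.vstack([queue, z_t])  # D+ = [d_1, ..., d_K, z_t]
    log_p_t = log_softmax(d_plus @ z_t / tau_t)
    log_p_s = log_softmax(d_plus @ z_s / tau_s)
    return float(-(np.exp(log_p_t) * log_p_s).sum())


def info_nce_limit(z_s, z_t, queue, tau_s=0.2):
    """Limit of seed_loss as tau_t -> 0.

    Only the last entry of D+ (the teacher embedding itself) keeps
    teacher weight, so the loss reduces to -log p^S_{K+1}.
    """
    d_plus = np.vstack([queue, z_t])
    return float(-log_softmax(d_plus @ z_s / tau_s)[-1])
```

Calling `seed_loss` with a very small `tau_t` (for example `1e-3`) and comparing it with `info_nce_limit` on the same inputs should give nearly identical values, provided no queue entry coincides with the teacher embedding. The defaults `tau_t=0.01` and `tau_s=0.2` are the temperatures stated in the text.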
- Teachers: MoCo v2, SwAV, and SimCLR are used to pre-train the teacher network for 200 epochs.
- ResNet with different depths/widths is used as the network backbone, and a multi-layer perceptron (MLP) head (two linear layers with one ReLU activation in between) is appended at the end of the encoder after average pooling. The dimension of the final feature is 128.
- Students: Multiple smaller networks with fewer learnable parameters are used: MobileNetV3-Large, EfficientNet-B0, and smaller ResNet with fewer layers (ResNet-18, 34).
- Similar to the pre-training of the teacher network, an additional MLP layer is appended to the student network.
- Accuracy improves remarkably with SEED distillation, and a stronger teacher network with more parameters leads to a better-performing student network.
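The MLP head described in the setup above (two linear layers with a ReLU in between, 128-d output) can be sketched in NumPy as follows. The helper names, the random initialization, and the hidden width are assumptions made for illustration; the text only fixes the layer structure and the output dimension.

```python
import numpy as np


def init_mlp_head(in_dim, hidden_dim, out_dim=128, seed=0):
    """Random weights for a two-layer projection head (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.standard_normal((in_dim, hidden_dim)) / np.sqrt(in_dim),
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, out_dim)) / np.sqrt(hidden_dim),
        "b2": np.zeros(out_dim),
    }


def mlp_head(pooled, params):
    """Linear -> ReLU -> Linear, followed by L2 normalization of each row."""
    hidden = np.maximum(pooled @ params["w1"] + params["b1"], 0.0)
    out = hidden @ params["w2"] + params["b2"]
    return out / np.linalg.norm(out, axis=-1, keepdims=True)
```

For a ResNet-18 student, whose average-pooled feature is 512-dimensional, one would call `init_mlp_head(in_dim=512, hidden_dim=512)`; setting the hidden width equal to the input width is an assumption here.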
3.1. Downstream Tasks
SEED distillation surpasses contrastive self-supervised pre-training consistently on all benchmarks, verifying the effectiveness of SEED.
On COCO, the improvement is relatively minor. A possible reason is that the COCO training set has 118k images while VOC has only 16.5k: a larger training set with more fine-tuning iterations reduces the importance of the initial weights.
3.2. Ablation Studies
Table 3 (Left): SEED is agnostic to pre-training approaches, making it easy to use any self-supervised model (including clustering-based approaches like SwAV) in self-supervised distillation. In addition, more training epochs for both teacher SSL pre-training and distillation bring further gains.
Figure 5 (Right): Performance clearly improves as the depth and width of the teacher network increase. However, further architectural enlargement has relatively limited effect, and it is suspected that accuracy is limited by the student network capacity in this case.
Table 4 (Left): The simple l2-distance minimizing approach can achieve a decent accuracy.
- Adding the original SSL (MoCo v2) supervision as a supplementary loss to SEED does not bring additional benefits to distillation. SEED achieves 57.9%, while SEED + MoCo v2 achieves 57.6%. This implies that the SEED loss largely covers the original SSL loss, so it is not necessary to conduct SSL any further during distillation.
Table 5: The best τT is a trade-off depending on the data distribution.