Brief Review — Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

FD-CLIP, FD-SwinV2, Feature Distilling CLIP & SwinV2

Sik-Ho Tsang
4 min read · Dec 12, 2024

Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, by Tsinghua University, Microsoft Research Asia
FD-CLIP, FD-SwinV2, 2022 arXiv v3, Over 130 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
2022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM] [LDBM] [data2vec] [SEER 10B, RG-10B] [iBOT]
==== My Other Paper Readings Are Also Over Here ====

  • Several pre-training approaches, such as image classification, instance contrastive learning, and image-text contrastive learning (CLIP), show inferior fine-tuning performance compared with masked image modeling.
  • It is found that a simple Feature Distillation (FD) post-processing step significantly improves these pre-training approaches by converting their old representations into new ones that are friendlier to fine-tuning.

Outline

  1. FD-CLIP & FD-SwinV2
  2. Visualizations
  3. Results

1. FD-CLIP & FD-SwinV2

1.1. Feature Distillation (FD)

Feature Distillation (FD)

The goal is to obtain a new representation that distills the knowledge of the already pre-trained model while being friendlier to fine-tuning.

  • In this method, the already pre-trained model acts as the teacher, and the new model acts as the student.

1.2. Distilling feature maps so as to be generic

Distillation Target

The output feature map of the pre-trained model is adopted as the distillation target. Using the feature map as the target makes the approach generic: it works with any pre-trained model, including those without a logit output.

  • Distilling the feature map also yields higher fine-tuning accuracy than distilling logits or a single reduced feature vector.
  • A 1 × 1 convolution layer is applied on top of the student network so that the teacher and student may have different output feature dimensions (see the sketch below).
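
As a concrete illustration of this dimension-matching projection, here is a minimal PyTorch sketch; the 768/1024 channel widths and spatial size are assumptions for illustration, not values from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the student outputs 768-channel feature maps,
# the teacher (e.g., a CLIP backbone) outputs 1024-channel feature maps.
student_dim, teacher_dim = 768, 1024

# A 1x1 convolution on a (C, H, W) feature map is a per-location linear
# projection; it lets the student match the teacher's channel width.
g = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)

student_feat = torch.randn(2, student_dim, 14, 14)   # (B, C_student, H, W)
projected = g(student_feat)                          # (B, C_teacher, H, W)
print(projected.shape)                               # torch.Size([2, 1024, 14, 14])
```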

1.3. Whitening teacher features for distillation

Whitening teacher features for distillation
  • Different pre-trained models may produce features with very different orders of magnitude.

The output feature map of the teacher network is normalized by a whitening operation, implemented as a non-parametric layer normalization without scaling and bias.

  • The distillation loss then compares g(s) against the whitened t, where s and t are the output feature vectors of the student and teacher networks, respectively, and g is the 1 × 1 convolution layer introduced above (see the sketch below).
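
Below is a minimal PyTorch sketch of the whitening step and the distillation objective. The non-affine layer norm follows the description above, while the smooth-ℓ1 criterion and all shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def whiten(t):
    # Non-parametric layer norm over the channel dimension:
    # zero mean, unit variance per token, with no learnable scale or bias.
    return F.layer_norm(t, t.shape[-1:])

# Hypothetical shapes: B images, N tokens, differing student/teacher widths.
B, N, C_s, C_t = 2, 196, 768, 1024
s = torch.randn(B, N, C_s)     # student output features (token form)
t = torch.randn(B, N, C_t)     # teacher output features (token form)

g = nn.Linear(C_s, C_t)        # 1x1 conv == per-token linear projection

# Distillation loss: match g(s) against the whitened teacher features.
# A smooth-L1 criterion is assumed here purely for illustration.
loss = F.smooth_l1_loss(g(s), whiten(t))
loss.backward()
```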

1.4. Shared relative position bias

Position Encoding

A shared relative position bias (RPB) configuration is used, in which all layers share the same relative position bias matrices. The paper finds that shared RPB performs best overall.
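
A minimal sketch of how one shared relative position bias table could be reused across all attention layers; the window size, head count, and class name are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SharedRPB(nn.Module):
    """One learnable relative-position-bias table reused by every layer."""
    def __init__(self, window=14, num_heads=12):
        super().__init__()
        # One bias per (relative offset, head); (2W-1)^2 offsets in 2D.
        self.table = nn.Parameter(torch.zeros((2 * window - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window), torch.arange(window), indexing="ij")).flatten(1)
        rel = coords[:, :, None] - coords[:, None, :]        # (2, N, N)
        rel = rel.permute(1, 2, 0) + (window - 1)            # shift to >= 0
        self.register_buffer(
            "index", rel[..., 0] * (2 * window - 1) + rel[..., 1])

    def forward(self):
        # (N, N, heads) -> (heads, N, N), added to every layer's attention logits.
        return self.table[self.index].permute(2, 0, 1)

shared_rpb = SharedRPB()
# In each attention layer:
# attn = softmax(q @ k.transpose(-2, -1) * scale + shared_rpb())
```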

1.5. Asymmetric drop path rates

Asymmetric drop path rates

Specifically, for a ViT-B student, applying a drop path rate of 0.1–0.3 to the student branch and no drop path regularization to the teacher branch works best.
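
A hedged sketch of this asymmetric configuration, assuming the timm library is used to build the backbones (the model name and the 0.2 rate are illustrative, within the paper's reported 0.1–0.3 range):

```python
import timm

# Student gets stochastic depth (drop path); the teacher receives no such
# regularization and is frozen, since it only provides distillation targets.
student = timm.create_model("vit_base_patch16_224", drop_path_rate=0.2)
teacher = timm.create_model("vit_base_patch16_224", drop_path_rate=0.0)

teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False
```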

2. Visualizations

  • Figure 2: For all pre-trained representations before distillation, the attention distances of different heads in deeper layers collapse into a very small range. This suggests that different heads learn very similar visual cues and may waste model capacity.
  • After the distillation process, all representations become more diverse, with attention distances more evenly distributed, especially in deeper layers.
  • Figure 3: The same observation is reflected in Figure 3, which plots the average cosine similarity between attention heads of each layer (a minimal sketch of this measurement follows the list).
  • Figure 4: Representations after feature distillation show much stronger diagonal patterns, which means the model relies more on visual cues that encode relative-position relationships.
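
To make the head-diversity measurement concrete, here is a small sketch (not the paper's code) that computes the average pairwise cosine similarity between the attention heads of one layer; lower values indicate more diverse heads:

```python
import torch
import torch.nn.functional as F

def avg_head_cosine_similarity(attn):
    """attn: (num_heads, N, N) attention maps of one layer for one image."""
    h = attn.flatten(1)                                    # (heads, N*N)
    sim = F.cosine_similarity(h[:, None, :], h[None, :, :], dim=-1)
    n = sim.shape[0]
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]        # drop self-similarity
    return off_diag.mean()

# Example with random attention maps (12 heads over 196 tokens):
attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)
print(avg_head_cosine_similarity(attn).item())
```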

3. Results

FD-CLIP Performance

With the feature distillation method, all listed pre-trained models are improved by 1.0%–2.0% in ImageNet-1K fine-tuning accuracy and by 1.0–3.3 mIoU on ADE20K semantic segmentation.

Using the same UperNet / HTC++ frameworks and evaluation settings as the original SwinV2, FD also improves the 3-billion-parameter SwinV2-G, reaching 61.4 mIoU on ADE20K semantic segmentation and 64.2 mAP on COCO object detection.


Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
