Brief Review — How Transferable are Self-supervised Features in Medical Image Classification Tasks?

DVME, Aggregating SimCLR, SwAV, and DINO Features

Sik-Ho Tsang
Nov 30, 2022


How Transferable are Self-supervised Features in Medical Image Classification Tasks?, DVME, by Bayer AG, Germany
2021 ML4H (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Medical Image Analysis

  1. Dynamic Visual Meta-Embedding (DVME) is proposed, which boosts performance by aggregating the feature vectors of SimCLR, SwAV, and DINO.


  1. Dynamic Visual Meta-Embedding (DVME)
  2. Results

1. Dynamic Visual Meta-Embedding (DVME)

Dynamic Visual Meta-Embedding (DVME)

Dynamic Visual Meta-Embedding (DVME) is proposed, aggregating information by concatenating the embeddings of SimCLR, SwAV, and DINO pretrained models.

  • For SimCLR and SwAV, the embedding is taken from the last fully connected unit and has dimension 2048.
  • For DINO, the embedding is constructed by concatenating the class tokens of the last four blocks, which gives a dimension of 1536.
  • Each embedding is then projected to a fixed size of 512, and the concatenation of the projected embeddings is fed into a self-attention module. This module is the same as the ViT one, except that attention is learned across the components of the meta-embedding instead of across image patches.
  • The output of the attention module is concatenated and projected to a fixed dimension of 512.
  • In this way, the importance of each embedding component is learnt for the specific downstream task.
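The steps above can be sketched as a shape walk-through in NumPy. This is only a minimal sketch, not the authors' implementation: the class name `DVMESketch`, the helper `linear`, the random untrained weights, and the single attention head without residual connections or layer normalization are all assumptions made for illustration. Only the dimensions (2048, 2048, 1536, 512) come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random (weight, bias) pair standing in for a trained linear layer."""
    weight = rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out_dim))
    return weight, np.zeros(out_dim)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class DVMESketch:
    """Hypothetical single-head version of the meta-embedding head."""

    def __init__(self, in_dims=(2048, 2048, 1536), dim=512):
        self.dim = dim
        self.proj = [linear(n, dim) for n in in_dims]   # one projection per backbone
        self.q, self.k, self.v = (linear(dim, dim) for _ in range(3))
        self.out = linear(dim * len(in_dims), dim)      # final projection to `dim`

    def __call__(self, embeddings):
        # Project each backbone embedding to `dim` and stack: (batch, 3, dim).
        tokens = np.stack(
            [e @ w + b for e, (w, b) in zip(embeddings, self.proj)], axis=1
        )
        q = tokens @ self.q[0] + self.q[1]
        k = tokens @ self.k[0] + self.k[1]
        v = tokens @ self.v[0] + self.v[1]
        # Attention across the three embedding components, not image patches.
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.dim))  # (batch, 3, 3)
        mixed = attn @ v                                              # (batch, 3, dim)
        flat = mixed.reshape(mixed.shape[0], -1)                      # (batch, 3 * dim)
        return flat @ self.out[0] + self.out[1], attn

batch = 4
simclr = rng.normal(size=(batch, 2048))  # stand-in for SimCLR features
swav = rng.normal(size=(batch, 2048))    # stand-in for SwAV features
dino = rng.normal(size=(batch, 1536))    # stand-in for 4 concatenated DINO class tokens
meta, attn = DVMESketch()([simclr, swav, dino])
print(meta.shape, attn.shape)  # (4, 512) (4, 3, 3)
```

Each row of `attn` sums to 1 and can be read as how strongly one embedding component attends to the others, which is one way to picture the "importance of each embedding component" being learnt.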

2. Results

2.1. Datasets

Number of samples for different subtasks
  • There are 4 datasets. From each, small (S), medium (M), and full (F) subsets are generated; S and M are used to evaluate the low data regime.

2.2. Individual Self-Supervised Learning Approach

Linear evaluation (Left) and fine-tuning (Right) performance of different self-supervised initializations
  • SwAV and SimCLR pretrained features yield inconsistent patterns across all downstream tasks.
  • DINO initialization consistently outperforms all the other initializations across all tasks by a significant margin.
  • All self-supervised pretrained initializations achieve higher performance than the supervised pretrained and randomly initialized baselines in the low data regimes.

This suggests that the representations generated by self-supervised methods are of higher quality, leading to better performance on the test set and reducing the performance variability between folds in low data regimes.
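For readers unfamiliar with the protocol, linear evaluation means the pretrained encoder stays frozen and only a linear classifier is trained on its features, whereas fine-tuning also updates the encoder. Below is a minimal sketch of the linear evaluation side, assuming toy Gaussian features in place of real encoder outputs; the function name `linear_eval` and all hyperparameters are hypothetical and not taken from the paper.

```python
import numpy as np

def linear_eval(features, labels, n_classes, lr=0.5, epochs=200):
    """Train only a softmax classifier on frozen features.

    The encoder that produced `features` is never touched here, which is
    what distinguishes linear evaluation from fine-tuning.
    """
    n, d = features.shape
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ w + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n          # gradient of the mean cross-entropy
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

# Toy "frozen features": two well-separated Gaussian blobs (an assumption,
# standing in for embeddings extracted once from a pretrained model).
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(-2, 1, (50, 8)), rng.normal(2, 1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

w, b = linear_eval(feats, labels, n_classes=2)
accuracy = ((feats @ w + b).argmax(axis=1) == labels).mean()
print(f"train accuracy: {accuracy:.2f}")
```

Because the classifier is this simple, linear evaluation mostly measures how linearly separable the pretrained features already are.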

2.3. Proposed DVME

Linear evaluation performance of Dynamic Visual Meta-Embedding (DVME)

DVME outperforms the benchmark in 4/4 of the S subtasks, 3/4 of the M subtasks, and 2/4 of the F subtasks.

  • The improvement of DVME over the benchmark is particularly pronounced for the APTOS and NIH Chest X-ray tasks. For example, DVME gains roughly 6% in Kappa score over the best individual baseline for the S and M subtasks of the APTOS dataset.
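The Kappa variant is not specified above. The sketch below assumes the quadratic weighted Cohen's kappa, which is the metric of the APTOS 2019 Kaggle competition; the function name and the toy labels are illustrative only.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1.

    Assumes at least two distinct labels occur, so the expected
    disagreement in the denominator is non-zero.
    """
    n = len(y_true)
    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]

    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2  # penalty grows with distance
            expected = true_hist[i] * pred_hist[j] / n    # agreement expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den

perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], n_classes=5)
partial = quadratic_weighted_kappa([0, 1, 2], [0, 1, 1], n_classes=3)
print(perfect, round(partial, 4))  # 1.0 0.6667
```

Predictions that miss by one grade are penalized less than predictions that miss by several, which suits ordinal labels such as disease severity grades.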

2.4. t-SNE Visualization

t-SNE visualization of embeddings obtained using different pretrained feature extractors
  • DINO offers a clearer class separation compared to its supervised counterpart.

The DVME clusters are better separated, particularly in the case of multiclass classification.


[2021 ML4H] [DVME]
How Transferable are Self-supervised Features in Medical Image Classification Tasks?

1.2. Unsupervised/Self-Supervised Learning

1993 … 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR] [W-MSE] [SimSiam+AL] [BYOL+LP] [DVME] 2022 [BEiT] [BEiT V2]

1.9. Biomedical Image Classification

2017 … 2021 [MICLe] [MoCo-CXR] [CheXternal] [CheXtransfer] [Ciga JMEDIA’21] [DVME]

My Other Previous Paper Readings


