Brief Review — How Transferable are Self-supervised Features in Medical Image Classification Tasks?

DVME, Aggregating SimCLR, SwAV, and DINO Features

Nov 30, 2022

How Transferable are Self-supervised Features in Medical Image Classification Tasks?,
DVME, by Bayer AG, Germany
2021 ML4H (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification, Medical Image Analysis

Dynamic Visual Meta-Embedding (DVME) is proposed, which boosts performance by aggregating the feature vectors of SimCLR, SwAV, and DINO.

Outline

Dynamic Visual Meta-Embedding (DVME)
Results

1. Dynamic Visual Meta-Embedding (DVME)

DVME aggregates information by concatenating the embeddings of SimCLR, SwAV, and DINO pretrained models.

For SimCLR and SwAV, the embedding is extracted from the last fully connected unit and has dimension 2048.
For DINO, the embedding is constructed by concatenating the class tokens of the last four blocks, which results in a dimension of 1536.
Then, each embedding is projected to a fixed size of 512, and the concatenation of the projected embeddings is fed into a self-attention module. This module is the same as the one in ViT, except that attention is learned across the components of the meta-embedding instead of across image patches.
The output of the attention module is concatenated and projected to a fixed dimension of 512.
The importance of each embedding component is learnt for a specific downstream task.
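
The steps above can be sketched in a few lines of NumPy to make the shapes concrete. This is only an illustrative sketch, not the authors' implementation: the class name `DVMESketch`, the helper functions, and the random weights are all hypothetical, the attention is a single head without the layer norm, residual connections, or MLP of a full ViT block, and the three projected embeddings are treated as a sequence of three tokens. The 384-dimensional DINO class token is also an assumption, chosen because 4 × 384 matches the stated 1536.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weight and zero bias, standing in for learned parameters."""
    return rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out_dim)), np.zeros(out_dim)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class DVMESketch:
    """Hypothetical, simplified meta-embedding head (single-head attention)."""

    def __init__(self, in_dims=(2048, 2048, 1536), d=512):
        self.d = d
        self.proj = [linear(n, d) for n in in_dims]       # one projection per backbone
        self.wq, self.wk, self.wv = (linear(d, d) for _ in range(3))
        self.out = linear(d * len(in_dims), d)            # concatenated tokens -> 512

    def __call__(self, embeddings):
        # Project each backbone embedding to d and stack: (batch, 3, d)
        tokens = np.stack(
            [x @ w + b for x, (w, b) in zip(embeddings, self.proj)], axis=1
        )
        q = tokens @ self.wq[0] + self.wq[1]
        k = tokens @ self.wk[0] + self.wk[1]
        v = tokens @ self.wv[0] + self.wv[1]
        # Attention across the three embedding components, not image patches
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(self.d))  # (batch, 3, 3)
        mixed = attn @ v                                            # (batch, 3, d)
        flat = mixed.reshape(mixed.shape[0], -1)                    # (batch, 3 * d)
        w, b = self.out
        return flat @ w + b, attn

batch = 4
simclr = rng.normal(size=(batch, 2048))
swav = rng.normal(size=(batch, 2048))
dino = rng.normal(size=(batch, 4 * 384))  # four class tokens, 384 each (assumed)

model = DVMESketch()
meta, attn = model([simclr, swav, dino])
print(meta.shape, attn.shape)  # (4, 512) (4, 3, 3)
```

Each row of `attn` sums to 1 and weights the three components for one sample, which is one way to read "importance of each embedding component" in this simplified setting.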

2. Results

2.1. Datasets

**Number of samples for different subtasks**

There are 4 datasets. Each is split into small (S), medium (M), and full (F) subsets, with S and M used to evaluate the low data regime.

2.2. Individual Self-Supervised Learning Approach

**Linear evaluation (Left) and fine-tuning (Right) performance of different self-supervised initializations**

SwAV and SimCLR pretrained features yield inconsistent patterns across all downstream tasks.
DINO initialization consistently outperforms all the other initializations across all tasks by a significant margin.
All self-supervised pretrained initializations achieve higher performance than the supervised pretrained and randomly initialized baselines in the low data regimes.

This suggests that the representations generated by self-supervised methods are of higher quality, leading to better performance on the test set and lower performance variability between folds in low data regimes.

2.3. Proposed DVME

**Linear evaluation performance of Dynamic Visual Meta-Embedding (DVME)**

DVME outperforms the benchmark (the best individual self-supervised baseline) in 4/4 of the S subtasks, 3/4 of the M subtasks, and 2/4 of the F subtasks.

The improvement of DVME over the benchmark is particularly pronounced for the APTOS and NIH Chest X-ray tasks. For example, DVME gains roughly 6% in Kappa score over the best individual baseline for the S and M subtasks of the APTOS dataset.
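
For readers unfamiliar with the Kappa score, the sketch below shows how a quadratic weighted kappa could be computed. The text above only says "Kappa score", so the quadratic weighted variant is an assumption here (it is commonly used for ordinal grading tasks such as APTOS, which has five severity grades). The function name and the toy labels are hypothetical and are not results from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted Cohen's kappa for integer labels in [0, num_classes)."""
    n = len(y_true)
    # Observed confusion matrix: rows are true labels, columns are predictions
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [
        sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)
    ]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = true_hist[i] * pred_hist[j] / n
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den

# Toy labels for illustration only (five ordinal grades, 0 to 4)
y_true = [0, 1, 2, 3, 4]
perfect = quadratic_weighted_kappa(y_true, [0, 1, 2, 3, 4], num_classes=5)
one_off = quadratic_weighted_kappa(y_true, [0, 1, 2, 3, 3], num_classes=5)
print(perfect, round(one_off, 4))  # 1.0 0.9412
```

Because the weights grow with the squared distance between grades, predicting a neighbouring grade is penalised far less than predicting a distant one.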