Brief Review — Contrastive Learning of Medical Visual Representations from Paired Images and Text

ConVIRT, Image+Text Contrastive Learning, Outperforms Image-Based Contrastive Learning SimCLR & MoCo v2

Sik-Ho Tsang
5 min readAug 1, 2022
Examples of Paired Images and Texts

Contrastive Learning of Medical Visual Representations from Paired Images and Text, ConVIRT, by Stanford University,
2020 arXiv v1, Over 100 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Contrastive Learning, Medical Image Classification, Image Retrieval

  • Contrastive VIsual Representation Learning from Text (ConVIRT), is proposed, which pretrains medical image encoders with the paired image and text data via a bidirectional contrastive objective between the two modalities.


  1. Contrastive VIsual Representation Learning from Text (ConVIRT)
  2. Results

1. Contrastive VIsual Representation Learning from Text (ConVIRT)

Overview of ConVIRT framework

1.1. Overall Framework

  • At a high level, each input image xv and text xu are converted into d-dimensional vector representations v and u respectively, following a similar processing pipeline.
  • For each input image xv, our method starts by drawing a random view ~xv from xv with a sampled transformation function tv~T, where T represents a family of stochastic image transformation functions described later.
  • Next, the encoder function fv transforms ~xv into a fixed-dimensional vector hv, followed by a non-linear projection function gv which further transforms hv into vector v:
  • Similar for text input:

1.2. Contrastive Loss

  • An image-to-text contrastive loss for the i-th pair:
  • where <> is the cosine similarity function.
  • This loss takes the same form as the InfoNCE loss in CPCv1 that maximally preserve the mutual information between the true pairs under the representation functions.
  • A similar text-to-image contrastive loss is:
  • The overall contrastive loss is:

1.3. Realization

  • gv and gu are modeled as single-hidden-layer neural networks:
  • where σ is a ReLU non-linearity, and similarly for gu.
  • For the image encoder fv, ResNet-50 is used.
  • For the text encoder fu, BERT encoder followed by a max-pooling layer over all output vectors, is used.
  • For the image transformation family tv~T, it is a five random transformations: cropping, horizontal flipping, affine transformation, color jittering and Gaussian blur.
  • For the text transformation function tu, a simple uniform sampling of a sentence from the input document xu, is used, to preserve the semantic meaning.

2. Results

2.1. Medical Image Classification

Results for the medical image classification tasks: (a) linear classification; (b) fine-tuning setting
  • (a) Linear Classification: Compared to random initialization, ImageNet initialization provides markedly better representations.
  • In-domain image initialization methods that use paired image-text data further improve over ImageNet initialization in almost all settings.

On three out of the four tasks, with only 1% training data ConVIRT is able to achieve classification results better than the default ImageNet initialization with 100% training data, highlighting the high quality of the learned representations from ConVIRT.

  • (b) Fine-Tuning: ImageNet initialization is again better than random initialization with smaller margins.
  • All in-domain initialization methods are better than the popular ImageNet initialization in most settings.

The proposed ConVIRT pretraining again achieves the best overall results in 10 out of the 11 settings, with the exception of the CheXpert dataset with all training data used, where the result of ConVIRT is similar to that of the Caption-Transformer result.

2.2. Image Retrieval

Zero-shot image-image and text-image retrieval results on the CheXpert 8×200 datasets
  • Using ImageNet pretrained CNN weights in a zero-shot image retrieval setting is only better than random guess by small margins.
  • All in-domain pretrained CNN weights achieve much better retrieval performance than ImageNet weights.

The proposed ConVIRT pretraining achieves the best overall retrieval results on all metrics.

2.3. Visualization

t-SNE visualizations of encoded image representations from different pretraining methods
  • ConVIRT pretraining achieves a better clustering of the images in the t-SNE plots. On the other hand, the lack of clear separations between groups suggests room for further improvement.

2.4. Compared with Image-Only Contrastive Learning

Comparisons of ConVIRT to image-only unsupervised image representation learning approaches
  • Compared to ImageNet initialization, both contrastive methods SimCLR and MoCo v2 lead to marginal to moderate improvements on the classification and retrieval tasks.

However, ConVIRT training strategy substantially outperforms both SimCLR and MoCo v2 methods on all tasks.

According to OpenReview, ConVIRT is rejected in 2021 ICLR though it has got over 100 citations already.


[2020 arXiv] [ConVIRT]
Contrastive Learning of Medical Visual Representations from Paired Images and Text

Self-Supervised Learning

19932020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] [BYOL+GN+WS] [ConVIRT] 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR]

Biomedical Image Classification

2017 [ChestX-ray8] 2019 [CheXpert] 2020 [VGGNet for COVID-19] [Dermatology] [ConVIRT] 2021 [MICLe] [MoCo-CXR]

My Other Previous Paper Readings



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.