Brief Review — Contrastive Learning of Medical Visual Representations from Paired Images and Text

ConVIRT, Image+Text Contrastive Learning, Outperforms Image-Based Contrastive Learning SimCLR & MoCo v2

Examples of Paired Images and Texts
  • Contrastive VIsual Representation Learning from Text (ConVIRT) is proposed. It pretrains medical image encoders on paired image and text data using a bidirectional contrastive objective between the two modalities.


  1. Contrastive VIsual Representation Learning from Text (ConVIRT)
  2. Results

1. Contrastive VIsual Representation Learning from Text (ConVIRT)

Overview of ConVIRT framework

1.1. Overall Framework

  • At a high level, each input image $x_v$ and text $x_u$ is converted into a d-dimensional vector representation, $v$ and $u$ respectively, following a similar processing pipeline.
  • For each input image $x_v$, the method first draws a random view $\tilde{x}_v$ from $x_v$ with a sampled transformation function $t_v \sim \mathcal{T}$, where $\mathcal{T}$ is a family of stochastic image transformation functions described in Section 1.3.
  • Next, the encoder function $f_v$ transforms $\tilde{x}_v$ into a fixed-dimensional vector $h_v$, and a non-linear projection function $g_v$ further transforms $h_v$ into the vector $v$:
      $$ v = g_v(f_v(\tilde{x}_v)) $$
  • Similarly, for the text input, a view $\tilde{x}_u$ is drawn from $x_u$ with a transformation $t_u$, and the text encoder $f_u$ and projection $g_u$ give:
      $$ u = g_u(f_u(\tilde{x}_u)) $$

1.2. Contrastive Loss

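  • An image-to-text contrastive loss for the i-th pair, computed over a minibatch of N pairs:
      $$ \ell_i^{(v \to u)} = -\log \frac{\exp(\langle v_i, u_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle v_i, u_k \rangle / \tau)} $$
  • where $\langle v_i, u_i \rangle$ is the cosine similarity and $\tau$ is a temperature parameter.
  • This loss takes the same form as the InfoNCE loss in CPCv1; minimizing it leads to encoders that maximally preserve the mutual information between the true pairs under the representation functions.
  • A similar text-to-image contrastive loss is:
      $$ \ell_i^{(u \to v)} = -\log \frac{\exp(\langle u_i, v_i \rangle / \tau)}{\sum_{k=1}^{N} \exp(\langle u_i, v_k \rangle / \tau)} $$
  • The overall contrastive loss is a weighted combination of the two, averaged over the minibatch, where $\lambda \in [0, 1]$ is a scalar weight:
      $$ \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \lambda \, \ell_i^{(v \to u)} + (1 - \lambda) \, \ell_i^{(u \to v)} \right) $$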
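
The bidirectional loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: it assumes the two directions are combined by a weighted average with weight `lam`, and the function names and the default values of `tau` and `lam` are illustrative choices. In practice the vectors would come from the two encoders and the loss would be computed with a deep learning framework.

```python
import math

def cosine(a, b):
    """Cosine similarity <a, b> = a.b / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def directional_loss(queries, keys, i, tau):
    """InfoNCE-style loss for the i-th pair: keys[i] is the positive,
    every other key in the batch acts as a negative."""
    logits = [cosine(queries[i], k) / tau for k in keys]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[i] - log_denominator)

def bidirectional_contrastive_loss(V, U, tau=0.1, lam=0.75):
    """Weighted average of image-to-text and text-to-image losses.
    V[i] and U[i] are the image and text vectors of the i-th true pair.
    tau and lam defaults are illustrative assumptions."""
    n = len(V)
    total = 0.0
    for i in range(n):
        loss_vu = directional_loss(V, U, i, tau)  # image -> text
        loss_uv = directional_loss(U, V, i, tau)  # text -> image
        total += lam * loss_vu + (1.0 - lam) * loss_uv
    return total / n
```

With correctly aligned pairs the loss is close to zero, while swapping the text vectors so that each image is paired with the wrong text makes it large, which is the behaviour the objective is designed to have.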

1.3. Realization

  • $g_v$ and $g_u$ are modeled as single-hidden-layer neural networks:
      $$ v = W^{(2)} \sigma(W^{(1)} h_v) $$
  • where $\sigma$ is a ReLU non-linearity, and similarly for $g_u$.
  • For the image encoder $f_v$, ResNet-50 is used.
  • For the text encoder $f_u$, a BERT encoder followed by a max-pooling layer over all output vectors is used.
  • The image transformation family $\mathcal{T}$, from which $t_v$ is sampled, consists of five random transformations: cropping, horizontal flipping, affine transformation, color jittering and Gaussian blur.
  • The text transformation function $t_u$ uniformly samples one sentence from the input document $x_u$, which preserves the semantic meaning of the text.
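
To make the two lightweight pieces of this realization concrete, here is a small sketch of a single-hidden-layer projection head with a ReLU, and of the uniform sentence sampling used as the text transformation. It is only an illustration under stated assumptions: the projection is written without bias terms, sentences are split naively on periods, and all function names are hypothetical. The ResNet-50 and BERT encoders themselves are not reproduced here.

```python
import random

def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, xi) for xi in x]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def projection_head(h, W1, W2):
    """g(h) = W2 * relu(W1 * h): one hidden layer, no biases (assumption)."""
    return matvec(W2, relu(matvec(W1, h)))

def sample_sentence(document, rng=random):
    """t_u: pick one sentence uniformly at random from the document.
    Splitting on '.' is a simplification; a real report would need
    a proper sentence tokenizer."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return rng.choice(sentences)
```

For example, `sample_sentence("No pleural effusion. Heart size is normal.")` returns one of the two sentences, so each training step sees a short, semantically intact view of the report.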

2. Results

2.1. Medical Image Classification

Results for the medical image classification tasks: (a) linear classification; (b) fine-tuning setting
  • (a) Linear Classification: Compared to random initialization, ImageNet initialization provides markedly better representations.
  • In-domain image initialization methods that use paired image-text data further improve over ImageNet initialization in almost all settings.
  • (b) Fine-Tuning: ImageNet initialization is again better than random initialization, though by smaller margins.
  • All in-domain initialization methods are better than the popular ImageNet initialization in most settings.

2.2. Image Retrieval

Zero-shot image-image and text-image retrieval results on the CheXpert 8×200 dataset
  • In a zero-shot image retrieval setting, ImageNet-pretrained CNN weights are only marginally better than random guessing.
  • All in-domain pretrained CNN weights achieve much better retrieval performance than ImageNet weights.
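
As a rough picture of what zero-shot retrieval involves, the sketch below ranks candidate vectors by cosine similarity to a query vector and scores the top-k results against class labels. It is an assumption-laden illustration: the function names are hypothetical, the vectors stand in for encoder outputs, and precision at k is used here only as an example metric.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, candidates, k):
    """Return the indices of the k candidates most similar to the query."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda j: cosine(query, candidates[j]),
        reverse=True,
    )
    return ranked[:k]

def precision_at_k(retrieved, candidate_labels, query_label):
    """Fraction of retrieved candidates that share the query's label."""
    hits = sum(1 for j in retrieved if candidate_labels[j] == query_label)
    return hits / len(retrieved)
```

For image-image retrieval the query is an encoded image; for text-image retrieval it would be an encoded text projected into the same space as the image vectors. No fine-tuning is involved, so the result depends entirely on the quality of the pretrained representations.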

2.3. Visualization

t-SNE visualizations of encoded image representations from different pretraining methods
  • ConVIRT pretraining achieves a better clustering of the images in the t-SNE plots. On the other hand, the lack of clear separations between groups suggests room for further improvement.

2.4. Compared with Image-Only Contrastive Learning

Comparisons of ConVIRT to image-only unsupervised image representation learning approaches
  • Compared to ImageNet initialization, both contrastive methods SimCLR and MoCo v2 lead to marginal to moderate improvements on the classification and retrieval tasks.




