Review — CLIP: Learning Transferable Visual Models From Natural Language Supervision

Contrastive Language-Image Pre-Training (CLIP), Learn Image Representation From Image Captioning Dataset

CLIP ViT-L is much better than ImageNet-Pretrained ResNet-101 for other datasets.
  • Conventionally, a fixed set of predetermined object categories is used for training and prediction, e.g.: ImageNet.
  • Contrastive Language-Image Pre-Training (CLIP) is proposed, in which the pre-training task of predicting which caption goes with which image is used to learn image representations from scratch on a dataset of 400 million (image, text) pairs.
  • After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks.
  • (For a quick read, please read Sections 1, 2, 3.1, and 4.)


  1. Motivations & WebImageText (WIT) Dataset
  2. Contrastive Language-Image Pre-Training (CLIP)
  3. Zero-Shot Transfer Results
  4. Linear Probe Results
  5. Task & Distribution Shift Robustness

1. Motivations & WebImageText (WIT) Dataset

1.1. Motivation

  • Learning from natural language is much easier to scale compared to standard crowd-sourced labeling for image classification since it does not require annotations.
  • It also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer.

1.2. Creating a Sufficiently Large Dataset: WebImageText (WIT)

  • A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet.
  • The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams.
  • During construction, (image, text) pairs are collected whose text includes one of a set of 500,000 queries. Class balance is approximated by including up to 20,000 (image, text) pairs per query.
  • The resulting dataset has a similar total word count to the WebText dataset used to train GPT-2. This dataset is referred to as WIT, for WebImageText. Due to the large size of the pre-training dataset, over-fitting is not a major concern.
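
The per-query cap described above can be illustrated with a small sketch. It is not the authors' pipeline: the function name `collect_pairs`, the plain substring matching, and the rule that a pair is assigned to at most one query are all assumptions made for illustration.

```python
from collections import defaultdict


def collect_pairs(candidates, queries, max_per_query=20_000):
    """Keep (image, text) pairs whose text contains a query, capped per query.

    `candidates` is an iterable of (image_id, text) tuples. Matching is a
    naive lowercase substring test, which is an assumption of this sketch.
    """
    kept = defaultdict(list)
    for image_id, text in candidates:
        lowered = text.lower()
        for query in queries:
            # Assign the pair to the first matching query that still has room.
            if query in lowered and len(kept[query]) < max_per_query:
                kept[query].append((image_id, text))
                break
    return dict(kept)


candidates = [
    ("a", "A dog runs"),
    ("b", "dog and cat"),
    ("c", "the dog sleeps"),
    ("d", "a cat"),
    ("e", "a bird"),
]
balanced = collect_pairs(candidates, ["dog", "cat"], max_per_query=2)
```

With a cap of 2, the third "dog" caption is dropped, which is how the cap keeps very common queries from dominating the dataset.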

2. Contrastive Language-Image Pre-Training (CLIP)

2.1. Pretraining Framework

CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.
Numpy-like pseudocode for the core of an implementation of CLIP.
  • CLIP is trained from scratch and does not use any non-linear projection layers; only a linear projection maps each encoder’s representation to the multi-modal embedding space.
  • The text transformation function t_u, as used in ConVIRT, is removed.
  • The image transformation function t_v is simplified: a random square crop from resized images is the only image data augmentation.
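  • Finally, the temperature parameter τ (a softmax temperature, as in Knowledge Distillation), which controls the range of the logits in the softmax, is optimized directly during training as a log-parameterized multiplicative scalar, initialized to the equivalent of 0.07, rather than being tuned as a hyper-parameter.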
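
The training objective described above can be sketched in NumPy as follows. This is a minimal sketch, not the authors' code: it assumes the encoder outputs are already computed, and the names `clip_loss`, `w_image`, `w_text`, and `log_inv_temp` are hypothetical. The temperature enters as a log-parameterized scale, passed in as `log_inv_temp` and set so that τ = 0.07 in the usage lines.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def cross_entropy(logits, labels):
    """Mean cross-entropy over rows; logits is [n, n], labels is [n]."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def clip_loss(image_features, text_features, w_image, w_text, log_inv_temp):
    """Symmetric contrastive loss over a batch of n matching (image, text) pairs."""
    # Linear projection into the shared embedding space, then L2-normalize.
    image_emb = l2_normalize(image_features @ w_image)
    text_emb = l2_normalize(text_features @ w_text)
    # Pairwise cosine similarities, scaled by exp(log_inv_temp) = 1 / tau.
    logits = (image_emb @ text_emb.T) * np.exp(log_inv_temp)
    # The i-th image matches the i-th text, so the targets are the diagonal.
    labels = np.arange(logits.shape[0])
    loss_images = cross_entropy(logits, labels)    # pick the text for each image
    loss_texts = cross_entropy(logits.T, labels)   # pick the image for each text
    return (loss_images + loss_texts) / 2


# Usage with random stand-in features (the shapes are arbitrary assumptions).
rng = np.random.default_rng(0)
image_features = rng.standard_normal((8, 32))
text_features = rng.standard_normal((8, 24))
w_image = rng.standard_normal((32, 16))
w_text = rng.standard_normal((24, 16))
loss = clip_loss(image_features, text_features, w_image, w_text, np.log(1 / 0.07))
```

Because the loss averages a row-wise and a column-wise cross-entropy over the same similarity matrix, each image has to pick out its caption and each caption has to pick out its image.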

2.2. Models

  • For the image encoder, ResNet-50 is used as the base model, with some modifications: the ResNet-D improvements and anti-aliased rect-2 blur pooling are adopted, and global average pooling is replaced by attention pooling, implemented as a single layer of “Transformer-style” multi-head QKV attention.
  • The other image encoder considered is the Vision Transformer, ViT, with two modifications: an additional layer normalization is applied to the combined patch and position embeddings before the Transformer, and a slightly different initialization scheme is used.
  • The text encoder is a Transformer. The base size is a 63M-parameter, 12-layer, 512-wide model with 8 attention heads.
  • For the text encoder, only the width of the Transformer is scaled, in proportion to the calculated increase in width of the ResNet; its depth is not scaled at all.
  • In practice, a series of 5 ResNets and 3 ViTs is trained.
  • The ResNets are a ResNet-50, a ResNet-101, and 3 more which follow EfficientNet-style model scaling and use approximately 4×, 16×, and 64× the compute of a ResNet-50. They are denoted as RN50×4, RN50×16, and RN50×64 respectively.
  • The ViTs are a ViT-B/32, a ViT-B/16, and a ViT-L/14.

2.3. Training

  • All models are trained for 32 epochs with a very large minibatch size of 32,768. Mixed-precision training is used, and the calculation of embedding similarities is also sharded, with each GPU computing only the subset of pairwise similarities needed for its local batch of embeddings.
  • The largest ResNet model, RN50×64, took 18 days to train on 592 V100 GPUs while the largest ViT took 12 days on 256 V100 GPUs.
  • The ViT-L/14 is also pre-trained at a higher 336-pixel resolution for one additional epoch to boost performance, similar to FixRes. This model is denoted as ViT-L/14@336px.

3. Zero-Shot Transfer Results

3.1. Comparison with Visual N-Grams

  • Visual N-Grams is an earlier approach that leverages text for visual tasks.
Some non-cherry-picked predictions of zero-shot CLIP classifiers on various datasets.

3.2. Prompt Engineering and Ensembling

Prompt engineering and ensembling improve zero-shot performance.
  • The prompt template “A photo of a <label>.” is found to be a good default that helps specify that the text is about the content of the image.
  • Ensembling over multiple zero-shot classifiers, each built from a different prompt, is another way of improving performance.
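
A zero-shot classifier with prompt ensembling can be sketched as below. The paper constructs the ensemble in embedding space rather than probability space, and the sketch follows that by averaging the text embeddings of several prompts per class. Everything else is an assumption for illustration: `toy_text_encoder` is a hypothetical stand-in that returns a deterministic pseudo-random unit vector per string (it is not a trained encoder), and the function names and the three templates are illustrative choices.

```python
import hashlib

import numpy as np


def toy_text_encoder(text, dim=16):
    """Hypothetical stand-in for a trained text encoder.

    Returns a deterministic pseudo-random unit vector for each string.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


def build_zero_shot_classifier(class_names, templates, encode_text):
    """Return a [num_classes, dim] weight matrix, one unit vector per class."""
    weights = []
    for name in class_names:
        embs = np.stack([encode_text(t.format(name)) for t in templates])
        mean_emb = embs.mean(axis=0)  # ensemble in embedding space
        weights.append(mean_emb / np.linalg.norm(mean_emb))
    return np.stack(weights)


def zero_shot_predict(image_emb, classifier):
    """Pick the class whose text embedding has the highest cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    return int(np.argmax(classifier @ image_emb))


templates = ["A photo of a {}.", "A photo of a big {}.", "A photo of a small {}."]
class_names = ["dog", "cat", "plane"]
classifier = build_zero_shot_classifier(class_names, templates, toy_text_encoder)
```

Since the averaged embedding is computed once per class, the ensemble costs the same at prediction time as a single-prompt classifier.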

3.3. Comparison with Few-Shot Linear Probes

Zero-shot CLIP outperforms few-shot linear probes.
  • Zero-shot CLIP roughly matches the performance of the best performing 16-shot classifier in the evaluation suite, which uses the features of a BiT-M ResNet-152×2 trained on ImageNet-21K.

3.4. Room for Improvement

Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal.
  • The dashed, y=x line represents an “optimal” zero-shot classifier that matches the performance of its fully supervised equivalent.
  • Zero-shot CLIP only approaches fully supervised performance on 5 datasets: STL10, CIFAR10, Food101, OxfordPets, and Caltech101. On all 5 datasets, both zero-shot accuracy and fully supervised accuracy are over 90%.

4. Linear Probe Results

Linear probe performance of CLIP models in comparison with state-of-the-art computer vision models.
  • Models trained with CLIP scale very well and the largest model trained (ResNet-50×64) slightly outperforms the best performing existing model (a Noisy Student EfficientNet-L2) on both overall score and compute efficiency.
  • It is also found that CLIP ViTs are about 3× more compute efficient than CLIP ResNets.
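
For readers unfamiliar with the term, a linear probe fits a linear classifier on top of frozen features. The sketch below is a minimal softmax regression trained with plain gradient descent on synthetic stand-in features. The optimizer, the hyper-parameter values, and the names `fit_linear_probe` and `predict` are assumptions of this sketch, not the evaluation code used in the paper.

```python
import numpy as np


def fit_linear_probe(features, labels, num_classes, lr=0.5, steps=500, l2=1e-4):
    """Fit softmax regression on fixed features; the encoder is never updated."""
    n, d = features.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / n                 # d(loss) / d(logits)
        w -= lr * (features.T @ grad + l2 * w)       # L2 penalty on weights only
        b -= lr * grad.sum(axis=0)
    return w, b


def predict(features, w, b):
    """Return the predicted class index for each row of features."""
    return np.argmax(features @ w + b, axis=1)


# Synthetic stand-in for frozen image features: three separated clusters.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + 0.5 * rng.standard_normal((150, 2))
w, b = fit_linear_probe(features, labels, num_classes=3)
accuracy = (predict(features, w, b) == labels).mean()
```

Only `w` and `b` are learned, so the resulting accuracy reflects the quality of the frozen representation rather than any fine-tuning of the encoder.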

5. Task & Distribution Shift Robustness

5.1. Task Shift Robustness

CLIP’s features are more robust to task shift when compared to models pre-trained on ImageNet.

5.2. Distribution Shift Robustness

Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models.

