Review: PIRL (Pretext-Invariant Representation Learning)

PIRL Outperforms MoCo, CMC, CPC, etc.

ImageNet classification with linear models. Single-crop top-1 accuracy on the ImageNet validation data against number of parameters
  • Many prior pretext tasks lead to representations that are covariant with image transformations.
  • Pretext-Invariant Representation Learning (PIRL, pronounced as “pearl”) is proposed, which learns semantic representations that are invariant to such transformations.


  1. PIRL Overview
  2. PIRL With Memory Bank
  3. PIRL Details
  4. SOTA Comparison
  5. Further Analysis

1. PIRL: Pretext-Invariant Representation Learning

Pretext-Invariant Representation Learning (PIRL)
  • Given an image dataset D = {I_1, …, I_|D|}, where each image I_n has size H×W×3, and a set of image transformations T.
  • A convolutional network Φ_θ(·) with parameters θ is trained to construct image representations v_I = Φ_θ(I) that are invariant to the image transformations t ∈ T:
      \ell_{inv}(\theta; D) = \mathbb{E}_{t \sim p(T)} \left[ \frac{1}{|D|} \sum_{I \in D} L\left(v_I, v_{I^t}\right) \right]
  • where p(T) is some distribution over the transformations in T, and I^t denotes image I after application of transformation t, that is, I^t = t(I).
  • A contrastive loss L(·, ·) is used to implement the above ℓ_inv(·).
  • In noise contrastive estimation (NCE), each “positive” sample (I, I^t) has N corresponding “negative” samples. The negative samples are obtained by computing features from other images.
  • The noise contrastive estimator models the probability of the binary event that (I, I^t) originates from the data distribution as:
      h\left(v_I, v_{I^t}\right) = \frac{\exp\left(s(v_I, v_{I^t}) / \tau\right)}{\exp\left(s(v_I, v_{I^t}) / \tau\right) + \sum_{I' \in D_N} \exp\left(s(v_{I^t}, v_{I'}) / \tau\right)}
  • where s(·, ·) is a matching score that computes the cosine similarity of two image representations, D_N ⊆ D \ {I} is a set of N negative samples drawn uniformly at random from dataset D excluding image I, and τ = 0.07 is a temperature parameter.
  • (For NCE, please feel free to read NCE, Negative Sampling, CPC.)
  • (For temperature parameter, please feel free to read Distillation.)
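  • A head f(·) is applied to the features v_I of I, and a head g(·) is applied to the features v_{I^t} of I^t. NCE then amounts to minimizing the following loss, where the sum runs over the N negative images I′:
      L_{NCE}(I, I^t) = -\log\left[ h\left(f(v_I), g(v_{I^t})\right) \right] - \sum_{I' \in D_N} \log\left[ 1 - h\left(g(v_{I^t}), f(v_{I'})\right) \right]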
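
To make the NCE loss above concrete, here is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation. The function names (`cosine`, `nce_loss`) and the toy vectors are hypothetical, and the sketch assumes one particular reading of the loss: the positive pair and all negatives share a single normaliser Z, anchored at the embedding of the transformed image.

```python
import numpy as np

def cosine(a, b):
    """Matching score s(a, b): cosine similarity of two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nce_loss(f_vI, g_vIt, f_negatives, tau=0.07):
    """NCE-style loss for one positive pair and N negatives.

    Assumption: the positive and the negatives share one normaliser Z,
    anchored at the transformed-image embedding g_vIt.
    """
    pos = np.exp(cosine(f_vI, g_vIt) / tau)
    negs = np.array([np.exp(cosine(g_vIt, n) / tau) for n in f_negatives])
    z = pos + negs.sum()
    h_pos = pos / z      # probability assigned to the true pair
    h_negs = negs / z    # probability assigned to each negative
    return float(-np.log(h_pos) - np.log(1.0 - h_negs).sum())

# Toy 3-D embeddings (hypothetical values, for illustration only).
f_vI = np.array([1.0, 0.0, 0.0])
negatives = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
aligned = np.array([1.0, 0.1, 0.0])     # close to f_vI
misaligned = np.array([0.0, 1.0, 0.1])  # close to a negative instead

loss_aligned = nce_loss(f_vI, aligned, negatives)
loss_misaligned = nce_loss(f_vI, misaligned, negatives)
print(loss_aligned < loss_misaligned)
```

With a small temperature such as τ = 0.07, cosine differences are amplified strongly, so the loss is near zero when the transformed image's embedding matches its own image and large when it matches a negative.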

2. PIRL With Memory Bank

PIRL With Memory Bank
  • It is difficult to obtain a large number of negatives without increasing the batch to an infeasibly large size.
  • A memory bank of “cached” features is used following Instance Discrimination.
  • The memory bank, M, contains a feature representation m_I for each image I in dataset D.
  • This allows the negative samples f(v_I′) to be replaced by their memory bank representations m_I′ without increasing the training batch size.
  • It is emphasized that the representations that are stored in the memory bank are all computed on the original images, I, without the transformation t.
  • The final loss function is a convex combination of two NCE losses:
      L(I, I^t) = \lambda \, L_{NCE}\left(m_I, g(v_{I^t})\right) + (1 - \lambda) \, L_{NCE}\left(m_I, f(v_I)\right)
  • The first term is simply the NCE loss above (Equation 4 in the paper) but uses the memory representations m_I and m_I′ instead of f(v_I) and f(v_I′), respectively.
  • The second term does two things: (1) it encourages the representation f(v_I) to be similar to its memory representation m_I, thereby dampening the parameter updates; and (2) it encourages the representations f(v_I) and f(v_I′) to be dissimilar.
  • λ = 0.5 is used by default. Setting λ = 0 recovers the loss used in Instance Discrimination.
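
The memory-bank loss can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code. The class `MemoryBank`, the functions `nce` and `pirl_loss`, and the random initialisation are hypothetical names and choices. The exponential-moving-average update with weight 0.5 is an assumption that the text above does not state, and the NCE term again uses a shared normaliser.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def nce(anchor, positive, negatives, tau=0.07):
    """NCE-style loss with a shared normaliser (an assumption of this sketch)."""
    anchor, positive, negatives = map(l2_normalize, (anchor, positive, negatives))
    pos = np.exp(anchor @ positive / tau)      # score of the true pair
    neg = np.exp(negatives @ anchor / tau)     # scores of the N negatives
    z = pos + neg.sum()
    return float(-np.log(pos / z) - np.log(1.0 - neg / z).sum())

def pirl_loss(m_I, f_vI, g_vIt, m_negs, lam=0.5):
    """lam * L_NCE(m_I, g(v_It)) + (1 - lam) * L_NCE(m_I, f(v_I))."""
    return lam * nce(g_vIt, m_I, m_negs) + (1.0 - lam) * nce(f_vI, m_I, m_negs)

class MemoryBank:
    """One cached feature per image, computed on untransformed images."""

    def __init__(self, num_images, dim=128, momentum=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.m = l2_normalize(self.rng.standard_normal((num_images, dim)))
        self.momentum = momentum

    def sample_negatives(self, index, n):
        # Draw n cached features uniformly at random, excluding image `index`.
        candidates = np.delete(np.arange(len(self.m)), index)
        picked = self.rng.choice(candidates, size=n, replace=False)
        return self.m[picked]

    def update(self, index, f_vI):
        # Assumed update rule: exponential moving average of f(v_I).
        mixed = self.momentum * self.m[index] + (1 - self.momentum) * l2_normalize(f_vI)
        self.m[index] = l2_normalize(mixed)

bank = MemoryBank(num_images=100, dim=128)
rng = np.random.default_rng(1)
f_vI, g_vIt = rng.standard_normal(128), rng.standard_normal(128)
negs = bank.sample_negatives(index=7, n=16)
loss = pirl_loss(bank.m[7], f_vI, g_vIt, negs, lam=0.5)
bank.update(7, f_vI)
```

Because the negatives come from the bank rather than from the current batch, the number of negatives (16 here) can be chosen independently of the batch size. With `lam=0.0` only the second term remains, which corresponds to the λ = 0 case.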

3. PIRL Details

3.1. Some Details

  • ResNet-50 (R-50) is used to compute image representations for both I and I^t.
  • The representation of I, f(v_I), is computed by extracting res5 features, average pooling them, and applying a linear projection to obtain a 128-dimensional representation.

3.2. PIRL With Jigsaw

  • The representation g(v_{I^t}) of a transformed image I^t is computed as follows:
  1. extract nine patches from image I,
  2. compute an image representation for each patch separately by extracting activations from the res5 layer of the ResNet-50 and average pooling the activations,
  3. apply a linear projection to obtain a 128-dimensional representation for each patch, and
  4. concatenate the patch representations in random order and apply a second linear projection to the result to obtain the final 128-dimensional image representation, g(v_{I^t}).
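
The four steps above can be sketched in NumPy as follows. This is a shape-level illustration only. `toy_backbone` is a hypothetical stand-in for the ResNet-50 res5 features (here just average pooling followed by a random linear map and ReLU), the weights are random rather than trained, and the patches come from an even 3×3 grid, which is a simplifying assumption.

```python
import numpy as np

def extract_patches(image, grid=3):
    """Step 1: cut the image into grid x grid equally sized patches."""
    h, w, _ = image.shape
    ph, pw = h // grid, w // grid
    return [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            for r in range(grid) for c in range(grid)]

def toy_backbone(patch, w_backbone):
    """Step 2 stand-in: average pool the patch, then a linear map with ReLU."""
    pooled = patch.mean(axis=(0, 1))               # shape (3,)
    return np.maximum(pooled @ w_backbone, 0.0)    # shape (feat_dim,)

def jigsaw_head(image, params, rng):
    patches = extract_patches(image)
    # Step 3: project every patch feature to 128 dimensions.
    feats = [toy_backbone(p, params["backbone"]) @ params["patch_proj"]
             for p in patches]
    # Step 4: concatenate in random order, then project back to 128 dimensions.
    order = rng.permutation(len(feats))
    concat = np.concatenate([feats[i] for i in order])   # shape (9 * 128,)
    return concat @ params["final_proj"]                  # shape (128,)

rng = np.random.default_rng(0)
feat_dim = 64  # hypothetical; stands in for the res5 feature width
params = {
    "backbone": rng.standard_normal((3, feat_dim)),
    "patch_proj": rng.standard_normal((feat_dim, 128)),
    "final_proj": rng.standard_normal((9 * 128, 128)),
}
image = rng.random((96, 96, 3))
embedding = jigsaw_head(image, params, rng)
print(embedding.shape)
```

The random permutation in step 4 is what makes the input to the second projection a shuffled view of the image, while the output keeps the same 128-dimensional size as f(v_I) so the two can be compared with cosine similarity.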

4. SOTA Comparison

4.1. Object Detection

Object detection on VOC07+12 using Faster R-CNN
  • NPID: a special case of PIRL obtained by setting λ=0.
  • NPID++: a stronger NPID baseline that uses more negative samples and trains for more epochs.
  • Compared to pretraining on the Jigsaw pretext task, PIRL achieves AP improvements of 5 points.
  • PIRL also outperforms NPID++.
Object detection on VOC07 using Faster R-CNN

4.2. Image Classification with Linear Models

Image classification with linear models
  • The quality of the image representations is assessed by training linear classifiers on fixed image representations.

4.2.1. ImageNet

  • NPID++ achieves a single-crop top-1 accuracy of 59%, which is higher than or on par with existing work that uses a single ResNet-50. Yet, PIRL substantially outperforms NPID++.
  • “PIRL-ens.”, which uses two ResNet-50 networks, obtains a top-1 accuracy of 65.7%.
  • “PIRL-c2x”, which doubles the number of channels in ResNet-50, obtains 67.4%. This is close to the accuracy obtained by AMDIM [4] with a model that has 6× more parameters, as shown in the first figure at the top.

4.2.2. Other Datasets

4.3. Semi-Supervised Image Classification

Semi-supervised learning on ImageNet
  • PIRL performs at least as well as S⁴L [75] and better than VAT [20].
  • PIRL also outperforms Jigsaw and NPID++.

4.4. Pre-Training on Uncurated Image Data

Pre-training on uncurated YFCC images

5. Further Analysis

5.1. Does PIRL learn invariant representations?

Invariance of PIRL representations

5.2. Which layer produces the best representations?

5.3. What is the effect of λ in the PIRL loss function?

Effect of varying the trade-off parameter λ

5.4. What is the effect of the number of image transforms?

5.5. What is the effect of the number of negative samples?

5.6. Generalizing PIRL to Other Pretext Tasks

Using PIRL with (combinations of) different pretext tasks
  • The pretext image transforms from both the Jigsaw and Rotation tasks are combined in the set of image transformations, T. Specifically, I^t is obtained by first applying a rotation and then performing a Jigsaw transformation.
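
A combined transformation of this kind can be sketched as below. This is a hedged illustration, not the authors' code. The function name `rotate_then_jigsaw` is hypothetical, the rotation set {0°, 90°, 180°, 270°} is an assumption (the text only says "Rotation"), and the patches are cut from an even 3×3 grid.

```python
import numpy as np

def rotate_then_jigsaw(image, rng, grid=3):
    """Apply a random rotation, then cut into patches and shuffle them."""
    k = int(rng.integers(4))                        # number of 90-degree turns
    rotated = np.rot90(image, k=k, axes=(0, 1))     # rotate in the H-W plane
    h, w = rotated.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [rotated[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    order = rng.permutation(grid * grid)            # Jigsaw permutation
    return [patches[i] for i in order], k, order

rng = np.random.default_rng(0)
image = np.arange(6 * 6 * 3).reshape(6, 6, 3)       # toy 6x6 RGB image
shuffled, k, order = rotate_then_jigsaw(image, rng)
print(len(shuffled), shuffled[0].shape)
```

The shuffled patches would then be fed to the Jigsaw head g(·), so the network is asked to produce the same representation regardless of both the rotation and the patch order.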


