Review — PIRL: Pretext-Invariant Representation Learning

PIRL outperforms MoCo, CMC, CPC, etc.

Sik-Ho Tsang
7 min read · Feb 27, 2022
ImageNet classification with linear models: single-crop top-1 accuracy on the ImageNet validation set versus the number of parameters

Self-Supervised Learning of Pretext-Invariant Representations
PIRL, by Facebook AI Research (FAIR)
2020 CVPR, Over 500 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Unsupervised Learning, Contrastive Learning, Representation Learning, Image Classification, Object Detection

  • Many prior pretext tasks lead to representations that are covariant with image transformations.
  • Pretext-Invariant Representation Learning (PIRL, pronounced as “pearl”) is proposed where semantic representations are invariant under such transformations.

Outline

  1. PIRL Overview
  2. PIRL With Memory Bank
  3. PIRL Details
  4. SOTA Comparison
  5. Further Analysis

1. PIRL: Pretext-Invariant Representation Learning

Pretext-Invariant Representation Learning (PIRL)
  • Given an image dataset D = {I1, …, I|D|}, where each image In has size H×W×3, and a set of image transformations T.
  • A convolutional network Φθ(·) with parameters θ is trained to construct image representations vI = Φθ(I) that are invariant to image transformations t ∈ T (see the objective sketched below),
  • where p(T) is some distribution over the transformations in T, and It denotes image I after application of transformation t, that is, It = t(I).
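
Written out from the definitions above, the objective takes roughly the following form (a reconstruction, so the notation may differ slightly from the paper's Equation 1):

```latex
\min_{\theta} \;\; \mathbb{E}_{t \sim p(\mathcal{T})}
\left[ \frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}}
\ell_{\mathrm{inv}}\!\left(v_I, \, v_{I^t}\right) \right]
```

Here ℓinv(·, ·) is a loss that penalizes differences between the representation of I and that of its transformed counterpart It.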

Minimization of this loss encourages the network Φθ(·) to produce the same representation for image I as for its transformed counterpart It.

  • A contrastive loss is used to implement the above ℓinv(·).
  • In the noise contrastive estimator (NCE), each “positive” sample (I, It) has N corresponding “negative” samples. The negative samples are obtained by computing features from other images.
  • The noise contrastive estimator models the probability of the binary event that (I, It) originates from the data distribution; the resulting estimator h(·, ·) is written out after this list,
  • where s(·, ·) is a matching score that measures the cosine similarity of two image representations, DN is a set of N negative samples drawn uniformly at random from the dataset D excluding image I, and τ = 0.07 is a temperature parameter.
  • (For NCE, please feel free to read NCE, Negative Sampling, CPC.)
  • (For temperature parameter, please feel free to read Distillation.)
  • A head f(·) is applied to the features vI of I, and a head g(·) is applied to the features vIt of It. NCE then amounts to minimizing the loss shown below.
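
Written out (a reconstruction from the description above, so indexing may differ slightly from the paper), the NCE probability and the corresponding loss are:

```latex
h(v_1, v_2) =
\frac{\exp\!\left(s(v_1, v_2)/\tau\right)}
     {\exp\!\left(s(v_1, v_2)/\tau\right)
      + \sum_{I' \in \mathcal{D}_N} \exp\!\left(s(v_2, v_{I'})/\tau\right)}

\mathcal{L}_{\mathrm{NCE}}(I, I^t) =
  -\log h\!\left(f(v_I),\, g(v_{I^t})\right)
  - \sum_{I' \in \mathcal{D}_N} \log\!\left(1 - h\!\left(g(v_{I^t}),\, f(v_{I'})\right)\right)
```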

This loss encourages the representation of image I to be similar to that of its transformed counterpart It, whilst also encouraging the representation of It to be dissimilar to that of other images I'.

2. PIRL With Memory Bank

PIRL With Memory Bank
  • It is difficult to obtain a large number of negatives without increasing the batch to an infeasibly large size.
  • A memory bank of “cached” features is used following Instance Discrimination.
  • The memory bank, M, contains a feature representation mI for each image I in dataset D.

The representation mI is an exponential moving average (EMA) of feature representations f(vI) that were computed in prior epochs. A weight of 0.5 is used to compute the EMA.
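
A minimal sketch of this EMA update, with illustrative variable names (not from an official implementation):

```python
import torch
import torch.nn.functional as F

def update_memory_bank(memory_bank, indices, features, momentum=0.5):
    """Exponential moving average (EMA) update of the cached representations m_I.

    memory_bank: (num_images, dim) tensor of cached features, one row per image I in D.
    indices:     (batch,) dataset indices of the images in the current batch.
    features:    (batch, dim) freshly computed f(v_I) for the *untransformed* images I.
    momentum:    EMA weight; PIRL uses 0.5.
    """
    with torch.no_grad():
        updated = momentum * memory_bank[indices] + (1.0 - momentum) * features
        # The matching score is a cosine similarity, so keep cached features unit-norm
        # (an implementation detail assumed here, not spelled out in the review).
        memory_bank[indices] = F.normalize(updated, dim=1)
    return memory_bank
```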

  • This makes it possible to replace the negative samples f(vI') with their memory-bank representations mI', without having to increase the training batch size.
  • It is emphasized that the representations that are stored in the memory bank are all computed on the original images, I, without the transformation t.
  • The final loss function combines two NCE terms (see the sketch after this list):
  • The first term is simply the loss of Equation 4, but it uses the memory representations mI and mI' instead of f(vI) and f(vI'), respectively.
  • The second term does two things: (1) it encourages the representation f(vI) to be similar to its memory representation mI, thereby dampening the parameter updates; and (2) it encourages the representations f(vI) and f(vI') to be dissimilar.
  • λ = 0.5 is used by default; with λ = 0, the loss reduces to the one used in Instance Discrimination.
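
A compact sketch of how the two terms combine. For brevity this uses the common softmax (InfoNCE-style) form of the contrastive term rather than the paper's exact binary NCE formulation, and all tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def nce_term(query, positive, negatives, tau=0.07):
    """Softmax contrastive term: pull `query` towards `positive`, push it from `negatives`."""
    query = F.normalize(query, dim=1)          # (B, d)
    positive = F.normalize(positive, dim=1)    # (B, d) memory-bank entries m_I
    negatives = F.normalize(negatives, dim=1)  # (N, d) memory-bank entries of other images
    pos = (query * positive).sum(dim=1, keepdim=True) / tau   # (B, 1) cosine sim with m_I
    neg = query @ negatives.t() / tau                         # (B, N) cosine sims with m_I'
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, target)     # the positive sits at index 0

def pirl_loss(f_v, g_vt, m_pos, m_neg, lam=0.5):
    """lam * term(g(v_{I^t}) vs m_I) + (1 - lam) * term(f(v_I) vs m_I)."""
    return lam * nce_term(g_vt, m_pos, m_neg) + (1.0 - lam) * nce_term(f_v, m_pos, m_neg)
```

With lam = 0.0 only the second term remains, which is the Instance Discrimination setting mentioned above.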

3. PIRL Details

3.1. Some Details

  • ResNet-50 (R-50) is used to compute image representations for both I and It.
  • The representation of I, f(vI), is computed by extracting res5 features, average pooling, and a linear projection to obtain a 128-dimensional representation.

3.2. PIRL With Jigsaw

  • The representation g(vIt) of a transformed image It, is computed as follows:
  1. extract nine patches from image I,
  2. compute an image representation for each patch separately by extracting activations from the res5 layer of the ResNet-50 and average pool the activations,
  3. apply a linear projection to obtain a 128-dimensional representation for each patch, and
  4. concatenate the patch representations in random order and apply a second linear projection on the result to obtain the final 128-dimensional image representation, g(vIt).
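
A rough sketch of this patch-based head g(·), assuming a ResNet-50 trunk truncated after res5 (module names and dimensions are illustrative; the head f(·) for the untransformed image is simply res5 features, average pooling, and a single linear projection, as noted in Section 3.1):

```python
import torch
import torch.nn as nn

class JigsawHead(nn.Module):
    """Computes g(v_{I^t}) from the nine jigsaw patches of an image."""

    def __init__(self, trunk, feat_dim=2048, out_dim=128):
        super().__init__()
        self.trunk = trunk                                   # ResNet-50 up to res5
        self.pool = nn.AdaptiveAvgPool2d(1)                  # average-pool res5 activations
        self.patch_proj = nn.Linear(feat_dim, out_dim)       # first linear projection (per patch)
        self.final_proj = nn.Linear(9 * out_dim, out_dim)    # second projection after concat

    def forward(self, patches):                  # patches: (B, 9, 3, H, W)
        B = patches.size(0)
        x = patches.flatten(0, 1)                # (B*9, 3, H, W): encode each patch separately
        x = self.pool(self.trunk(x)).flatten(1)  # (B*9, feat_dim)
        x = self.patch_proj(x)                   # (B*9, out_dim)
        x = x.view(B, 9, -1)
        perm = torch.randperm(9)                 # concatenate the patches in random order
        x = x[:, perm, :].reshape(B, -1)         # (B, 9*out_dim)
        return self.final_proj(x)                # (B, out_dim): g(v_{I^t})
```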

4. SOTA Comparison

4.1. Object Detection

Object detection on VOC07+12 using Faster R-CNN
  • NPID: a special case of PIRL obtained by setting λ = 0.
  • NPID++: NPID trained with more negative samples and for more epochs.

PIRL outperforms all alternative self-supervised learning methods in terms of all three AP measures.

  • Compared to pretraining on the Jigsaw pretext task, PIRL achieves AP improvements of 5 points.
  • PIRL also outperforms NPID++.

Interestingly, PIRL even outperforms the supervised ImageNet-pretrained model in terms of the more conservative APall and AP75 metrics. As with the concurrent MoCo work, this shows that a self-supervised learner can outperform supervised pre-training for object detection, which is a substantial improvement.

Object detection on VOC07 using Faster R-CNN

On VOC07, PIRL also outperforms supervised pretraining when finetuning is done on the much smaller VOC07 train+val set.

4.2. Image Classification with Linear Models

Image classification with linear models
  • The quality of image representations is evaluated by training linear classifiers on fixed image representations (see the sketch below).
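
In other words, the pretrained backbone is frozen and only a linear classifier is trained on top of its features. A minimal sketch, where the backbone and data loader are placeholders:

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, lr=0.01):
    """Train a linear classifier on frozen features to measure representation quality."""
    backbone.eval()                              # freeze the self-supervised backbone
    for p in backbone.parameters():
        p.requires_grad_(False)

    classifier = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)         # fixed image representations
            loss = nn.functional.cross_entropy(classifier(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return classifier
```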

4.2.1. ImageNet

On ImageNet, PIRL improves recognition accuracies by over 15% compared to its covariant counterpart, Jigsaw. PIRL achieves the highest single-crop top-1 accuracy of all self-supervised learners that use a single ResNet-50 model.

  • NPID++ achieves a single-crop top-1 accuracy of 59%, which is higher than or on par with existing work that uses a single ResNet-50. Yet, PIRL substantially outperforms NPID++.
  • “PIRL-ens.”: an ensemble of two ResNet-50 models, which obtains a top-1 accuracy of 65.7%.
  • “PIRL-c2x”: doubling the number of channels in ResNet-50 yields 67.4%, which is close to the accuracy obtained by AMDIM [4] with a model that has 6× more parameters, as shown in the first figure at the top.

As in the figure at the top, PIRL outperforms all prior self-supervised learners on ImageNet in terms of the trade-off between model accuracy and size.

4.2.2. Other Datasets

PIRL sets a new state-of-the-art for self-supervised representations in this learning setting on the VOC07, Places205, and iNaturalist datasets.

4.3. Semi-Supervised Image Classification

Semi-supervised learning on ImageNet
  • PIRL performs at least as well as S⁴L [75] and better than VAT [20].
  • PIRL also outperforms Jigsaw and NPID++.

4.4. Pre-Training on Uncurated Image Data

Pre-training on uncurated YFCC images

By training linear classifiers on fixed image representations, PIRL trained on YFCC1M even outperforms Jigsaw and DeeperCluster models that were trained on 100× more data (YFCC100M) from the same distribution.

5. Further Analysis

5.1. Does PIRL learn invariant representations?

Invariance of PIRL representations

The distances between Jigsaw representations have a much larger mean and variance, which suggests that Jigsaw representations covary with the image transformations that were applied.
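
What the figure measures can be sketched as follows: for each image, compute the distance between the (normalized) representation of I and that of its transformed version It, and inspect the distribution of these distances. The helper below is illustrative, not the paper's exact protocol:

```python
import torch
import torch.nn.functional as F

def invariance_distances(encoder, images, transform):
    """l2 distances between normalized features of images and of their transformed versions."""
    with torch.no_grad():
        v = F.normalize(encoder(images), dim=1)
        v_t = F.normalize(encoder(transform(images)), dim=1)
    # A small mean and variance of these distances indicates transform-invariant features.
    return (v - v_t).norm(dim=1)
```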

5.2. Which layer produces the best representations?

The best image representations are extracted from the res5 layer of PIRL-trained networks.

5.3. What is the effect of λ in the PIRL loss function?

Effect of varying the trade-off parameter λ

The performance of PIRL is quite sensitive to the setting of λ, and the best performance is obtained by setting λ = 0.5.

5.4. What is the effect of the number of image transforms?

PIRL outperforms Jigsaw for all cardinalities of T, and it particularly benefits from being able to use very large numbers of image transformations (i.e., large |T|) during training.

5.5. What is the effect of the number of negative samples?

Increasing the number of negatives tends to have a positive influence on the quality of the image representations constructed by PIRL.

5.6. Generalizing PIRL to Other Pretext Tasks

Using PIRL with (combinations of) different pretext tasks
  • The pretext image transforms from both the Jigsaw and Rotation tasks are combined into the set of image transformations T. Specifically, It is obtained by first applying a rotation and then performing the Jigsaw transform (a sketch follows below).
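
A sketch of what such a combined transform might look like, using a random 90-degree rotation followed by a jigsaw-style patching; both helper functions are illustrative, not from the paper's code:

```python
import torch

def random_rotate(img):
    """Rotate an image tensor (C, H, W) by a random multiple of 90 degrees."""
    k = int(torch.randint(0, 4, (1,)))
    return torch.rot90(img, k, dims=(1, 2))

def jigsaw_patches(img, grid=3):
    """Split an image tensor (C, H, W) into grid*grid patches, shuffled at random."""
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    patches = [img[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    order = torch.randperm(len(patches))
    return torch.stack([patches[i] for i in order])   # (9, C, ph, pw)

def rotation_then_jigsaw(img):
    """Combined transform t: first rotate, then apply the jigsaw patching."""
    return jigsaw_patches(random_rotate(img))
```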

The results demonstrate that combining image transforms from multiple pretext tasks can further improve image representations.

Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.