# Review: PIRL, Pretext-Invariant Representation Learning

Self-Supervised Learning of Pretext-Invariant Representations

PIRL, by Facebook AI Research (FAIR), 2020 CVPR, Over 500 Citations (Sik-Ho Tsang @ Medium)

**Many prior pretext tasks** lead to representations that are **covariant with image transformations.** **Pretext-Invariant Representation Learning** (*PIRL*, pronounced as “*pearl*”) is proposed, where **semantic representations are invariant under such transformations**.

# Outline

1. **PIRL Overview**
2. **PIRL With Memory Bank**
3. **PIRL Details**
4. **SOTA Comparison**
5. **Further Analysis**

# 1. PIRL: Pretext-Invariant Representation Learning

- Given **an image dataset**, *D*={*I*1, …, *I*|*D*|}, with an **image** *In* having size of *H*×*W*×3, and **a set of image transformations**, *T*.
- **A convolutional network** *Φθ*(.) with parameters *θ* is trained that constructs **image representations** *vI*=*Φθ*(*I*) that are invariant to image transformations *t*∈*T*:

- where *p*(*T*) is some distribution over the transformations in *T*, and *It* denotes image *I* after application of transformation *t*, that is, *It*=*t*(*I*).

Minimization of this loss encourages the network *Φθ*(.) to produce the same representation for image *I* as for its transformed counterpart *It*.
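For reference, the invariance loss described above can be written out as follows. This is a sketch that assumes the paper's notation; *L*(·,·) denotes a loss between two image representations and is not defined in the text above.

```latex
\ell_{\text{inv}}(\theta; \mathcal{D}) =
  \mathbb{E}_{t \sim p(\mathcal{T})}
  \left[ \frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}} L\left(v_I, v_{I^t}\right) \right]
```

The expectation runs over transformations and the sum over images, so the loss is small only when every image and its transformed version map to nearby representations.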

- **Contrastive loss** is used to implement the above *linv*(.).
- In the **noise contrastive estimator (NCE)**, each “positive” sample (*I*, *It*) has *N* corresponding “negative” samples. The negative samples are obtained by computing features from other images.
- The noise contrastive estimator models the probability of the binary event that (*I*, *It*) originates from the data distribution as:

- where *s*(. , .) is **a matching score** that measures **the cosine similarity of two image representations**, computed against a set of *N* negative samples *I*’, and *τ*=0.07 is a **temperature** parameter.
- (For NCE, please feel free to read NCE, Negative Sampling, CPC.)
- (For temperature parameter, please feel free to read Distillation.)
- **A head** *f*() is applied on features (*vI*) of *I* and **a head** *g*() is applied on features (*vIt*) of *It*. NCE then amounts to minimizing the following **loss**:
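The two expressions referred to above can be sketched in LaTeX as follows, assuming the paper's notation. Here *h* is the modeled probability and \mathcal{D}_N is the set of *N* negatives; neither symbol appears in the text above.

```latex
h(v_I, v_{I^t}) =
  \frac{\exp\left(s(v_I, v_{I^t})/\tau\right)}
       {\exp\left(s(v_I, v_{I^t})/\tau\right)
        + \sum_{I' \in \mathcal{D}_N} \exp\left(s(v_{I^t}, v_{I'})/\tau\right)}

L_{\text{NCE}}(I, I^t) =
  -\log\left[ h\left(f(v_I), g(v_{I^t})\right) \right]
  - \sum_{I' \in \mathcal{D}_N} \log\left[ 1 - h\left(g(v_{I^t}), f(v_{I'})\right) \right]
```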

This loss encourages the representation of image *I* to be similar to that of its transformed counterpart *It*, whilst also encouraging the representation of *It* to be dissimilar to that of other images *I*’.
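The NCE objective described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `cosine` and `nce_loss` are hypothetical, vectors are plain lists of floats standing in for *f*(*vI*), *g*(*vIt*) and the negatives *f*(*vI*’), and all terms share one denominator for simplicity.

```python
import math


def cosine(a, b):
    """Cosine similarity s(a, b) of two non-zero vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nce_loss(original, transformed, negatives, tau=0.07):
    """Sketch of an NCE-style loss for one positive pair.

    original    -- stands in for f(v_I)
    transformed -- stands in for g(v_It)
    negatives   -- list of vectors standing in for f(v_I') of other images
    tau         -- temperature (0.07 in the text)

    Assumption: every term uses the same denominator, built from the
    positive score plus the scores of the transformed image against
    each negative.
    """
    pos = math.exp(cosine(original, transformed) / tau)
    negs = [math.exp(cosine(transformed, n) / tau) for n in negatives]
    denom = pos + sum(negs)

    # Pull the positive pair together.
    loss = -math.log(pos / denom)
    # Push the transformed image away from each negative.
    for n in negs:
        loss -= math.log(1.0 - n / denom)
    return loss
```

The loss is close to zero when the two views of an image agree and the negatives point elsewhere, and it grows when a negative is more similar to the transformed image than the original is.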

# 2. PIRL With Memory Bank

- It is difficult to obtain a **large number of negatives** without increasing the batch to an infeasibly large size.
- **A memory bank of “cached” features** is used, following Instance Discrimination.
- **The memory bank**, *M*, contains **a feature representation** *mI* for each image *I* in dataset *D*.

The representation *mI* is an exponential moving average (EMA) of feature representations *f*(*vI*) that were computed in prior epochs. A weight of 0.5 is used to compute the EMA.
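The EMA update of a memory bank entry can be sketched as follows. This is an illustrative sketch only: `update_memory_bank` is a hypothetical helper, the bank is modeled as a plain dictionary keyed by an image identifier, and initializing an unseen entry with the first feature is an assumption rather than something the text specifies.

```python
def update_memory_bank(memory, image_id, feature, weight=0.5):
    """Blend a freshly computed f(v_I) into the stored m_I.

    memory   -- dict mapping image_id -> list of floats (m_I)
    feature  -- newly computed representation f(v_I) for this image
    weight   -- EMA weight on the old entry (0.5 in the text)
    """
    old = memory.get(image_id)
    if old is None:
        # Assumption: the first feature seen initializes the entry.
        memory[image_id] = list(feature)
    else:
        memory[image_id] = [
            weight * o + (1.0 - weight) * f for o, f in zip(old, feature)
        ]
    return memory[image_id]
```

With a weight of 0.5, the stored entry moves halfway toward each new feature, so older epochs contribute with geometrically decaying influence.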

- This allows the negative samples, *f*(*vI*’), to be replaced by their memory bank representations, *mI*’, without having to increase the training batch size.
- It is emphasized that the **representations that are stored in the memory bank** are all computed on the original images, *I*, **without the transformation** *t*.
- **Two NCE losses** are combined in the final loss function:

- **The first term** is simply **the loss of Equation 4 but uses memory representations** *mI* and *mI*’ instead of *f*(*vI*) and *f*(*vI*’), respectively.
- **The second term** does two things: (1) it **encourages the representation** *f*(*vI*) to be similar to its memory representation *mI*, thereby dampening the parameter updates; and (2) it **encourages the representations** *f*(*vI*) and *f*(*vI*’) to be dissimilar.
- *λ*=0.5 is used as default. Setting *λ*=0 gives the loss used in Instance Discrimination.
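Written out, the combination of the two NCE terms described above takes the following form. This is a sketch that assumes the paper's notation, with L_NCE denoting the NCE loss from the previous section.

```latex
L(I, I^t) =
  \lambda \, L_{\text{NCE}}\left(m_I, g(v_{I^t})\right)
  + (1 - \lambda) \, L_{\text{NCE}}\left(m_I, f(v_I)\right)
```

With *λ*=0 only the second term remains, which involves no transformed image and is consistent with the statement that the loss then reduces to that of Instance Discrimination.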

# 3. PIRL Details

## 3.1. Some Details

- **ResNet-50** (R-50) is used to compute image representations for both *I* and *It*.
- The representation of *I*, *f*(*vI*), is computed by extracting res5 features, average pooling, and a linear projection to obtain a 128-dimensional representation.

## 3.2. PIRL With Jigsaw

- The representation *g*(*vIt*) of a transformed image *It* is computed as follows:

1. **extract nine patches** from image *I*,
2. **compute an image representation for each patch** separately by extracting activations from the res5 layer of the ResNet-50 and average pooling the activations,
3. **apply a linear projection** to obtain a 128-dimensional patch representation, and
4. **concatenate the patch representations** in random order and **apply a second linear projection** on the result to obtain **the final 128-dimensional image representation**, *g*(*vIt*).
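The head applied to the nine patch features can be sketched as below. This is a toy illustration, not the authors' code: `linear` and `jigsaw_head` are hypothetical names, the pooled res5 features are assumed to be given as plain lists of floats, bias terms are omitted, and the weight matrices are passed in rather than learned.

```python
import random


def linear(weights, vector):
    """Apply a bias-free linear projection; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def jigsaw_head(patch_features, patch_proj, final_proj, rng):
    """Sketch of g(.) for Jigsaw-transformed images.

    patch_features -- nine pooled feature vectors, one per patch
    patch_proj     -- first projection (128 x pooled_dim in the text)
    final_proj     -- second projection (128 x 9*128 in the text)
    rng            -- random.Random instance that picks the patch order
    """
    # Project every patch separately.
    projected = [linear(patch_proj, p) for p in patch_features]
    # Concatenate the patch representations in random order.
    order = list(range(len(projected)))
    rng.shuffle(order)
    concatenated = [x for i in order for x in projected[i]]
    # The second projection yields the final image representation.
    return linear(final_proj, concatenated)
```

Because the patch order is random, the second projection sees a different arrangement each time, which is what the invariance objective has to absorb.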

# 4. SOTA Comparison

## 4.1. Object Detection

- **NPID**: A special case of PIRL obtained by setting *λ*=0.
- **NPID++**: NPID using **more negative samples** and training for **more epochs**.

PIRL outperforms all alternative self-supervised learners in terms of all three AP measures.

- Compared to pretraining on the Jigsaw pretext task, PIRL achieves AP improvements of 5 points.
- PIRL also outperforms NPID++.

Interestingly, PIRL even outperforms the supervised ImageNet-pretrained model in terms of the more conservative APall and AP75 metrics. Similar to concurrent work MoCo, a self-supervised learner can outperform supervised pre-training for object detection. This result is a substantial improvement.

On VOC07, PIRL also outperforms supervised pretraining when finetuning is done on the much smaller VOC07 train+val set.

## 4.2. Image Classification with Linear Models

- The quality of image representations is evaluated by training **linear classifiers on fixed image representations**.

## 4.2.1. ImageNet

On ImageNet, PIRL improves recognition accuracies by over 15% compared to its covariant counterpart, Jigsaw.

PIRL achieves the highest single-crop top-1 accuracy of all self-supervised learners that use a single ResNet-50 model.

- **NPID++** achieves a single-crop top-1 accuracy of **59%**, which is higher than or on par with existing work that uses a single ResNet-50. Yet, **PIRL substantially outperforms NPID++**.
- **“PIRL-ens.”**: using **two ResNet-50** models, a top-1 accuracy of **65.7%** is obtained.
- **“PIRL-c2x”**: **doubling the number of channels** in ResNet-50 gives **67.4%**, which is close to the accuracy obtained by AMDIM [4] with a model that has 6× more parameters, as shown in the first figure at the top.

As in the figure at the top, PIRL outperforms all prior self-supervised learners on ImageNet in terms of the trade-off between model accuracy and size.

## 4.2.2. Other Datasets

PIRL sets a new state-of-the-art for self-supervised representations in this learning setting on the VOC07, Places205, and iNaturalist datasets.

## 4.3. Semi-Supervised Image Classification

- PIRL performs at least as well as S⁴L [75] and better than VAT [20].
- PIRL also outperforms Jigsaw and NPID++.

## 4.4. Pre-Training on Uncurated Image Data

When evaluated by training linear classifiers on fixed image representations, PIRL trained on YFCC1M even outperforms Jigsaw and DeeperCluster models that were trained on 100× more data (YFCC100M) from the same distribution.

# 5. Further Analysis

## 5.1. Does PIRL learn invariant representations?

The distances between Jigsaw representations of an image and its transformed version have a much larger mean and variance than those of PIRL, which suggests that Jigsaw representations covary with the image transformations that were applied.
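A measurement of this kind can be sketched as follows. The sketch assumes that the distance is the l2 distance between unit-normalized representations of *I* and *It*; the helper names `l2_normalize`, `invariance_distances` and `summarize` are hypothetical, and representations are plain lists of floats.

```python
import math
import statistics


def l2_normalize(v):
    """Scale a non-zero vector to unit l2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def invariance_distances(originals, transformed):
    """l2 distance between each normalized pair (rep of I, rep of It)."""
    distances = []
    for a, b in zip(originals, transformed):
        a_hat, b_hat = l2_normalize(a), l2_normalize(b)
        distances.append(
            math.sqrt(sum((x - y) ** 2 for x, y in zip(a_hat, b_hat)))
        )
    return distances


def summarize(distances):
    """Mean and population variance of the distances."""
    return statistics.mean(distances), statistics.pvariance(distances)
```

A perfectly invariant model would give distances of zero for every pair, so a small mean and variance indicate representations that barely move under the transformation.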

## 5.2. Which layer produces the best representations?

The best image representations are extracted from the res5 layer of PIRL-trained networks.

## 5.3. What is the effect of *λ* in the PIRL loss function?

The performance of PIRL is quite sensitive to the setting of *λ*, and the best performance is obtained by setting *λ*=0.5.

## 5.4. What is the effect of the number of image transforms?

PIRL outperforms Jigsaw for all cardinalities of *T*, and PIRL particularly benefits from being able to use very large numbers of image transformations (i.e., large |*T*|) during training.

## 5.5. What is the effect of the number of negative samples?

Increasing the number of negatives tends to have a positive influence on the quality of the image representations constructed by PIRL.

## 5.6. Generalizing PIRL to Other Pretext Tasks

- The pretext image transforms from both the Jigsaw and Rotation tasks are combined in the set of image transformations, *T*. Specifically, *It* is obtained by **first applying a rotation** and **then performing a Jigsaw** transformation.
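The combined transform can be sketched on a toy image as follows. Everything here is illustrative: the image is a square grid given as a list of rows, rotations are assumed to be multiples of 90 degrees, the Jigsaw step is assumed to cut a 3×3 grid of equal patches, and `rotate90`, `jigsaw` and `rotation_then_jigsaw` are hypothetical names.

```python
import random


def rotate90(image, k):
    """Rotate a 2-D grid (list of rows) clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def jigsaw(image, rng, grid=3):
    """Cut the grid into grid x grid patches and shuffle their order."""
    size = len(image) // grid
    patches = [
        [row[c * size:(c + 1) * size] for row in image[r * size:(r + 1) * size]]
        for r in range(grid)
        for c in range(grid)
    ]
    rng.shuffle(patches)
    return patches


def rotation_then_jigsaw(image, rng):
    """First apply a random rotation, then the Jigsaw transformation."""
    k = rng.randrange(4)
    return jigsaw(rotate90(image, k), rng)
```

Composing the two steps multiplies the number of distinct transformations, since each rotation can be paired with every patch permutation.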

The results demonstrate that combining image transforms from multiple pretext tasks can further improve image representations.

## Reference

[2020 CVPR] [PIRL]

Self-Supervised Learning of Pretext-Invariant Representations

