# Review — Barlow Twins: Self-Supervised Learning via Redundancy Reduction

## Barlow Twins, No Need Large Batch Size, No Need Asymmetry

---

Barlow Twins: Self-Supervised Learning via Redundancy Reduction, by Facebook AI Research and New York University

2021 ICML, over 400 citations (reviewed by Sik-Ho Tsang)

Self-Supervised Learning, Semi-Supervised Learning, Image Classification

**An objective function is proposed** that naturally avoids collapse by **measuring the cross-correlation matrix** between the outputs of two identical networks fed with distorted versions of a sample, and **making it as close to the identity matrix as possible**. The method is called **Barlow Twins**, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. **It does not require large batches nor asymmetry between the network twins** such as a *predictor network*, *gradient stopping*, or a *moving average* on the weight updates.

- This is a paper from **LeCun's** research group.

# Outline

1. **Barlow Twins: Framework & Loss Function**
2. **Barlow Twins: Intuitions Behind & Advantages**
3. **Experimental Results**

# 1. Barlow Twins: Framework & Loss Function

## 1.1. Framework

- Barlow Twins produces **two distorted views** for all images of a batch *X* sampled from a dataset. The distorted views are obtained via **a distribution of data augmentations** *T*.
- The two batches of **distorted views** *YA* and *YB* are then **fed to a function** *fθ*, typically a deep network with trainable parameters *θ*, **producing batches of embeddings** *ZA* and *ZB* respectively.
- To simplify notations, *ZA* and *ZB* are assumed to be mean-centered along the batch dimension, such that each unit has mean output 0 over the batch.
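The two-view pipeline above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: `augment` (additive noise) and the single linear map `f_theta` are toy stand-ins for the real augmentation distribution *T* and the deep network *fθ*.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, rng):
    # Toy stand-in for the augmentation distribution T; the paper
    # uses crops, flips, color jitter, blurring, etc.
    return x + 0.1 * rng.normal(size=x.shape)

def f_theta(x, w):
    # Toy stand-in for the encoder + projector f_theta (here one linear map).
    return x @ w

batch, in_dim, emb_dim = 32, 64, 16
X = rng.normal(size=(batch, in_dim))          # batch of inputs X
w = rng.normal(size=(in_dim, emb_dim))        # shared weights of the twins

Y_A, Y_B = augment(X, rng), augment(X, rng)   # two distorted views
Z_A, Z_B = f_theta(Y_A, w), f_theta(Y_B, w)   # batches of embeddings

# Mean-center along the batch dimension, as assumed in the paper.
Z_A = Z_A - Z_A.mean(axis=0)
Z_B = Z_B - Z_B.mean(axis=0)
```

Note that the same weights `w` process both views: the "twins" are one network applied twice, with no predictor head or momentum copy on either branch.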

## 1.2. Loss Function

- Barlow Twins is trained with an innovative **loss function** *LBT*:

$$\mathcal{L}_{BT} \triangleq \sum_i (1 - \mathcal{C}_{ii})^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2$$

- where *λ* is a **positive constant trading off** the importance of the first and second terms of the loss, and where *C* is the **cross-correlation matrix** computed between the outputs of the two identical networks along the batch dimension:

$$\mathcal{C}_{ij} \triangleq \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2}\, \sqrt{\sum_b \left(z^B_{b,j}\right)^2}}$$

- where *b* indexes batch samples and *i*, *j* index the vector dimension of the networks' outputs.

- *C* is a square matrix with size the dimensionality of the network's output, and with values comprised between −1 (i.e. perfect anti-correlation) and 1 (i.e. perfect correlation).
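Both equations can be computed directly in NumPy. The sketch below normalizes each output unit over the batch so that the diagonal of *C* measures correlation (in [−1, 1]); the default *λ* = 5·10⁻³ follows the paper, but treat the rest as an illustrative re-implementation rather than the official (PyTorch) code.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    # z_a, z_b: (batch, dim) embeddings of the two distorted views.
    n = z_a.shape[0]
    # Normalize each unit over the batch so C_ij lies in [-1, 1].
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.T @ z_b) / n                                  # cross-correlation matrix C
    invariance = ((1.0 - np.diag(c)) ** 2).sum()           # push diagonal -> 1
    redundancy = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # push off-diagonal -> 0
    return invariance + lam * redundancy
```

When the two embeddings agree and their components are decorrelated, *C* is the identity and the loss is zero; flipping the sign of one unit in one view drives the corresponding diagonal entry to −1 and incurs an invariance penalty of (1 − (−1))² = 4.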

# 2. Barlow Twins: Intuitions Behind & Advantages

## 2.1. Intuitions Behind

- Intuitively, the **invariance term** of the objective, by trying to equate the diagonal elements of the cross-correlation matrix to 1, **makes the embedding invariant to the distortions applied**.
- The **redundancy reduction term**, by trying to equate the off-diagonal elements of the cross-correlation matrix to 0, **decorrelates the different vector components of the embedding**.

## 2.2. Advantages

- It does not require a large number of negative samples and can thus operate on small batches.
- It benefits from very high-dimensional embeddings.

## 2.3. Some Details

- The image **augmentation pipeline** consists of the following transformations: **random cropping, resizing to 224×224, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization**.
- The **first two** transformations (cropping and resizing) are **always applied**, while the **last five** are **applied randomly**, with some probability.
- ResNet-50 is used as the encoder, outputting 2048 units, followed by a projector network. The projector network has 3 linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units.
- The **output of the encoder** is called the '**representations**' and the **output of the projector** is called the '**embeddings**'.
- The **representations** are used for **downstream tasks** and the **embeddings** are fed to the **loss function** of Barlow Twins.
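The projector architecture described above can be sketched at the shape level. This is a simplified NumPy sketch: dimensions are scaled down (encoder 2048 → 64, projector 8192 → 256) to keep it light, and `batch_norm` is a plain batch-wise normalization with no learnable scale/shift, unlike a real BN layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Simplified batch norm: normalize each unit over the batch only.
    return (x - x.mean(0)) / np.sqrt(x.var(0) + eps)

def projector(h, weights):
    # Three linear layers; BN + ReLU after the first two, as in the paper.
    w1, w2, w3 = weights
    x = relu(batch_norm(h @ w1))
    x = relu(batch_norm(x @ w2))
    return x @ w3

# Paper dims: encoder 2048-d, projector layers 8192-d; scaled down here.
enc_dim, proj_dim, batch = 64, 256, 4
h = rng.normal(size=(batch, enc_dim))                  # 'representations'
weights = [0.1 * rng.normal(size=(enc_dim, proj_dim)),
           0.1 * rng.normal(size=(proj_dim, proj_dim)),
           0.1 * rng.normal(size=(proj_dim, proj_dim))]
z = projector(h, weights)                              # 'embeddings'
```

After pretraining, the projector is discarded: `h` feeds downstream tasks, while `z` only ever feeds the Barlow Twins loss.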

# 3. Experimental Results

## 3.1. Linear Evaluation on ImageNet

- A linear classifier is trained on ImageNet on top of fixed representations of a ResNet-50 pretrained with Barlow Twins.

- Barlow Twins obtains a **top-1 accuracy of 73.2%**, which is **comparable to the state-of-the-art methods**.

## 3.2. Semi-Supervised Learning on ImageNet

- ResNet-50 pretrained with Barlow Twins is fine-tuned on a subset (1% or 10%) of ImageNet.

- Barlow Twins is either **on par (when using 10% of the data) or slightly better (when using 1% of the data)** than competing methods.

## 3.3. Transfer to Other Datasets and Tasks

- A linear classifier is trained on fixed image representations.

- Barlow Twins performs competitively against prior work, and **outperforms SimCLR and MoCo-v2 on most datasets**.

- VOC07+12 and COCO are used to fine-tune the CNN parameters.

- Barlow Twins performs **comparably or better than state-of-the-art representation learning** methods for these localization tasks.

(There are still a lot of ablation results and appendices not yet presented. Please feel free to read the paper directly if interested.)

With the Barlow Twins loss function, **self-supervised learning can be performed without the need of large batches nor asymmetry between the network twins** such as a *predictor network*, *gradient stopping*, or a *moving average* on the weight updates.

## References

[2021 ICML] [Barlow Twins] Barlow Twins: Self-Supervised Learning via Redundancy Reduction

## Self-Supervised Learning

**1993** … **2020** [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] [BYOL+GN+WS] **2021** [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins]