Review — Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Barlow Twins: No Need for Large Batch Size, No Need for Asymmetry
Barlow Twins: Self-Supervised Learning via Redundancy Reduction,
Barlow Twins, by Facebook AI Research and New York University,
2021 ICML, Over 400 Citations (Sik-Ho Tsang)
Self-Supervised Learning, Semi-Supervised Learning, Image Classification
- An objective function is proposed that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. The method is called Barlow Twins, owing to neuroscientist H. Barlow’s redundancy-reduction principle applied to a pair of identical networks.
- It does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates.
- This is a paper from LeCun’s Research Group.
Outline
- Barlow Twins: Framework & Loss Function
- Barlow Twins: Intuitions Behind & Advantages
- Experimental Results
1. Barlow Twins: Framework & Loss Function
1.1. Framework
- Barlow Twins produces two distorted views for all images of a batch X sampled from a dataset. The distorted views are obtained via a distribution of data augmentations T.
- The two batches of distorted views YA and YB are then fed to a function f, typically a deep network with trainable parameters θ, producing batches of embeddings ZA and ZB respectively.
- To simplify notations, ZA and ZB are assumed to be mean-centered along the batch dimension, such that each unit has mean output 0 over the batch.
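As a rough sketch of this setup (the `augment` callable and the network `f` below are placeholder names, not the authors' code), producing the two mean-centered embedding batches might look like:

```python
def embed_two_views(f, augment, x):
    """Return mean-centered embeddings for two distorted views of a batch x.

    `f` is the encoder + projector with shared weights, and `augment` samples a
    distortion from the augmentation distribution T (both names are placeholders).
    """
    y_a, y_b = augment(x), augment(x)   # two distorted views of the same batch
    z_a, z_b = f(y_a), f(y_b)           # the same network processes both views
    z_a = z_a - z_a.mean(dim=0)         # mean-center each embedding dimension
    z_b = z_b - z_b.mean(dim=0)         # along the batch dimension
    return z_a, z_b
```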
1.2. Loss Function
- It has an innovative loss function, L_BT:

$$\mathcal{L}_{BT} = \underbrace{\sum_i (1 - \mathcal{C}_{ii})^2}_{\text{invariance term}} + \lambda \underbrace{\sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2}_{\text{redundancy reduction term}}$$

- where λ is a positive constant trading off the importance of the first and second terms of the loss, and where C is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension:

$$\mathcal{C}_{ij} = \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2}\, \sqrt{\sum_b \left(z^B_{b,j}\right)^2}}$$

- where b indexes batch samples and i, j index the vector dimension of the networks' outputs.
- C is a square matrix whose size is the dimensionality of the network's output, with values between -1 (i.e. perfect anti-correlation) and 1 (i.e. perfect correlation).
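A minimal PyTorch sketch of this loss, assuming `z_a` and `z_b` are the mean-centered N×D embedding batches from above (the function name and the default value of λ are illustrative):

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_=5e-3):
    """Barlow Twins loss from two batches of mean-centered embeddings (N x D)."""
    n, d = z_a.shape

    # Scale each dimension to unit standard deviation over the batch so that
    # c below becomes a cross-correlation matrix with entries in [-1, 1].
    z_a = z_a / z_a.std(dim=0)
    z_b = z_b / z_b.std(dim=0)

    c = (z_a.T @ z_b) / n  # D x D cross-correlation matrix along the batch dimension

    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction term
    return on_diag + lambda_ * off_diag
```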
2. Barlow Twins: Intuitions Behind & Advantages
2.1. Intuitions Behind
- Intuitively, the invariance term of the objective, by trying to equate the diagonal elements of the cross-correlation matrix to 1, makes the embedding invariant to the distortions applied.
- The redundancy reduction term, by trying to equate the off-diagonal elements of the cross-correlation matrix to 0, decorrelates the different vector components of the embedding.
2.2. Advantages
- It does not require a large number of negative samples and can thus operate on small batches.
- It benefits from very high-dimensional embeddings.
2.3. Some Details
- The image augmentation pipeline consists of the following transformations: random cropping, resizing to 224×224, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization.
- The first two transformations (cropping and resizing) are always applied, while the last five are applied randomly, with some probability.
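For reference, a possible torchvision version of this pipeline; the probabilities and the blur/solarization parameters below are illustrative (BYOL-style values), not necessarily the exact ones used in the paper:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),                                    # cropping + resizing: always applied
    T.RandomHorizontalFlip(p=0.5),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),   # color jittering
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    T.RandomSolarize(threshold=128, p=0.2),
    T.ToTensor(),
])
```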
- ResNet-50 (without its final classification layer) is used as the encoder, producing 2048-dimensional outputs, followed by a projector network. The projector network has 3 linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units (ReLU).
- The output of the encoder is called the ‘representations’ and the output of the projector is called the ‘embeddings’.
- The representations are used for downstream tasks and the embeddings are fed to the loss function of Barlow Twins.
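A sketch of this encoder + projector on top of torchvision's ResNet-50 (the class name and forward signature are my own; layer sizes follow the description above):

```python
import torch.nn as nn
import torchvision

class BarlowTwinsNet(nn.Module):
    """ResNet-50 encoder followed by a 3-layer projector, as described above."""

    def __init__(self, proj_dim=8192):
        super().__init__()
        backbone = torchvision.models.resnet50()
        backbone.fc = nn.Identity()              # drop the classification head: 2048-d representations
        self.encoder = backbone
        self.projector = nn.Sequential(
            nn.Linear(2048, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),       # last layer: no batch norm or ReLU
        )

    def forward(self, x):
        h = self.encoder(x)    # 'representations', used for downstream tasks
        z = self.projector(h)  # 'embeddings', fed to the Barlow Twins loss
        return h, z
```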
3. Experimental Results
3.1. Linear Evaluation on ImageNet
- A linear classifier is trained on ImageNet on top of fixed representations of a ResNet-50 pretrained with Barlow Twins.
Barlow Twins obtains a top-1 accuracy of 73.2%, which is comparable to state-of-the-art methods.
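A bare-bones sketch of this linear evaluation protocol (the checkpoint path and the hyperparameters are hypothetical; only the linear head is trained):

```python
import torch
import torch.nn as nn
import torchvision

# ResNet-50 backbone, assumed to be pretrained with Barlow Twins
encoder = torchvision.models.resnet50()
encoder.fc = nn.Identity()
# encoder.load_state_dict(torch.load("barlow_twins_resnet50.pth"))  # hypothetical checkpoint

for p in encoder.parameters():
    p.requires_grad = False          # representations stay fixed
encoder.eval()

linear_head = nn.Linear(2048, 1000)  # 1000 ImageNet classes

optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)  # illustrative hyperparameters
criterion = nn.CrossEntropyLoss()
# Training loop (per batch): logits = linear_head(encoder(images)); loss = criterion(logits, labels)
```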
3.2. Semi-Supervised Learning on ImageNet
- ResNet-50 pretrained with Barlow Twins is fine-tuned on a subset (1% or 10%) of ImageNet.
Barlow Twins is either on par with competing methods (when using 10% of the data) or slightly better (when using 1% of the data).
3.3. Transfer to Other Datasets and Tasks
- A linear classifier is trained on fixed image representations.
Barlow Twins performs competitively against prior work, and outperforms SimCLR and MoCo-v2 on most datasets.
- For object detection and instance segmentation, VOC07+12 and COCO are used to fine-tune the CNN parameters.
Barlow Twins performs comparably or better than state-of-the-art representation learning methods for these localization tasks.
(There are still a lot of ablation results and appendices not yet presented. Please feel free to read the paper directly if interested.)
With the Barlow Twins loss function, self-supervised learning can be performed without the need for large batches or asymmetry between the network twins, such as a predictor network, gradient stopping, or a moving average on the weight updates.
References
[2021 ICML] [Barlow Twins]
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Self-Supervised Learning
1993 … 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] [BYOL+GN+WS] 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins]