Review — Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Barlow Twins, No Need for Large Batch Size, No Need for Asymmetry

Sik-Ho Tsang
5 min read · Jul 26, 2022

Barlow Twins: Self-Supervised Learning via Redundancy Reduction (Barlow Twins), by Facebook AI Research and New York University,
2021 ICML, Over 400 Citations (Sik-Ho Tsang)
Self-Supervised Learning, Semi-Supervised Learning, Image Classification

  • An objective function is proposed that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. The method is called Barlow Twins, owing to neuroscientist H. Barlow’s redundancy-reduction principle applied to a pair of identical networks.
  • It does not require large batches or asymmetry between the network twins, such as a predictor network, gradient stopping, or a moving average on the weight updates.
  • This is a paper from LeCun’s Research Group.

Outline

  1. Barlow Twins: Framework & Loss Function
  2. Barlow Twins: Intuitions Behind & Advantages
  3. Experimental Results

1. Barlow Twins: Framework & Loss Function

Barlow Twins Framework

1.1. Framework

  • Barlow Twins produces two distorted views for all images of a batch X sampled from a dataset. The distorted views are obtained via a distribution of data augmentations T.
  • The two batches of distorted views YA and YB are then fed to a function f, typically a deep network with trainable parameters θ, producing batches of embeddings ZA and ZB respectively.
  • To simplify notations, ZA and ZB are assumed to be mean-centered along the batch dimension, such that each unit has mean output 0 over the batch.
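
To make the framework concrete, here is a minimal PyTorch sketch of the twin forward pass. It is only an illustration under assumptions, not the official code: the augmentation pipeline is simplified to cropping and flipping, and the names augment and embed are placeholders.

```python
import torch
import torchvision
import torchvision.transforms as T

# Simplified augmentation pipeline (random crop + flip only); the paper's full
# pipeline also includes color jitter, grayscale, blur, and solarization.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
])

# The same network f (shared weights) processes both distorted views.
encoder = torchvision.models.resnet50(weights=None)
encoder.fc = torch.nn.Identity()              # keep the 2048-d output

def embed(x):
    z = encoder(x)
    return z - z.mean(dim=0)                  # mean-center each unit over the batch

x = torch.rand(8, 3, 256, 256)                       # a toy batch X
y_a = torch.stack([augment(img) for img in x])       # distorted views Y^A
y_b = torch.stack([augment(img) for img in x])       # distorted views Y^B
z_a, z_b = embed(y_a), embed(y_b)                    # embeddings Z^A, Z^B
```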

1.2. Loss Function

  • It has an innovative loss function LBT:

$$\mathcal{L}_{BT} \triangleq \sum_i (1 - \mathcal{C}_{ii})^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2$$

  • where λ is a positive constant trading off the importance of the first and second terms of the loss, and where C is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension:

$$\mathcal{C}_{ij} \triangleq \frac{\sum_b z^A_{b,i} \, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2} \, \sqrt{\sum_b \left(z^B_{b,j}\right)^2}}$$

  • where b indexes batch samples and i, j index the vector dimension of the networks’ outputs.

C is a square matrix with size the dimensionality of the network’s output, and with values comprised between -1 (i.e. perfect anti-correlation) and 1 (i.e. perfect correlation).
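
Putting these definitions together, a minimal PyTorch sketch of the loss could look as follows. The helper name barlow_twins_loss is my own, and the default λ = 5e-3 is the trade-off value reported in the paper; this is an illustrative sketch, not the official implementation.

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_=5e-3):
    # Sketch of the Barlow Twins objective (lambda_ = 5e-3 as reported in the paper).
    n, d = z_a.shape
    # Standardize each embedding dimension along the batch so that C is a true
    # cross-correlation matrix with entries in [-1, 1].
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-9)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-9)

    c = (z_a.T @ z_b) / n                                        # D x D cross-correlation matrix C

    invariance = (torch.diagonal(c) - 1).pow(2).sum()            # pull C_ii toward 1
    off_diag = c - torch.diag(torch.diagonal(c))
    redundancy = off_diag.pow(2).sum()                           # push C_ij (i != j) toward 0
    return invariance + lambda_ * redundancy

# Toy usage with random embeddings:
z_a, z_b = torch.randn(128, 512), torch.randn(128, 512)
print(barlow_twins_loss(z_a, z_b))
```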

2. Barlow Twins: Intuitions Behind & Advantages

2.1. Intuitions Behind

  • Intuitively, the invariance term of the objective, by trying to equate the diagonal elements of the cross-correlation matrix to 1, makes the embedding invariant to the distortions applied.
  • The redundancy reduction term, by trying to equate the off-diagonal elements of the cross-correlation matrix to 0, decorrelates the different vector components of the embedding.

2.2. Advantages

  • It does not require a large number of negative samples and can thus operate on small batches.
  • It benefits from very high-dimensional embeddings.

2.3. Some Details

  • The image augmentation pipeline consists of the following transformations: random cropping, resizing to 224×224, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization.
  • The first two transformations (cropping and resizing) are always applied, while the last five are applied randomly, with some probability.
  • ResNet-50 (without the final classification layer) is used as the encoder, with 2048 output units, followed by a projector network. The projector network has 3 linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units.
  • The output of the encoder is called the ‘representations’ and the output of the projector is called the ‘embeddings’.
  • The representations are used for downstream tasks and the embeddings are fed to the loss function of Barlow Twins.
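
A rough PyTorch sketch of this encoder–projector stack is shown below. The layer sizes follow the description above; details such as the bias settings are assumptions of mine rather than values taken from the official code.

```python
import torch
import torchvision

# Encoder: ResNet-50 with the classification layer removed (2048-d representations).
encoder = torchvision.models.resnet50(weights=None)
encoder.fc = torch.nn.Identity()

# Projector: three 8192-unit linear layers; BN + ReLU after the first two only.
projector = torch.nn.Sequential(
    torch.nn.Linear(2048, 8192, bias=False),
    torch.nn.BatchNorm1d(8192),
    torch.nn.ReLU(inplace=True),
    torch.nn.Linear(8192, 8192, bias=False),
    torch.nn.BatchNorm1d(8192),
    torch.nn.ReLU(inplace=True),
    torch.nn.Linear(8192, 8192),
)

x = torch.rand(4, 3, 224, 224)
representation = encoder(x)            # used for downstream tasks
embedding = projector(representation)  # fed to the Barlow Twins loss
```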

3. Experimental Results

3.1. Linear Evaluation on ImageNet

Top-1 and top-5 accuracies (in %) under linear evaluation on ImageNet
  • A linear classifier is trained on ImageNet on top of fixed representations of a ResNet-50 pretrained with Barlow Twins.

Barlow Twins obtains a top-1 accuracy of 73.2% which is comparable to the state-of-the-art methods.
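
For illustration, a minimal sketch of the linear-evaluation protocol is given below: the pretrained backbone is frozen and only a linear classifier on top of the fixed representations is trained. The optimizer, learning rate, and random data are placeholders, not the paper's actual training recipe.

```python
import torch
import torchvision

# Frozen backbone (the pretrained Barlow Twins weights would be loaded here) + a linear head.
backbone = torchvision.models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

classifier = torch.nn.Linear(2048, 1000)          # 1000 ImageNet classes
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

# One toy training step on random data standing in for ImageNet.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))
with torch.no_grad():
    feats = backbone(images)                      # fixed representations
optimizer.zero_grad()
loss = criterion(classifier(feats), labels)
loss.backward()
optimizer.step()
```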

3.2. Semi-Supervised Learning on ImageNet

Semi-supervised learning on ImageNet using 1% and 10% training examples
  • ResNet-50 pretrained with Barlow Twins is fine-tuned on a subset (1% or 10%) of ImageNet.

Barlow Twins is either on par (when using 10% of the data) or slightly better (when using 1% of the data) than competing methods.
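
Unlike linear evaluation, semi-supervised fine-tuning updates all network parameters on the labeled subset. A minimal sketch under assumptions (placeholder hyperparameters; random data stands in for the 1% or 10% ImageNet split):

```python
import torch
import torchvision

# All parameters are trainable; the pretrained Barlow Twins weights would be loaded here.
model = torchvision.models.resnet50(weights=None)
model.fc = torch.nn.Linear(2048, 1000)             # new classification head

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

# One toy fine-tuning step.
images = torch.rand(4, 3, 224, 224)
labels = torch.randint(0, 1000, (4,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```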

3.3. Transfer to Other Datasets and Tasks

Transfer learning: image classification
  • A linear classifier is trained on fixed image representations.

Barlow Twins performs competitively against prior work, and outperforms SimCLR and MoCo-v2 on most datasets.

Transfer learning: object detection and instance segmentation
  • VOC07+12 and COCO are used to fine-tune the CNN parameters.

Barlow Twins performs comparably or better than state-of-the-art representation learning methods for these localization tasks.

(There are still a lot of ablation results and appendices not yet presented. Please feel free to read the paper directly if interested.)

With the Barlow Twins loss function, self-supervised learning can be performed without the need for large batches or for asymmetry between the network twins, such as a predictor network, gradient stopping, or a moving average on the weight updates.

References

[2021 ICML] [Barlow Twins]
Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Self-Supervised Learning

1993 … 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] [BYOL+GN+WS] 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins]

My Other Previous Paper Readings
