Review — Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Barlow Twins: No Need for Large Batch Size, No Need for Asymmetry
Barlow Twins: Self-Supervised Learning via Redundancy Reduction,
Barlow Twins, by Facebook AI Research and New York University,
2021 ICML, Over 400 Citations (Sik-Ho Tsang)
Self-Supervised Learning, Semi-Supervised Learning, Image Classification
- An objective function is proposed that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. The method is called Barlow Twins, owing to neuroscientist H. Barlow’s redundancy-reduction principle applied to a pair of identical networks.
- It does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates.
- This is a paper from LeCun’s Research Group.
Outline
- Barlow Twins: Framework & Loss Function
- Barlow Twins: Intuitions Behind & Advantages
- Experimental Results
1. Barlow Twins: Framework & Loss Function
1.1. Framework
- Barlow Twins produces two distorted views for all images of a batch X sampled from a dataset. The distorted views are obtained via a distribution of data augmentations T.
- The two batches of distorted views YA and YB are then fed to a function f, typically a deep network with trainable parameters θ, producing batches of embeddings ZA and ZB respectively.
- To simplify notations, ZA and ZB are assumed to be mean-centered along the batch dimension, such that each unit has mean output 0 over the batch.
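As a rough sketch of this setup (the `augment` callable and the network `f` below are placeholder names, not the authors' code), producing the two mean-centered embedding batches might look like:

```python
def embed_two_views(f, augment, x):
    """Return mean-centered embeddings for two distorted views of a batch x.

    `f` is the encoder + projector with shared weights, and `augment` samples a
    distortion from the augmentation distribution T (both names are placeholders).
    """
    y_a, y_b = augment(x), augment(x)   # two distorted views of the same batch
    z_a, z_b = f(y_a), f(y_b)           # the same network processes both views
    z_a = z_a - z_a.mean(dim=0)         # mean-center each embedding dimension
    z_b = z_b - z_b.mean(dim=0)         # along the batch dimension
    return z_a, z_b
```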
1.2. Loss Function
- It has an innovative loss function, L_BT:

$$\mathcal{L}_{BT} = \underbrace{\sum_i (1 - \mathcal{C}_{ii})^2}_{\text{invariance term}} + \lambda \underbrace{\sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2}_{\text{redundancy reduction term}}$$

- where λ is a positive constant trading off the importance of the first and second terms of the loss, and where C is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension:

$$\mathcal{C}_{ij} = \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b \left(z^A_{b,i}\right)^2}\, \sqrt{\sum_b \left(z^B_{b,j}\right)^2}}$$

- where b indexes batch samples and i, j index the vector dimension of the networks' outputs.
- C is a square matrix whose size is the dimensionality of the network's output, with values between -1 (i.e. perfect anti-correlation) and 1 (i.e. perfect correlation).
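A minimal PyTorch sketch of this loss, assuming `z_a` and `z_b` are the mean-centered N×D embedding batches from above (the function name and the default value of λ are illustrative):

```python
import torch

def barlow_twins_loss(z_a, z_b, lambda_=5e-3):
    """Barlow Twins loss from two batches of mean-centered embeddings (N x D)."""
    n, d = z_a.shape

    # Scale each dimension to unit standard deviation over the batch so that
    # c below becomes a cross-correlation matrix with entries in [-1, 1].
    z_a = z_a / z_a.std(dim=0)
    z_b = z_b / z_b.std(dim=0)

    c = (z_a.T @ z_b) / n  # D x D cross-correlation matrix along the batch dimension

    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy reduction term
    return on_diag + lambda_ * off_diag
```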
2. Barlow Twins: Intuitions Behind & Advantages
2.1. Intuitions Behind
- Intuitively, the invariance term of the objective, by trying to equate the diagonal elements of the cross-correlation matrix to 1, makes the embedding invariant to the distortions applied.
- The redundancy reduction term, by trying to equate the off-diagonal elements of the cross-correlation matrix to 0, decorrelates the different vector components of the embedding.
2.2. Advantages
- It does not require a large number of negative samples and can thus operate on small batches.
- It benefits from very high-dimensional embeddings.
2.3. Some Details
- The image augmentation pipeline consists of the following transformations: random cropping, resizing to 224×224, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization.
- The first two transformations (cropping and resizing) are always applied, while the last five are applied randomly, with some probability.
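For reference, a possible torchvision version of this pipeline; the probabilities and the blur/solarization parameters below are illustrative (BYOL-style values), not necessarily the exact ones used in the paper:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224),                                    # cropping + resizing: always applied
    T.RandomHorizontalFlip(p=0.5),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),   # color jittering
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    T.RandomSolarize(threshold=128, p=0.2),
    T.ToTensor(),
])
```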
- ResNet-50 (without its final classification layer) is used as the encoder, producing 2048-dimensional outputs, followed by a projector network. The projector network has 3 linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units (ReLU).
- The output of the encoder is called the ‘representations’ and the output of the projector is called the ‘embeddings’.
- The representations are used for downstream tasks and the embeddings are fed to the loss function of Barlow Twins.
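A sketch of this encoder + projector on top of torchvision's ResNet-50 (the class name and forward signature are my own; layer sizes follow the description above):

```python
import torch.nn as nn
import torchvision

class BarlowTwinsNet(nn.Module):
    """ResNet-50 encoder followed by a 3-layer projector, as described above."""

    def __init__(self, proj_dim=8192):
        super().__init__()
        backbone = torchvision.models.resnet50()
        backbone.fc = nn.Identity()              # drop the classification head: 2048-d representations
        self.encoder = backbone
        self.projector = nn.Sequential(
            nn.Linear(2048, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),       # last layer: no batch norm or ReLU
        )

    def forward(self, x):
        h = self.encoder(x)    # 'representations', used for downstream tasks
        z = self.projector(h)  # 'embeddings', fed to the Barlow Twins loss
        return h, z
```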
3. Experimental Results
3.1. Linear Evaluation on ImageNet
- A linear classifier is trained on ImageNet on top of fixed representations of a ResNet-50 pretrained with Barlow Twins.
Barlow Twins obtains a top-1 accuracy of 73.2%, which is comparable to state-of-the-art methods.
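A bare-bones sketch of this linear evaluation protocol (the checkpoint path and the hyperparameters are hypothetical; only the linear head is trained):

```python
import torch
import torch.nn as nn
import torchvision

# ResNet-50 backbone, assumed to be pretrained with Barlow Twins
encoder = torchvision.models.resnet50()
encoder.fc = nn.Identity()
# encoder.load_state_dict(torch.load("barlow_twins_resnet50.pth"))  # hypothetical checkpoint

for p in encoder.parameters():
    p.requires_grad = False          # representations stay fixed
encoder.eval()

linear_head = nn.Linear(2048, 1000)  # 1000 ImageNet classes

optimizer = torch.optim.SGD(linear_head.parameters(), lr=0.1, momentum=0.9)  # illustrative hyperparameters
criterion = nn.CrossEntropyLoss()
# Training loop (per batch): logits = linear_head(encoder(images)); loss = criterion(logits, labels)
```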
3.2. Semi-Supervised Learning on ImageNet
- ResNet-50 pretrained with Barlow Twins is fine-tuned on a subset (1% or 10%) of ImageNet.
Barlow Twins is either on par with competing methods (when using 10% of the data) or slightly better (when using 1% of the data).
3.3. Transfer to Other Datasets and Tasks
- A linear classifier is trained on fixed image representations.
Barlow Twins performs competitively against prior work, and outperforms SimCLR and MoCo-v2 on most datasets.
- For object detection and instance segmentation, VOC07+12 and COCO are used to fine-tune the CNN parameters.
Barlow Twins performs comparably or better than state-of-the-art representation learning methods for these localization tasks.
(There are still a lot of ablation results and appendices not yet presented. Please feel free to read the paper directly if interested.)
With the Barlow Twins loss function, self-supervised learning can be performed without the need for large batches or asymmetry between the network twins, such as a predictor network, gradient stopping, or a moving average on the weight updates.
References
[2021 ICML] [Barlow Twins]
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Self-Supervised Learning
1993 … 2020 [CMC] [MoCo] [CPCv2] [PIRL] [SimCLR] [MoCo v2] [iGPT] [BoWNet] [BYOL] [SimCLRv2] [BYOL+GN+WS] 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins]