Review — BYOL: Bootstrap Your Own Latent A New Approach to Self-Supervised Learning
BYOL, by DeepMind and Imperial College
2020 NeurIPS, Over 1100 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Unsupervised Learning, Teacher Student, Representation Learning, Image Classification, Semantic Segmentation, Depth Prediction
- BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other.
- From an augmented view of an image, the online network is trained to predict the target network representation of the same image under a different augmented view.
- At the same time, the target network is updated with a slow-moving average of the online network.
- BYOL does NOT need any negative samples since it is NOT trained using contrastive loss.
- Bootstrap Your Own Latent (BYOL)
- Experimental Results
- Ablation Study
1. Bootstrap Your Own Latent (BYOL)
- BYOL’s goal is to learn a representation yθ which can then be used for downstream tasks.
- BYOL uses two neural networks to learn: the online and target networks.
1.1. Online Network
- The online network is defined by a set of weights θ and is comprised of three stages: an encoder fθ, a projector gθ and a predictor qθ.
1.2. Target Network
- The target network has the same architecture as the online network, but uses a different set of weights ξ.
- The target network provides the regression targets to train the online network, and its parameters ξ are an exponential moving average (EMA) of the online parameters θ. More precisely, given a target decay rate τ ∈ [0, 1], after each training step, it is updated as follows: ξ ← τξ + (1 − τ)θ.
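A minimal sketch of this moving-average update, assuming the parameters are stored as flat lists of floats (the function name `ema_update` and the toy values are illustrative, not from the paper):

```python
def ema_update(target, online, tau):
    """Return tau * target + (1 - tau) * online, element by element."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


# Hypothetical toy parameters; a real network holds millions of weights.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

# One training step with decay rate 0.99: the target moves 1% of the way
# from its old value toward the online parameters.
target = ema_update(target, online, tau=0.99)
print(target)
```

With τ = 1 the target never changes, and with τ = 0 it is a hard copy of the online network at every step.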
- Given a set of images D, an image x~D sampled uniformly from D, and two distributions of image augmentations T and T’, BYOL produces two augmented views v=t(x) and v’=t’(x) from x by applying respectively image augmentations t~T and t’~T’.
- From the first augmented view v, the online network outputs a representation yθ=fθ(v) and a projection zθ=gθ(y).
- The target network outputs y’ξ=fξ(v’) and the target projection z’ξ=gξ(y’) from the second augmented view v’.
- A prediction qθ(zθ) of z’ξ is output. Then qθ(zθ) and z’ξ are both l2-normalized.
- The predictor is only applied to the online branch, making the architecture asymmetric between the online and target pipeline.
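- Finally, the mean squared error between the normalized predictions and target projections is computed: Lθ,ξ = ‖q̄θ(zθ) − z̄’ξ‖² = 2 − 2·⟨qθ(zθ), z’ξ⟩ / (‖qθ(zθ)‖·‖z’ξ‖), where the bars denote l2-normalized vectors.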
- By separately feeding v’ to the online network and v to the target network, a second loss ~Lθ,ξ is computed, and the loss is symmetrized as L^BYOLθ,ξ = Lθ,ξ + ~Lθ,ξ. Each stochastic optimization step minimizes this loss with respect to θ only, not ξ, as depicted by the stop-gradient (sg): θ ← optimizer(θ, ∇θ L^BYOLθ,ξ, η), ξ ← τξ + (1 − τ)θ,
- where optimizer is an optimizer and η is a learning rate.
- At the end of training, only the encoder fθ is kept.
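The loss described above can be sketched in plain Python. This is only an illustration on hand-written vectors: the function names are hypothetical, and the vectors stand in for the outputs of the predictor and the target projector. It also checks that the squared distance between l2-normalized vectors equals 2 − 2 × cosine similarity.

```python
import math


def l2_normalize(v):
    """Scale v to unit Euclidean norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def regression_loss(prediction, target_projection):
    """Squared distance between the l2-normalized prediction and target."""
    p = l2_normalize(prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def symmetrized_loss(pred_v, targ_v_prime, pred_v_prime, targ_v):
    """Sum of the loss for (v -> v') and the loss with the views swapped."""
    return (regression_loss(pred_v, targ_v_prime)
            + regression_loss(pred_v_prime, targ_v))


# Hypothetical prediction and target projection for one image.
q = [0.5, 1.0, -0.3]
z = [0.4, 0.9, 0.1]
loss = regression_loss(q, z)
print(loss, 2.0 - 2.0 * cosine_similarity(q, z))
```

In a real training loop the target projections would be treated as constants (the stop-gradient), so only the online parameters receive gradients.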
1.4. BYOL Avoids Collapse Solution
- Since BYOL does not use negative samples, one may ask whether it collapses to a trivial solution, such as a constant representation for every image.
The authors explain that BYOL moves ξ closer to θ, incorporating sources of variability captured by the online projection into the target projection. A hard copy of the online parameters θ into the target parameters ξ would already be enough to propagate new sources of variability. This variability helps BYOL avoid the collapsed solution.
- However, sudden changes in the target network might break the assumption of an optimal predictor, in which case BYOL’s loss is not guaranteed to be close to the conditional variance. It is hypothesized that the main role of BYOL’s moving-averaged target network is to ensure the near-optimality of the predictor over training.
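To make the collapse concern concrete, the toy example below uses two hypothetical "collapsed" networks that emit the same vector for every view. The normalized squared error is then exactly zero for every pair of views, even though the output carries no information about the input. The names and values are illustrative only.

```python
import math


def normalized_sq_error(p, z):
    """Squared distance between p and z after l2-normalizing each."""
    p_norm = math.sqrt(sum(x * x for x in p))
    z_norm = math.sqrt(sum(x * x for x in z))
    return sum((a / p_norm - b / z_norm) ** 2 for a, b in zip(p, z))


# Hypothetical collapsed networks: the output ignores the input view.
def collapsed_online(view):
    return [0.3, -0.7, 0.2]


def collapsed_target(view):
    return [0.3, -0.7, 0.2]


views = ["view_a", "view_b", "view_c"]
losses = [
    normalized_sq_error(collapsed_online(v), collapsed_target(w))
    for v in views
    for w in views
]
print(max(losses))
```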
2. Experimental Results
2.1. Linear Evaluation on ImageNet
- A linear classifier is trained on top of the frozen representation to evaluate BYOL’s representation.
With a standard ResNet-50 (1×), BYOL obtains 74.3% top-1 accuracy (91.6% top-5 accuracy), which is a 1.3% (resp. 0.5%) improvement over the previous self-supervised state of the art.
- This tightens the gap with respect to the supervised baseline, 76.5%, but is still significantly below the stronger supervised baseline, 78.9%.
- With deeper and wider architectures, BYOL consistently outperforms the previous state of the art, and obtains a best performance of 79.6% top-1 accuracy, ranking higher than previous self-supervised approaches.
On a ResNet-50 (4×), BYOL achieves 78.6%, similar to the 78.9% of the best supervised baseline for the same architecture.
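For readers unfamiliar with the metrics, the sketch below shows one way to compute top-1 and top-5 style accuracy from classifier scores. The function name and the tiny score matrix are made up for illustration; the numbers reported above come from the full ImageNet validation set.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 3 examples over 4 classes.
scores = [
    [0.10, 0.70, 0.20, 0.00],
    [0.50, 0.10, 0.30, 0.10],
    [0.25, 0.15, 0.50, 0.10],
]
labels = [1, 2, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```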
2.2. Semi-Supervised Training on ImageNet
- BYOL’s representation is fine-tuned on a classification task with a small subset of ImageNet’s train set, using label information.
BYOL consistently outperforms previous approaches across a wide range of architectures.
2.3. Transfer to Other Classification Tasks
- BYOL outperforms SimCLR on all benchmarks and the Supervised-IN baseline on 7 of the 12 benchmarks, providing only slightly worse performance on the 5 remaining benchmarks.
BYOL’s representation can be transferred over to small images.
2.4. Transfer to Other Vision Tasks
- On VOC2012 semantic segmentation, BYOL outperforms both the Supervised-IN baseline (+1.9 mIoU) and SimCLR (+1.1 mIoU).
- On object detection, using Faster R-CNN as detector, BYOL is significantly better than the Supervised-IN baseline (+3.1 AP50) and SimCLR (+2.3 AP50).
- On NYU v2 depth estimation, BYOL is better or on par with other methods for each metric. For instance, the challenging percent of pixels (pct.) <1.25 measure is respectively improved by +3.5 points and +1.3 points compared to supervised and SimCLR baselines.
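The pct. < 1.25 measure is commonly defined as the percentage of pixels whose predicted depth is within a factor of 1.25 of the ground truth, i.e. max(pred/gt, gt/pred) < 1.25. A minimal sketch under that assumption, with made-up depth values and a hypothetical function name:

```python
def pct_within_threshold(pred, gt, threshold=1.25):
    """Percentage of pixels with max(pred/gt, gt/pred) below the threshold.

    Assumes all depths are strictly positive.
    """
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    return 100.0 * sum(r < threshold for r in ratios) / len(ratios)


# Hypothetical depths in meters for four pixels.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [1.1, 2.6, 3.0, 8.0]

# Ratios are 1.1, 1.3, 1.0 and 2.0, so two of the four pixels pass.
print(pct_within_threshold(pred, gt))
```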
3. Ablation Study
3.1. Batch Size (Left)
- The performance of SimCLR rapidly deteriorates as the batch size decreases, likely due to the decrease in the number of negative examples.
In contrast, the performance of BYOL remains stable over a wide range of batch sizes from 256 to 4096, and only drops for smaller values due to batch normalization layers in the encoder.
3.2. Image Augmentation (Right)
- Contrastive methods are sensitive to the choice of image augmentations. For instance, SimCLR does not work well when removing color distortion from its image augmentations.
- Instead, BYOL is incentivized to keep any information captured by the target representation in its online network, to improve its predictions. Therefore, even if augmented views of the same image share the same color histogram, BYOL is still incentivized to retain additional features in its representation. For that reason, BYOL is more robust to the choice of image augmentations than contrastive methods.
- BYOL uses the projected representation of a target network, whose weights are an exponential moving average of the weights of the online network, as target for its predictions.
This way, the weights of the target network represent a delayed and more stable version of the weights of the online network.
- There is a trade-off between updating the targets too often and updating them too slowly.
All values of the decay rate between 0.9 and 0.999 yield performance above 68.4% top-1 accuracy at 300 epochs.
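A rough way to see the trade-off: unrolling an exponential moving average with decay τ shows that the target gives weight (1 − τ)·τ^k to the online parameters from k steps ago (ignoring the initial target). The sketch below counts how many recent steps make up half of the target, assuming a constant τ < 1; the helper names are illustrative.

```python
def weight_of_past_step(tau, k):
    """Weight the target gives to the online parameters from k steps ago."""
    return (1.0 - tau) * tau ** k


def steps_to_cover(tau, mass=0.5):
    """Smallest number of most recent steps whose weights sum to >= mass.

    Assumes 0 <= tau < 1 and 0 < mass < 1; tau = 1 would never terminate.
    """
    total, k = 0.0, 0
    while total < mass:
        total += weight_of_past_step(tau, k)
        k += 1
    return k


for tau in (0.9, 0.99, 0.999):
    print(tau, steps_to_cover(tau))
```

A small τ makes the target track the online network within a handful of steps, while a τ close to 1 makes it lag by hundreds of steps, which is the "too often" versus "too slowly" tension described above.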
3.4. Comparison With SimCLR
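- The objective is extended with InfoNCE loss for fair comparison: InfoNCEθ^{α,β} = (2/B) Σ_{i=1..B} Sθ(vi, v’i) − β · (2α/B) Σ_{i=1..B} ln( Σ_{j≠i} exp(Sθ(vi, vj)/α) + Σ_j exp(Sθ(vi, v’j)/α) ),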
- where α>0 is a fixed temperature, β∈[0, 1] a weighting coefficient, B the batch size.
- Sθ quantifies pairwise similarity between augmented views. In particular, for any pair of augmented views u1 and u2, the normalized dot product is used: Sθ(u1, u2) = ⟨Φ(u1), Ψ(u2)⟩ / (‖Φ(u1)‖·‖Ψ(u2)‖).
- The SimCLR loss is recovered with Φ(u1)=zθ(u1) (no predictor), Ψ(u2)=zθ(u2) (no target network) and β=1.
The only variant that performs well without negative examples (i.e., with β=0) is BYOL, using both a bootstrap target network and a predictor.
- Adding the negative pairs to BYOL’s loss without re-tuning the temperature parameter hurts its performance.
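The role of β can be sketched in code. The block below assumes the extended objective takes the form of a positive-pair similarity term minus β times a temperature-scaled log-sum-exp over the other views in the batch; the exact normalization constants, the function names, and the toy vectors are assumptions for illustration. With β = 0 only the positive-pair term remains, which is the setting without negative examples.

```python
import math


def similarity(a, b):
    """Normalized dot product between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def extended_objective(phi_v, psi_v, psi_v_prime, alpha, beta):
    """Positive-pair term minus beta times a log-sum-exp over negatives.

    phi_v[i]       : Phi applied to view v_i
    psi_v[i]       : Psi applied to view v_i
    psi_v_prime[i] : Psi applied to view v'_i
    """
    batch = len(phi_v)
    positive = 0.0
    negative = 0.0
    for i in range(batch):
        positive += similarity(phi_v[i], psi_v_prime[i])
        # Other images seen through the same augmentation branch.
        same_branch = sum(
            math.exp(similarity(phi_v[i], psi_v[j]) / alpha)
            for j in range(batch) if j != i
        )
        # Every image seen through the other augmentation branch.
        other_branch = sum(
            math.exp(similarity(phi_v[i], psi_v_prime[j]) / alpha)
            for j in range(batch)
        )
        negative += math.log(same_branch + other_branch)
    return (2.0 / batch) * positive - beta * (2.0 * alpha / batch) * negative


# Hypothetical 2-D embeddings for a batch of two images.
phi_v = [[1.0, 0.0], [0.0, 1.0]]
psi_v = [[1.0, 0.0], [0.0, 1.0]]
psi_v_prime = [[0.9, 0.1], [0.2, 0.8]]

no_negatives = extended_objective(phi_v, psi_v, psi_v_prime, alpha=0.1, beta=0.0)
with_negatives = extended_objective(phi_v, psi_v, psi_v_prime, alpha=0.1, beta=1.0)
print(no_negatives, with_negatives)
```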
3.5. Projector Depth and Dimensions
Using the default projector and predictor of depth 2 yields the best performance. A dimension of 512 gives the best accuracy.
- (There are still many results in the appendix; please feel free to read the paper.)
Using the Teacher Student architecture, BYOL outperforms SimCLR without the use of negative samples for contrastive learning.
(This architecture is similar to the one in BoWNet but BoWNet uses the bag of visual words. On the other hand, this architecture is similar to the one in Mean Teacher, a semi-supervised learning method where EMA is also used for weight update. Yet, Mean Teacher uses the classification loss as well.)
[2020 NeurIPS] [BYOL]
Bootstrap Your Own Latent A New Approach to Self-Supervised Learning