Review — BYOL Works Even Without Batch Statistics
BYOL+GN+WS, by DeepMind and Imperial College
2020 arXiv v1, Over 20 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Teacher Student, Image Classification, Batch Normalization, BN, Group Normalization, GN
- It has been hypothesized that BN is critical to preventing collapse in BYOL: because BN computes statistics across batch elements, it lets gradients flow across the batch and could implicitly leak information about the other (negative) views in the batch.
- This tech report shows that BYOL also performs well without BN, by replacing it with GN+WS, which uses no batch statistics at all.
1. BYOL Brief Review
- BYOL trains its representation using both an online network (parameterized by θ) and a target network (parameterized by ξ).
- As a part of the online network, it further defines a predictor network qθ that is used to predict target projections zξ′ using online projections zθ as inputs.
- Accordingly, the parameters of the online projection are updated following the gradients of the prediction loss.
- In turn, the target network weights ξ are updated as an exponential moving average (EMA) of the online network's weights θ: ξ ← ηξ + (1 − η)θ, with η being a decay parameter.
- As qθ(zθ) is a function of v and zξ′ is a function of v′, the BYOL loss can be seen as a measure of similarity between the views v and v′, and therefore resembles the positive term of the contrastive InfoNCE loss: L = ||q̄θ(zθ) − z̄ξ′||² = 2 − 2·⟨qθ(zθ), zξ′⟩ / (||qθ(zθ)||₂ ||zξ′||₂), where q̄ and z̄ denote ℓ₂-normalized vectors.
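The EMA update and the normalized-MSE loss above can be sketched in NumPy (a minimal sketch; the decay value and the toy vectors are illustrative, not the paper's settings):

```python
import numpy as np

def ema_update(target_params, online_params, eta=0.99):
    """Target-network EMA update: xi <- eta * xi + (1 - eta) * theta."""
    return [eta * xi + (1.0 - eta) * theta
            for xi, theta in zip(target_params, online_params)]

def byol_loss(q, z):
    """Normalized MSE between the prediction q = q_theta(z_theta) and the
    target projection z = z_xi'; equals 2 - 2 * cosine_similarity(q, z)."""
    q_hat = q / np.linalg.norm(q)
    z_hat = z / np.linalg.norm(z)
    return float(np.sum((q_hat - z_hat) ** 2))

# toy EMA step: target moves a small fraction toward the online weights
target = ema_update([np.zeros(3)], [np.ones(3)], eta=0.9)

# the loss only rewards agreement between the two views' representations
q = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, 1.0, -1.0])
loss = byol_loss(q, z)
```

Note that the loss contains no negative pairs, which is why the role of BN in preventing collapse is in question.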
2.1. Group Normalization (GN)
- For an activation tensor X of dimensions (N, H, W, C), GN first splits channels into G equally-sized groups, then normalizes activations with the mean and standard deviation computed over disjoint slices of size (1, H, W, C/G).
- If G=1, it is equivalent to Layer Norm (LN).
- When each group contains a single channel, i.e. G=C, it is equivalent to Instance Norm (IN).
- GN operates independently on each batch element and therefore does not rely on batch statistics.
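The normalization just described can be sketched in NumPy (a minimal sketch without the trainable scale and offset; the tensor shape is illustrative):

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    """Group Norm over an (N, H, W, C) tensor: channels are split into G
    equally-sized groups, and mean/variance are computed per (sample, group)
    slice of size (1, H, W, C // G) -- no batch statistics are involved."""
    N, H, W, C = x.shape
    x = x.reshape(N, H, W, G, C // G)
    mean = x.mean(axis=(1, 2, 4), keepdims=True)
    var = x.var(axis=(1, 2, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(N, H, W, C)

x = np.random.randn(2, 4, 4, 8)
y_gn = group_norm(x, G=2)   # Group Norm with 2 groups
y_ln = group_norm(x, G=1)   # G = 1  -> Layer Norm
y_in = group_norm(x, G=8)   # G = C -> Instance Norm
```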
2.2. Weight Standardization (WS)
- WS normalizes the weights corresponding to each activation using weight statistics.
- Each row of the weight matrix W is standardized to obtain a new weight matrix Ŵ, which is directly used in place of W during training.
- Only the normalized weights Ŵ are used to compute the convolution outputs, but the loss is differentiated with respect to the non-normalized weights W: Ŵᵢⱼ = (Wᵢⱼ − μᵢ) / σᵢ, with μᵢ = (1/I) Σⱼ Wᵢⱼ and σᵢ² = (1/I) Σⱼ (Wᵢⱼ − μᵢ)²,
- where I is the input dimension (i.e. the product of the input channel dimension and the kernel spatial dimensions).
- Contrary to BN, LN, and GN, WS does not create additional trainable weights.
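The row-wise standardization above can be sketched in NumPy (a minimal sketch; the filter shape is illustrative, and in a real framework the gradient would flow through the standardization to the raw W):

```python
import numpy as np

def weight_standardize(W, eps=1e-5):
    """Standardize each row of W (shape O x I, where I = in_channels *
    kernel_h * kernel_w). W_hat replaces W when computing the convolution,
    while the loss is differentiated w.r.t. the raw W; no new trainable
    weights are created."""
    mu = W.mean(axis=1, keepdims=True)
    sigma = W.std(axis=1, keepdims=True)
    return (W - mu) / (sigma + eps)

W = np.random.randn(16, 27)      # e.g. 16 filters of shape 3x3x3, flattened
W_hat = weight_standardize(W)    # each row now has mean ~0 and std ~1
```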
3. Experimental Results
3.1. Removing BN Causes Collapse
- First, it is observed that removing all instances of BN in BYOL leads to an accuracy (0.1%) no better than random. This is specific to BYOL, as SimCLR still performs reasonably well in this regime.
- Nevertheless, solely applying BN to the ResNet encoder is enough for BYOL to achieve high performance (72.1%). It is hypothesized that the main contribution of BN in BYOL is to compensate for improper initialization.
3.2. Proper Initialization Allows Working Without BN
- To confirm this hypothesis, a protocol is designed to mimic the effect of BN on initial scalings and training dynamics, without using or backpropagating through batch statistics.
- Before training, per-activation BN statistics are computed for each layer by running a single forward pass of the network, with BN, on a batch of augmented data. The BN layers are then removed, but their scale γ and offset β parameters are retained and kept trainable.
- Despite its comparatively low performance (65.7%), the trained representation still provides considerably better classification results than a random ResNet-50 backbone, and is therefore not collapsed.
This confirms that BYOL does not need BN to prevent collapse. The authors therefore explore other, refined element-wise normalization procedures.
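The initialization protocol can be sketched for a single linear layer (a simplified illustration; the actual experiment applies it to every layer of a ResNet-50 and keeps the trainable γ and β):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32)) * 3.0   # a (possibly badly scaled) layer
b = np.zeros(64)
x = rng.normal(size=(256, 32))        # one batch of augmented data

# 1) single forward pass with BN to measure per-activation statistics
h = x @ W.T + b
mu, sigma = h.mean(axis=0), h.std(axis=0) + 1e-5

# 2) remove BN, folding its initial normalization into the layer itself
#    (BN's trainable gamma and beta would be kept; here gamma=1, beta=0)
W_init = W / sigma[:, None]
b_init = (b - mu) / sigma

h_no_bn = x @ W_init.T + b_init  # matches the BN output at initialization
```

After this one-time rescaling, training proceeds with no batch statistics anywhere in the forward or backward pass.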
3.3. Using GN with WS Leads to Competitive Performance
- More precisely, all convolutional and linear layers are replaced by weight-standardized alternatives, and all BN layers are replaced by GN layers (with G=16 groups).
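Combining the two ingredients, a BN-free block can be sketched as a weight-standardized layer followed by GN with 16 groups (a minimal 1-D sketch; the layer shapes are illustrative):

```python
import numpy as np

def ws_linear(x, W, eps=1e-5):
    """Linear layer with Weight Standardization applied to W's rows."""
    W_hat = (W - W.mean(axis=1, keepdims=True)) / (W.std(axis=1, keepdims=True) + eps)
    return x @ W_hat.T

def group_norm_1d(h, G, eps=1e-5):
    """GN over (N, C) activations: per-sample, per-group normalization,
    with no batch statistics."""
    N, C = h.shape
    h = h.reshape(N, G, C // G)
    h = (h - h.mean(axis=2, keepdims=True)) / np.sqrt(h.var(axis=2, keepdims=True) + eps)
    return h.reshape(N, C)

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 32))
W = rng.normal(size=(64, 32))
out = group_norm_1d(ws_linear(x, W), G=16)  # GN with G=16, as in the report
```

Since neither WS nor GN touches the batch dimension, nothing in this block can leak information across batch elements.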