Review — Whitening for Self-Supervised Representation Learning

Whitening MSE (W-MSE), No Asymmetric Network, No Negative Samples

Sik-Ho Tsang
5 min read · Aug 25, 2022

Whitening for Self-Supervised Representation Learning (W-MSE), by University of Trento
2021 ICML, Over 70 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Representation Learning, Unsupervised Learning

  • A new loss function W-MSE is proposed, which is based on the whitening of the latent-space features. The whitening operation has a “scattering” effect on the batch samples, avoiding degenerate solutions where all the sample representations collapse to a single point.
  • The proposed solution requires neither asymmetric networks nor negative samples.

Outline

  1. Whitening MSE (W-MSE)
  2. Experimental Results

1. Whitening MSE (W-MSE)

W-MSE Training Procedures

1.1. Preliminaries

  • Given an image x, an embedding z=f(x; θ) is extracted using a network f(·; θ). The network targets two goals: (1) the image embeddings are drawn from a non-degenerate distribution (i.e., avoiding solutions where all the representations collapse to a single point), and (2) positive image pairs (xi, xj), which share similar semantics, should be clustered close to each other.
  • The problem is formulated as (Eq. 1):

    min_θ E[dist(zi, zj)], subject to cov(zi, zi) = cov(zj, zj) = I,

  • where dist(·, ·) is a distance between vectors, e.g. the cosine similarity, I is the identity matrix, and (zi, zj) corresponds to a positive pair of images (xi, xj).
  • First, positive samples sharing the same semantics are obtained from a single image x using standard image transformation techniques T(·; p), where the parameters p are selected uniformly at random.
  • Thus, the augmented positive samples are xi=T(x; pi).
  • The number of positive samples per image d may vary. In the proposed MSE-based loss, all the possible d(d-1)/2 combinations of positive samples are used. d=2 (1 positive pair) and d=4 (6 positive pairs) are tried.
  • More precisely, an encoder E is used for extracting the representation h=E(x), a 512- or 2048-dimensional vector.
  • A nonlinear projection head g(·) is used to project h into a lower-dimensional space. It is implemented as an MLP with one hidden layer and a BN layer.
  • Thus, f is the composition of E and g (a sketch follows below).
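The pipeline described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' code: the augmentation parameters and the names make_views, backbone_dim, hidden_dim and proj_dim are illustrative choices for a CIFAR-scale setup.

```python
# Minimal PyTorch sketch of the setup above (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torchvision.transforms as T

# T(.; p): standard image transformations with randomly sampled parameters p.
augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def make_views(pil_images, d=2):
    """Produce d augmented positive samples x_i = T(x; p_i) per image."""
    return [torch.stack([augment(x) for x in pil_images]) for _ in range(d)]

class Projector(nn.Module):
    """g(.): MLP with one hidden layer and BatchNorm, projecting h to a lower dimension."""
    def __init__(self, backbone_dim=512, hidden_dim=1024, proj_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(backbone_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, proj_dim),
        )

    def forward(self, h):
        return self.net(h)

# f = g o E: e.g. E is a ResNet-18 trunk giving 512-d features h = E(x),
# and v = f(x) = g(E(x)) with g = Projector(backbone_dim=512).
```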

1.2. W-MSE

  • Given N original images and a batch of samples B={x1, …, xK}, where K=Nd, let V={v1, …, vK} be the corresponding batch of features obtained.
  • In the proposed W-MSE loss, the MSE is computed over all Nd(d-1)/2 positive pairs, where the constraint in Eq. 1 is satisfied using the reparameterization of the v variables with the whitened variables z:

    L_W-MSE(V) = 2/(Nd(d-1)) Σ dist(zi, zj),

  • where the sum is over the positive pairs (vi, vj) ∈ V and z = Whitening(v), defined as:

    Whitening(v) = W_V (v − μV),

  • where μV is the mean of the elements in V:

    μV = (1/K) Σk vk,

  • and the whitening matrix W_V satisfies:

    W_V^T W_V = ΣV^(−1),

  • where ΣV is the covariance matrix of V.

Whitening() performs the full whitening of each vi ∈ V and the resulting set of vectors Z={z1, …, zK} lies in a zero-centered distribution with a covariance matrix equal to the identity matrix.
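Below is a compact sketch of Whitening() and the W-MSE loss. It uses a Cholesky factorization of the batch covariance to obtain a whitening matrix with W_V^T W_V = ΣV^(−1); this is one standard way to realize the operator described above, and the eps stabilizer and the view layout of v are assumptions of this sketch rather than details stated in this review.

```python
# Sketch of Whitening() and the W-MSE loss (illustrative, not the authors' code).
# Assumption: v is a (K x D) batch with K = N*d, laid out so that
# v.view(d, N, D)[i] holds the i-th augmented view of every image.
import torch
import torch.nn.functional as F

def whiten(v, eps=1e-4):
    """z = W_V (v - mu_V): the whitened batch has zero mean and identity covariance."""
    v = v - v.mean(dim=0, keepdim=True)                         # subtract mu_V
    cov = (v.T @ v) / (v.shape[0] - 1)                          # Sigma_V
    cov = cov + eps * torch.eye(cov.shape[0], device=v.device)  # numerical stabilizer (assumed)
    L = torch.linalg.cholesky(cov)                              # Sigma_V = L L^T
    # W_V = L^{-1} satisfies W_V^T W_V = Sigma_V^{-1}; apply it via a triangular solve.
    return torch.linalg.solve_triangular(L, v.T, upper=False).T

def w_mse_loss(v, d=2):
    """MSE over all N*d*(d-1)/2 positive pairs of whitened, L2-normalized features."""
    K, D = v.shape
    N = K // d
    z = F.normalize(whiten(v), dim=1)   # whitening, then projection onto the unit sphere
    views = z.view(d, N, D)             # views[i] = i-th augmented view of each image
    loss, pairs = 0.0, 0
    for i in range(d):
        for j in range(i + 1, d):
            loss = loss + (views[i] - views[j]).pow(2).sum(dim=1).mean()
            pairs += 1
    return loss / pairs                 # average over the d*(d-1)/2 view pairs
```

Since the z features are L2-normalized, the pairwise squared distance equals 2 − 2·cos(zi, zj), so minimizing this MSE is equivalent to maximizing the cosine similarity of positive pairs, consistent with dist() in Eq. 1.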

A schematic representation of the W-MSE based optimization process
  • Positive pairs are indicated with the same shapes and colors.
  • (1) A representation of the batch features in V when training starts.
  • (2, 3) The distribution of the elements after whitening and the L2 normalization.
  • (4) The MSE computed over the normalized z features encourages the network to move the positive pair representations closer to each other.
  • (5) The subsequent iterations move the positive pairs closer and closer, while the relative layout of the other samples is forced to lie in a spherical distribution.
  • (6) Final results.

The intuition behind the proposed loss is that it penalizes positives which are far apart from each other, thus leading g(E(·)) to shrink the inter-positive distances. On the other hand, since Z must lie in a spherical distribution, the other samples have to be “moved” and rearranged in order to satisfy the constraint in Eq. 1.

1.3. Batch Slicing

Batch slicing
  • V is first partitioned into d parts (d=2 in this example).
  • The first part is randomly permuted and the same permutation is applied to the other d-1 parts. Then, all the partitions are further split and sub-batches (Vi) are created. Each Vi is independently used to compute its own sub-batch-specific whitening matrix and centroid.
  • Finally, it is possible to repeat the whole operation several times and average the results to get a more robust estimate of L_W-MSE (see the sketch below).
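A sketch of this slicing procedure is shown below, reusing w_mse_loss (and thus whiten) from the previous snippet. The parameter names num_slices and num_iters are illustrative assumptions, not the paper's notation.

```python
# Sketch of batch slicing (illustrative; reuses w_mse_loss from the previous snippet).
# v: (K x D) batch of features with K = N*d, laid out as d stacked views.
import torch

def sliced_w_mse(v, d=2, num_slices=2, num_iters=1):
    K, D = v.shape
    N = K // d
    views = v.view(d, N, D)                        # partition V into d parts
    total = 0.0
    for _ in range(num_iters):                     # repeat and average for a more robust estimate
        perm = torch.randperm(N, device=v.device)  # one permutation, shared by all d parts
        permuted = views[:, perm, :]
        loss = 0.0
        # split into sub-batches V_i; each one gets its own whitening matrix and centroid
        for chunk in torch.chunk(permuted, num_slices, dim=1):
            flat = chunk.reshape(-1, D)            # back to a (d * |V_i|) x D batch
            loss = loss + w_mse_loss(flat, d=d)    # whitening is computed inside, per V_i
        total = total + loss / num_slices
    return total / num_iters
```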

2. Experimental Results

2.1. Comparison with SOTA

Classification accuracy (top 1) of a linear classifier and a 5-nearest neighbors classifier for different loss functions and datasets with a ResNet-18 encoder
  • For W-MSE, 4 samples are generally better than 2.
  • The contrastive loss performs the worst in most cases.

The W-MSE 4 accuracy is the best on CIFAR-10 and CIFAR-100, while BYOL leads on STL-10 and Tiny ImageNet, although the gap between the two methods is marginal.

Classification accuracy on ImageNet-100

Despite the large difference in encoder capacity with respect to the compared methods, both versions of W-MSE significantly outperform the other two compared methods on this dataset.

Classification accuracy (top 1) of a linear classifier on ImageNet with a ResNet-50 encoder
  • W-MSE 4 is the state of the art with 100 epochs, and it is very close to the 400-epoch state of the art.

These results confirm that the W-MSE method is highly competitive, considering that it has no intensively tuned hyperparameters and that the proposed network is much simpler than other approaches.

2.2. Contrastive Loss With Whitening

CIFAR-10: accuracy of the contrastive loss with whitened features, trained for 200 epochs

The whitening transform alone does not improve the SSL performance, and, used jointly with negative contrasting, it may be harmful.

W-MSE thus offers another direction for self-supervised learning.

Reference

[2021 ICML] [W-MSE]
Whitening for Self-Supervised Representation Learning

1.2. Unsupervised/Self-Supervised Learning

1993 … 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR] [W-MSE]

My Other Previous Paper Readings
