# Review — Whitening for Self-Supervised Representation Learning

## Whitening MSE (W-MSE), No Asymmetric Network, No Negative Samples

---

Whitening for Self-Supervised Representation Learning (W-MSE), by University of Trento

2021 ICML, Over 70 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning, Representation Learning, Unsupervised Learning

- A **new loss function, W-MSE**, is proposed, based on the **whitening of the latent-space features**. The whitening operation has a "scattering" effect on the batch samples, **avoiding degenerate solutions** where all the sample representations collapse to a single point.
- The proposed solution **does not require asymmetric networks** and **does not need negative samples**.

# Outline

1. **Whitening MSE (W-MSE)**
2. **Experimental Results**

# 1. Whitening MSE (W-MSE)

## 1.1. Preliminaries

- Given an **image** *x*, **an embedding** *z* = *f*(*x*; *θ*) is extracted using an **encoder network** *f*(·; *θ*). The network targets two goals: (1) **the image embeddings are drawn from a non-degenerate distribution** (i.e., avoiding the case where all the representations collapse to a single point), and (2) **positive image pairs** (*x_i*, *x_j*), which **share similar semantics**, should be clustered **close to each other**.
- The problem is formulated as (Eq. 1):

min_*θ* E[dist(*z_i*, *z_j*)], subject to cov(*z_i*, *z_i*) = cov(*z_j*, *z_j*) = *I*

- where **dist()** is a **distance between vectors, e.g., the cosine similarity**, *I* is the identity matrix, and (*z_i*, *z_j*) corresponds to **a positive pair of images** (*x_i*, *x_j*).

- First, **positive samples** sharing the same semantics are **obtained from a single image** *x* **using standard image transformation** techniques *T*(·; *p*), where the parameters *p* are selected uniformly at random.
- Thus, the augmented positive samples are *x_i* = *T*(*x*; *p_i*).
- The number of positive samples per image, *d*, may vary. In the proposed MSE-based loss, **all the possible** *d*(*d*−1)/2 combinations of positive samples are used, and both *d* = 2 (1 positive pair) and *d* = 4 (6 positive pairs) are tried.
- More precisely, **an encoder** *E* **is used for extracting the representation** *h* = *E*(*x*), a 512- or 2048-dimensional vector. **A nonlinear projection head** *g*(·) is used to **project** *h* into a lower-dimensional space. This is implemented with an MLP with one hidden layer and a BN layer.
- Thus, *f* is the composition of *E* and *g*.
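The composition *f* = *g* ∘ *E* above can be sketched as follows. This is only a shape-level sketch in NumPy: the random linear map `W_E` stands in for a real CNN encoder, and the sizes are illustrative (the paper's *h* is 512- or 2048-dimensional).

```python
import numpy as np

rng = np.random.default_rng(0)
IMG, REP, HID, OUT = 3 * 32 * 32, 512, 1024, 64   # toy sizes; REP matches the 512-d h

W_E = rng.normal(scale=0.01, size=(IMG, REP))     # stand-in for the encoder E (a CNN in practice)
W1 = rng.normal(scale=0.05, size=(REP, HID))      # projection head g: hidden layer
W2 = rng.normal(scale=0.05, size=(HID, OUT))      # projection head g: output layer

def batch_norm(h):
    # BN over the batch dimension (inference-style, no learned affine params)
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)

def f(x):
    """f = g(E(x)): encoder followed by an MLP head with one hidden layer and BN."""
    h = x.reshape(len(x), -1) @ W_E               # h = E(x), the high-dimensional representation
    z = np.maximum(batch_norm(h @ W1), 0.0) @ W2  # g: hidden layer + BN + ReLU, then project down
    return z

x = rng.normal(size=(8, 3, 32, 32))               # a toy batch of 8 "images"
print(f(x).shape)                                 # (8, 64): low-dimensional embeddings
```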

## 1.2. W-MSE

- Given *N* original images and **a batch of samples** *B* = {*x*_1, …, *x_K*}, where *K* = *Nd*, let *V* = {*v*_1, …, *v_K*} be the corresponding batch of features.
- In the proposed W-MSE loss, the **MSE is computed over all** *Nd*(*d*−1)/2 **positive pairs**, where the **constraint in Eq. 1 is satisfied using the reparameterization of the *v* variables with the whitened variables *z***:

*L_W-MSE* = 2/(*Nd*(*d*−1)) Σ dist(*z_i*, *z_j*)

- where the sum is over the positive pairs (*v_i*, *v_j*) ∈ *V*, with *z* = Whitening(*v*):

*z* = *W_V* (*v* − *μ_V*)

- where *μ_V* is the **mean of the elements in** *V*:

*μ_V* = (1/*K*) Σ_*i* *v_i*

- and the matrix *W_V* satisfies:

*W_V*^T *W_V* = *Σ_V*^(−1)

- *Σ_V* being the **covariance matrix of** *V*:

*Σ_V* = 1/(*K*−1) Σ_*i* (*v_i* − *μ_V*)(*v_i* − *μ_V*)^T

- Whitening(·) performs the full whitening of each *v_i* ∈ *V*, and the resulting set of vectors *Z* = {*z*_1, …, *z_K*} lies in a zero-centered distribution with a covariance matrix equal to the identity matrix.
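The whitening transform can be sketched in a few lines of NumPy. One standard way to obtain a *W_V* with *W_V*^T *W_V* = *Σ_V*^(−1) is through the Cholesky factorization *Σ_V* = *LL*^T, taking *W_V* = *L*^(−1); the toy data below is a minimal check, not the paper's setup.

```python
import numpy as np

def whiten(V):
    """Full whitening: returns Z with zero mean and identity covariance."""
    mu = V.mean(axis=0)                      # mu_V: mean of the batch features
    Vc = V - mu                              # center the features
    Sigma = Vc.T @ Vc / (len(V) - 1)         # Sigma_V: covariance matrix
    L = np.linalg.cholesky(Sigma)            # Sigma_V = L L^T (lower-triangular L)
    W = np.linalg.inv(L)                     # then W^T W = Sigma_V^{-1}
    return Vc @ W.T                          # z_i = W (v_i - mu_V)

rng = np.random.default_rng(0)
V = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 8))  # correlated toy features
Z = whiten(V)
cov_Z = Z.T @ Z / (len(Z) - 1)
print(np.allclose(cov_Z, np.eye(8), atol=1e-6))          # True: covariance is the identity
```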

- Positive pairs are indicated with the same shapes and colors. **(1)** A representation of the batch features in *V* when training starts. **(2, 3)** The distribution of the elements after whitening and L2 normalization. **(4)** The MSE computed over the normalized *z* features encourages the network to move the positive pair representations closer to each other. **(5)** The subsequent iterations **move the positive pairs closer and closer**, while the relative layout of the other samples is forced to lie in a spherical distribution. **(6)** The final result.

- The **intuition behind** the proposed loss: it penalizes positives which are far apart from each other, thus leading *g*(*E*(·)) to shrink the inter-positive distances. On the other hand, since *Z* must lie in a spherical distribution, the **other samples are "moved" and rearranged** in order to satisfy the constraint of Eq. 1.

## 1.3. Batch Slicing

is first*V***partitioned in**(*d*partsin this example).*d*=2- The first part is
**randomly permuted**and the same permutation is applied to the other*d*-1 parts. Then, all the partitions are**further split**and**sub-batches (**. Each*Vi*) are created*Vi*is independently used to compute the sub-batch specific**whitening matrix**nd*WiV*a**centroid**.*μiV* - Finally, it is possible to
**repeat the whole operation several times**and to**average the result**to get a**more robust estimate of**.*L_W-MSE*
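The whole procedure, batch slicing included, can be sketched as below. This is a minimal NumPy sketch under simplifying assumptions (*d* = 2 views stacked in an array, Cholesky whitening, MSE of L2-normalized vectors as the distance); the `sub_batches` and `repeats` parameter names are illustrative, not the paper's.

```python
import numpy as np

def whiten(V):
    """Cholesky whitening: W^T W = Sigma^{-1}, z = W (v - mu)."""
    mu = V.mean(axis=0)
    Vc = V - mu
    Sigma = Vc.T @ Vc / (len(V) - 1)
    W = np.linalg.inv(np.linalg.cholesky(Sigma))
    return Vc @ W.T

def w_mse_loss(V, sub_batches=2, repeats=2, rng=None):
    """Sketch of W-MSE with batch slicing.

    V has shape (d, N, dim): V[k, n] is the k-th augmented view of image n."""
    d, N, dim = V.shape
    rng = rng or np.random.default_rng(0)
    pair_losses = []
    for _ in range(repeats):                        # repeat and average for robustness
        perm = rng.permutation(N)                   # same permutation for all d parts
        for chunk in np.array_split(perm, sub_batches):
            Vi = V[:, chunk].reshape(-1, dim)       # sub-batch V_i (all views together)
            Z = whiten(Vi)                          # sub-batch-specific W_{V_i}, mu_{V_i}
            Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # L2 normalization
            Z = Z.reshape(d, len(chunk), dim)
            for i in range(d):                      # MSE over all d(d-1)/2 positive pairs
                for j in range(i + 1, d):
                    pair_losses.append(np.mean(np.sum((Z[i] - Z[j]) ** 2, axis=1)))
    return float(np.mean(pair_losses))

rng = np.random.default_rng(1)
base = rng.normal(size=(128, 8))                    # 128 toy "images", 8-d features
V = np.stack([base + 0.1 * rng.normal(size=base.shape) for _ in range(2)])  # d = 2 views
loss = w_mse_loss(V, rng=rng)
print(loss > 0.0)                                   # True
```

Each sub-batch is whitened with its own statistics, so every slice contributes an independent estimate of the constraint in Eq. 1.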

# 2. Experimental Results

## 2.1. Comparison with SOTA

- For W-MSE, 4 samples are generally better than 2.
- The contrastive loss performs the worst in most cases.

The W-MSE 4 accuracy is the best on CIFAR-10 and CIFAR-100, while BYOL leads on STL-10 and Tiny ImageNet, although the gap between the two methods is marginal.

Despite this large difference in the encoder capacity, **both versions of W-MSE significantly outperform the other two compared methods** on this dataset.

- W-MSE 4 is the state of the art with 100 epochs and is very close to the 400-epoch state of the art.

These results confirm that the **W-MSE method is highly competitive**, considering that it has **no intensively tuned hyperparameters** and that the proposed network is **much simpler** than other approaches.

## 2.2. Contrastive Loss With Whitening

The whitening transform alone **does not improve the SSL performance**, and, used jointly with negative contrasting, it may be **harmful**.

W-MSE thus offers another direction for self-supervised learning.

## Reference

[2021 ICML] [W-MSE] Whitening for Self-Supervised Representation Learning
