# Brief Review — SeLa: Self-Labelling via Simultaneous Clustering and Representation Learning

## SeLa, Using Sinkhorn Distance for Self-Supervised Learning

---

**Self-Labelling via Simultaneous Clustering and Representation Learning** (SeLa), by University of Oxford, 2020 ICLR, Over 570 Citations (Sik-Ho Tsang @ Medium)


- The method is obtained by **maximizing the information between labels and input data indices**.
- This criterion **extends standard cross-entropy minimization to an optimal transport problem**, which can be solved efficiently for **millions of input images** and **thousands of labels** using Sinkhorn Distance, a fast variant of the **Sinkhorn-Knopp** algorithm.

# Outline

1. Preliminaries & Supervised Learning
2. Self-Labelling (SeLa)
3. Results

# 1. Preliminaries & Supervised Learning

- Formally, consider a deep neural network *x* = *Φ*(*I*) mapping data *I* (e.g. images) to feature vectors *x*. Given *N* data points *I*1, …, *I*N, the **labels** *y*1, …, *y*N ∈ {1, …, *K*} are drawn from a space of *K* possible labels. The representation is followed by a **classification head** *h*.

- The model and head parameters are learned by
**minimizing the average cross-entropy loss:**
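Written out in the notation above (the equation image is not reproduced here), the average cross-entropy loss is:

$$E(p \mid y_1, \ldots, y_N) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i)$$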

When labels are unavailable, we require a self-labelling mechanism to assign the labels automatically. In semi-supervised learning, self-labelling can work if at least part of the labels are known. However, in the fully unsupervised case, it leads to a degenerate solution (e.g. assigning all data points to a single label).

# 2. Self-Labelling (SeLa)

## 2.1. Sinkhorn Distance

To address this issue, the above CE loss is rewritten by encoding the labels as posterior distributions *q*(*y*|*xi*):
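Following the paper's notation, the rewritten loss reads:

$$E(p, q) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{y=1}^{K} q(y \mid x_i) \log p(y \mid x_i)$$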

**A constraint is added** so that the **label assignments must partition the data in equally-sized subsets** to **avoid degeneracy**. Formally:
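In symbols, the equal-partition constraint from the paper is:

$$q(y \mid x_i) \in \{0, 1\} \quad \forall y, i, \qquad \sum_{i=1}^{N} q(y \mid x_i) = \frac{N}{K} \quad \forall y$$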

- The constraints mean that **each data point** *xi* is assigned to exactly one label and that, overall, the *N* data points are split uniformly among the *K* classes.

This is an instance of the **optimal transport problem**, which can be solved relatively efficiently.

- In order to see this more clearly, let *Pyi* = *p*(*y*|*xi*) × 1/*N* be the *K*×*N* matrix of joint probabilities estimated by the model. Likewise, let *Qyi* = *q*(*y*|*xi*) × 1/*N* be the *K*×*N* matrix of assigned joint probabilities.

Using the notation in Sinkhorn Distance, matrix *Q* is relaxed to be an element of the transportation polytope:
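In the Sinkhorn Distance notation, this polytope is:

$$U(r, c) := \left\{ Q \in \mathbb{R}_{+}^{K \times N} \;\middle|\; Q \mathbf{1}_N = r,\ Q^{\top} \mathbf{1}_K = c \right\}$$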

- where 1 are vectors of all ones, so that *r* and *c* are the marginal projections of matrix *Q* onto its rows and columns, respectively. *Q* is required to be **a matrix of conditional probability distributions** that **split the data uniformly**, which is captured by:
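Uniform splitting corresponds to uniform marginals:

$$r = \frac{1}{K} \cdot \mathbf{1}_K, \qquad c = \frac{1}{N} \cdot \mathbf{1}_N$$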

**The objective function in eq. (3) is rewritten**, up to a **constant shift**, as:
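Following the paper's derivation, the equivalence is:

$$E(p, q) + \log N = \langle Q, -\log P \rangle, \qquad \text{minimized over } Q \in U(r, c)$$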

- where **⟨·, ·⟩** is the **Frobenius dot-product** between two matrices and log is applied element-wise.

- This becomes a linear program. In practice, however, the resulting
**linear program is large**, involving**millions of data points**and**thousands of classes**.

This issue is solved by adopting a fast version of the Sinkhorn-Knopp algorithm in Sinkhorn Distance. This amounts to introducing a **regularization term**:
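In the paper's notation, the regularized transport objective is:

$$\min_{Q \in U(r, c)} \langle Q, -\log P \rangle + \frac{1}{\lambda} \operatorname{KL}\!\left(Q \,\middle\|\, r c^{\top}\right)$$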

- where **KL** is the **Kullback-Leibler divergence** and *rc*^T can be interpreted as a *K*×*N* probability matrix.
- The advantage of this regularization term is that the **minimizer of the above equation can be written** as:
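In symbols, with exponentiation applied element-wise:

$$Q = \operatorname{diag}(\alpha)\, P^{\lambda}\, \operatorname{diag}(\beta)$$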

- The vectors *α* and *β* can be obtained, as shown below, via **a simple matrix scaling iteration**. Choosing *λ* trades off convergence speed with closeness to the original transport problem.

## 2.2. Iterative Processing

Step 1: representation learning. Given the current label assignments *Q*, the model is updated by minimizing eq. (6) with respect to (the parameters of) *h*∘*Φ*. This is the same as training the model using the common cross-entropy loss for classification.

Step 2: self-labelling. Given the current model *h*∘*Φ*, the log probabilities *P* are computed. Then, *Q* is found using eq. (8) by iterating the updates from Sinkhorn Distance:
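In standard Sinkhorn-Knopp scaling form (element-wise division; the paper's printed per-entry updates are equivalent):

$$\alpha \leftarrow \frac{r}{P^{\lambda} \beta}, \qquad \beta \leftarrow \frac{c}{(P^{\lambda})^{\top} \alpha}$$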

- In practice, **convergence of finding *Q*** is reached **within 2 minutes on ImageNet** when computed on a GPU.

# 3. Results

## 3.1. Ablation Studies

- “**SeLa [*K*×*T*]**” is SeLa with *K* the **number of clusters** and *T* the **number of clustering heads**.

Table 2: Moving *K* from 1k to 3k improves the results, but larger numbers decrease the quality slightly.

Table 3: Increasing the number of heads from *T* = 1 to *T* = 10 yields a large performance gain: +2% for AlexNet and +10% for ResNet.

Table 1: The number of times the self-labelling algorithm (step 2) is run during training (#opts) is varied, from zero to once per step-1 epoch. Self-labelling is essential, with the best value around 80 (for 160 step-1 epochs in total).

Table 4: SeLa works well across all architectures.

Table 5: Given the labels assigned by applying SeLa to AlexNet, we can re-train AlexNet from scratch using a shorter 90-epoch schedule while achieving the same final accuracy.

## 3.2. Small Datasets

- SeLa is used with the settings [128×10] for CIFAR-10, [512×10] for CIFAR-100 and [128×1] for SVHN.

SeLa outperforms the best previous method by 5.8% for CIFAR-10, by 9.5% for CIFAR-100 and by 0.8% for SVHN.

## 3.3. Large Datasets

Across both datasets, SeLa outperforms DeepCluster and Local Aggregation at every layer.

SeLa with RotNet gets even better results.

With a top-1 accuracy of 61.5%, SeLa outperforms all other methods that use the same level of data augmentation, including Local Aggregation, CPC v1 and MoCo. SeLa even outperforms larger architectures such as BigBiGAN’s RevNet-50×4 and reaches close to the performance of models using AutoAugment-style transformations.

## 3.4. Fine-Tuning on **PASCAL VOC**

As in the linear probe experiments, SeLa is better than the current state-of-the-art in detection and classification, both when fine-tuning only the last fully connected layers and when fine-tuning the whole network. SeLa not only learns useful feature representations but also performs well when fine-tuned on actual down-stream tasks.