Brief Review — SeLa: Self-Labelling via Simultaneous Clustering and Representation Learning
SeLa, Using Sinkhorn Distance for Self-Supervised Learning
Self-Labelling via Simultaneous Clustering and Representation Learning
SeLa, by University of Oxford
2020 ICLR, Over 570 Citations (Sik-Ho Tsang @ Medium)
- The method is obtained by maximizing the information between labels and input data indices.
- This criterion extends standard cross-entropy minimization to an optimal transport problem, which can be solved efficiently for millions of input images and thousands of labels using a fast variant of the Sinkhorn-Knopp algorithm (as in Sinkhorn Distance).
- The authors also have a blog post that shows their results interactively: https://www.robots.ox.ac.uk/~vgg/blog/self-labelling-via-simultaneous-clustering-and-representation-learning.html
Outline
- Preliminaries & Supervised Learning
- Self-Labelling (SeLa)
- Results
1. Preliminaries & Supervised Learning
- Formally, consider a deep neural network x = Φ(I) mapping data I (e.g. images) to feature vectors x. The dataset consists of N data points I1, …, IN with corresponding labels y1, …, yN ∈ {1, …, K}, drawn from a space of K possible labels. The representation is followed by a classification head h, which converts the feature vector into class probabilities via the softmax operator: p(y|xi) = softmax(h(Φ(Ii))).
- The model and head parameters are learned by minimizing the average cross-entropy loss E(p | y1, …, yN) = −(1/N) Σi log p(yi | xi).
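As a concrete illustration of this supervised baseline, here is a minimal PyTorch-style sketch; the module sizes, optimizer, and all names are illustrative placeholders rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Feature extractor x = phi(I) followed by a linear classification head h.
# Input size, feature dimension, and K are illustrative placeholders.
phi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(128, 10)  # h: R^128 -> R^K, with K = 10 here
opt = torch.optim.SGD(list(phi.parameters()) + list(head.parameters()), lr=0.1)

def supervised_step(images, labels):
    """One SGD step on the average cross-entropy loss -1/N * sum_i log p(y_i | x_i)."""
    logits = head(phi(images))              # h(phi(I)): one row of K class scores per image
    loss = F.cross_entropy(logits, labels)  # softmax + negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```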
When labels are unavailable, we require a self-labelling mechanism to assign the labels automatically. In semi-supervised learning, self-labelling can work if at least some of the labels are known. However, in the fully unsupervised case, it leads to a degenerate solution: the loss is trivially minimized by assigning all data points to a single (arbitrary) label.
2. Self-Labelling (SeLa)
2.1. Sinkhorn Distance
To address this issue, the above CE loss is rewritten by encoding the labels as posterior distributions q(y|xi), giving an objective E(p, q) (reconstructed below).
- A constraint is added so that the label assignments must partition the data into equally-sized subsets, which avoids the degenerate solution. Formally, the constraints are included in the reconstruction below.
- The constraints mean that each data point xi is assigned to exactly one label and that, overall, the N data points are split uniformly among the K classes.
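For completeness, here is a reconstruction of that constrained objective in the paper's notation (this is the problem referred to as eq. (3) below):

```latex
% Cross-entropy with label posteriors q(y|x_i), subject to an equipartition constraint
\min_{p,\,q}\; E(p, q) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{K} q(y \mid x_i)\,\log p(y \mid x_i),
\qquad \text{s.t.}\quad
q(y \mid x_i) \in \{0, 1\},\quad
\sum_{i=1}^{N} q(y \mid x_i) = \frac{N}{K}\;\; \forall y .
```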
This is an instance of the optimal transport problem, which can be solved relatively efficiently.
- In order to see this more clearly, let P be the K×N matrix of joint probabilities estimated by the model, with entries Pyi = p(y|xi) × (1/N). Likewise, let Q be the K×N matrix of assigned joint probabilities, with entries Qyi = q(y|xi) × (1/N).
Using the notation in Sinkhorn Distance, the matrix Q is relaxed to be an element of the transportation polytope U(r, c), written out below.
- Here 1 denotes a vector of all ones (of the appropriate dimension), so that r and c are the marginal projections of matrix Q onto its rows and columns, respectively.
- Q is required to be a matrix of conditional probability distributions that split the data uniformly, which is captured by the choice of r and c shown below.
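In LaTeX, a reconstruction of the transportation polytope and the uniform marginals SeLa uses, following the paper's notation (1_K and 1_N denote all-ones vectors of length K and N):

```latex
% Transportation polytope (relaxed label matrices) and the uniform marginals
U(r, c) := \left\{ Q \in \mathbb{R}_{+}^{K \times N} \;\middle|\; Q \mathbf{1}_N = r,\;\; Q^{\top} \mathbf{1}_K = c \right\},
\qquad
r = \frac{1}{K}\,\mathbf{1}_K,
\qquad
c = \frac{1}{N}\,\mathbf{1}_N .
```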
- The objective function in eq. (3) is then rewritten, up to a constant shift, as E(p, q) + log N = <Q, −log P>, to be minimized over Q ∈ U(r, c).
- where <·, ·> is the Frobenius dot-product between two matrices and log is applied element-wise.
- This becomes a linear program. In practice, however, the resulting linear program is large, involving millions of data points and thousands of classes.
This issue is solved by adopting a fast version of the Sinkhorn-Knopp algorithm, as in Sinkhorn Distance. This amounts to introducing an entropic regularization term, so the problem becomes minimizing <Q, −log P> + (1/λ) KL(Q ‖ rc^T) over Q ∈ U(r, c).
- where KL is the Kullback-Leibler divergence and rc^T can be interpreted as a K×N probability matrix.
- The advantage of this regularization term is that the minimizer of the above objective can be written explicitly as Q = diag(α) P^λ diag(β), where the exponentiation is element-wise and α, β are scaling vectors chosen so that Q lies in the polytope.
- The vectors α and β can be obtained via a simple matrix scaling iteration, as shown in the sketch below. Choosing λ trades off convergence speed against closeness to the original transport problem.
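A minimal NumPy sketch of this scaling iteration, assuming P holds the (non-negative) probabilities defined above; λ, the iteration count, and the lack of log-space arithmetic are simplifications rather than the authors' exact implementation:

```python
import numpy as np

def sinkhorn_knopp(P, lam=25.0, n_iters=100):
    """Scale the K x N matrix P onto the transportation polytope U(r, c).

    Returns Q = diag(alpha) * P**lam * diag(beta), whose rows sum to 1/K and
    whose columns sum to 1/N, via the matrix-scaling iteration described above.
    lam and n_iters are illustrative defaults; production code typically works
    in log space (or double precision) for numerical stability.
    """
    K, N = P.shape
    r = np.full(K, 1.0 / K)              # target row marginals: equally-sized clusters
    c = np.full(N, 1.0 / N)              # target column marginals: one unit of mass per image
    P_lam = P.astype(np.float64) ** lam  # element-wise exponentiation by lambda
    beta = np.ones(N)
    for _ in range(n_iters):
        alpha = r / (P_lam @ beta)       # rescale rows to match r
        beta = c / (P_lam.T @ alpha)     # rescale columns to match c
    return alpha[:, None] * P_lam * beta[None, :]

# Hard pseudo-labels then come from the argmax over clusters:
# labels = sinkhorn_knopp(P).argmax(axis=0)
```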
2.2. Iterative Processing
Step 1: representation learning. Given the current label assignments Q, the model is updated by minimizing eq. (6) with respect to (the parameters of) h○Φ. This is the same as training the model using the common cross-entropy loss for classification.
Step 2: self-labelling. Given the current model h○Φ, the matrix of probabilities P (and hence log P) is computed. Then Q is found using eq. (8) by iterating the Sinkhorn-Knopp matrix-scaling updates of α and β, as in the sketch in Section 2.1 above; the full alternation is sketched after the next bullet.
- In practice, computing Q converges within 2 minutes on ImageNet when run on a GPU.
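The alternation as a whole might look like the following PyTorch-style sketch; the schedule, hyper-parameters, function names, and the assumption of a non-shuffled data loader are all illustrative simplifications rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def train_sela(model, head, loader, epochs=160, n_label_updates=80, lam=25.0, sk_iters=100):
    """Sketch of the SeLa alternation: cross-entropy training (step 1) interleaved with
    Sinkhorn-Knopp self-labelling (step 2). Assumes `loader` iterates the dataset in a
    fixed order so that batch positions line up with the pseudo-label vector."""
    opt = torch.optim.SGD(list(model.parameters()) + list(head.parameters()), lr=0.05)
    label_epochs = set(torch.linspace(0, epochs - 1, n_label_updates).long().tolist())
    pseudo_labels = None

    for epoch in range(epochs):
        if pseudo_labels is None or epoch in label_epochs:
            # Step 2: recompute Q from the current predictions on the whole dataset.
            with torch.no_grad():
                probs = torch.cat([F.softmax(head(model(x)), dim=1) for x, _ in loader])
                P = probs.t().double().clamp_min(1e-12)  # K x N matrix of p(y | x_i)
                K, N = P.shape                           # (the 1/N factor is absorbed by the scaling vectors)
                P_lam = P ** lam
                beta = torch.ones(N, dtype=torch.float64)
                for _ in range(sk_iters):                # matrix scaling, as in Section 2.1
                    alpha = (1.0 / K) / (P_lam @ beta)
                    beta = (1.0 / N) / (P_lam.t() @ alpha)
                Q = alpha[:, None] * P_lam * beta[None, :]
                pseudo_labels = Q.argmax(dim=0)          # one cluster label per image

        # Step 1: ordinary cross-entropy training against the current pseudo-labels.
        for b, (x, _) in enumerate(loader):
            idx = b * loader.batch_size + torch.arange(x.shape[0])
            loss = F.cross_entropy(head(model(x)), pseudo_labels[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
```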
3. Results
3.1. Ablation Studies
- “SeLa [K×T]” denotes SeLa with K clusters and T clustering heads.
Table 2: Moving from 1k to 3k clusters improves the results, but larger numbers decrease the quality slightly.
Table 3: Increasing the number of heads from T = 1 to T = 10 yields a large performance gain: +2% for AlexNet and +10% for ResNet.
Table 1: The number of times the self-labelling algorithm (step 2) is run during training (#opts) is varied, from zero to once per step-1 epoch. Self-labelling is essential, and the best value is around 80 (for 160 step-1 epochs in total).
Table 4: SeLa works well across all architectures.
Table 5: Given the labels assigned by applying SeLa to AlexNet, we can re-train AlexNet from scratch using a shorter 90-epoch schedule and achieve the same final accuracy.
3.2. Small Datasets
- SeLa is used with the settings [128×10] for CIFAR-10, [512×10] for CIFAR-100 and [128×1] for SVHN.
SeLa outperforms the best previous method by 5.8% for CIFAR-10, by 9.5% for CIFAR-100 and by 0.8% for SVHN.
3.3. Large Datasets
Across both datasets, SeLa outperforms DeepCluster and Local Aggregation at every layer.
SeLa combined with RotNet obtains even better results.
With a top-1 accuracy of 61.5%, SeLa outperforms all other methods that use the same level of data augmentation, including Local Aggregation, CPCv1 and MoCo.
SeLa even outperforms larger architectures such as BigBiGAN’s RevNet-50×4 and comes close to the performance of models using AutoAugment-style transformations.
3.4. Fine-Tuning on PASCAL VOC
As in the linear probe experiments, SeLa is better than the current state of the art in detection and classification, both when fine-tuning only the last fully connected layers and when fine-tuning the whole network.
SeLa not only learns useful feature representations but also performs well when fine-tuned on actual downstream tasks.