Brief Review — SeLa: Self-Labelling via Simultaneous Clustering and Representation Learning

SeLa, Using Sinkhorn Distance for Self-Supervised Learning

Sik-Ho Tsang
6 min readAug 23


SeLa by VGG (Figure from VGG Blog)

Self-Labelling via Simultaneous Clustering and Representation Learning
SeLa, by University of Oxford
2020 ICLR, Over 570 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning
19932022 [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM] [LDBM]


  1. Preliminaries & Supervised Learning
  2. Self-Labelling (SeLa)
  3. Results

1. Preliminaries & Supervised Learning

  • Formally, consider a deep neural network x = Φ(I) mapping data I (e.g. images) to feature vectors x. The dataset has N number of data points I1, …, IN, with corresponding labels y1, …, yN ∈ {1, …, K}, drawn from a space of K possible labels. The representation is followed by a classification head h:
  • The model and head parameters are learned by minimizing the average cross-entropy loss:

When labels are unavailable, we require a self-labelling mechanism to assign the labels automatically. In semi-supervised learning, self-labelling can work if at least part of the labels are known. However, in the fully unsupervised case, it leads to a degenerate solution.

2. Self-Labelling (SeLa)

2.1. Sinkhorn Distance

To address this issue, the above CE loss is rewritten by encoding the labels as posterior distributions q(y|xi):

  • Constraint is added so that the label assignments must partition the data in equally-sized subsets to avoid degeneracy. Formally:
  • The constraints mean that each data point xi is assigned to exactly one label and that, overall, the N data points are split uniformly among the K classes.

This is an instance of the optimal transport problem, which can be solved relatively efficiently.

  • In order to see this more clearly, let Pyi=p(y|xi)×1/N be the K×N matrix of joint probabilities estimated by the model. Likewise, let Qyi=q(y|xi)×1/N be K×N matrix of assigned joint probabilities.

Using the notation in Sinkhorn Distance, matrix Q is relaxed to be an element of the transportation polytope:

  • where 1 are vectors of all ones, so that r and c are the marginal projections of matrix Q onto its rows and columns, respectively.
  • Q is required to be a matrix of conditional probability distributions that split the data uniformly, which is captured by:
  • The objective function in eq. (3) is rewritten, up to a constant shift:
  • where <> is the Frobenius dot-product between two matrices and log is applied element-wise.
  • This becomes a linear program. In practice, however, the resulting linear program is large, involving millions of data points and thousands of classes.

This issue is solved by adopting a fast version of the Sinkhorn-Knopp algorithm in Sinkhorn Distance. This amounts to introducing a regularization term:

  • where KL is the Kullback-Leibler divergence and rc^T can be interpreted as a K×N probability matrix.
  • The advantage of this regularization term is that the minimizer of the above equation is written as:
  • The vectors α and β can be obtained, as shown below, via a simple matrix scaling iteration. Choosing λ trades off convergence speed with closeness.

2.2. Iterative Processing

Step 1: representation learning. Given the current label assignments Q, the model is updated by minimizing eq. (6) with respect to (the parameters of) hΦ. This is the same as training the model using the common cross-entropy loss for classification.

Step 2: self-labelling. Given the current model hΦ, the log probabilities P is computed. Then, find Q using eq. (8) by iterating the updates using Sinkhorn Distance:

  • In practice, convergence of finding Q is reached within 2 minutes on ImageNet when computed on a GPU.

3. Results

3.1. Ablation Studies

Ablation Studies
  • SeLa[K×T]” is SeLa with the number of clusters K and the number of clustering heads T.

Table 2: Moving from 1k to 3k improves the results, but larger numbers decrease the quality slightly.

Table 3: Number of heads from T = 1 to T = 10 yields a large performance gain: +2% for AlexNet and +10% for ResNet.

Table 1: The number of times the self-labelling algorithm (step 2) is varied to run during training (#opts), from zero to once per step 1 epoch. Self-labelling is essential, with the best value around 80 (for 160 step 1 epochs in total).

Table 4: SeLa works well in all architectures.

Table 5: Given the labels assigned by applying SeLa to AlexNet, we can re-train AlexNet from scratch using a shorter 90-epochs schedule with achieving the same final accuracy.

3.2. Small Datasets

AlexNet on Small datasets
  • SeLa is used with the settings [128×10] for CIFAR-10, [512×10] for CIFAR-100 and [128×1] for SVHN.

SeLa outperforms the best previous method by 5.8% for CIFAR-10, by 9.5% for CIFAR-100 and by 0.8% for SVHN.

3.3. Large Datasets

AlexNet on Large Datasets

Across both datasets, SeLa outperforms DeepCluster and Local Aggregation at every layer.

SeLa with RotNet get even better results.

ResNet on Large Datasets

With top-1 accuracy of 61.5%, SeLa outperforms than all other methods including Local Aggregation, CPCv1 and MoCo that use the same level of data augmentation.

SeLa even outperforms larger architectures such as BigBiGAN’s RevNet-50×4 and reach close to the performance of models using AutoAugment-style transformations.

3.4. Fine-Tuning on PASCAL VOC


As in the linear probe experiments, SeLa is better than the current state-of-the-art in detection and classification with both fine-tuning only the last fully connected layers and when fine-tuning the whole network.

SeLa does not only learn useful feature representations but is also able to perform well when fine-tuned on actual down-stream tasks.



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.