Brief Review — SeLa: Self-Labelling via Simultaneous Clustering and Representation Learning
SeLa, Using Sinkhorn Distance for Self-Supervised Learning
Self-Labelling via Simultaneous Clustering and Representation Learning
SeLa, by University of Oxford
2020 ICLR, Over 570 Citations (Sik-Ho Tsang @ Medium)
- The method is obtained by maximizing the information between labels and input data indices.
- This criterion extends standard cross-entropy minimization to an optimal transport problem, which can be solved efficiently for millions of input images and thousands of labels using a fast variant of the Sinkhorn-Knopp algorithm (as in Sinkhorn Distance).
- The authors also have a blog post that shows their results interactively: https://www.robots.ox.ac.uk/~vgg/blog/self-labelling-via-simultaneous-clustering-and-representation-learning.html
Outline
- Preliminaries & Supervised Learning
- Self-Labelling (SeLa)
- Results
1. Preliminaries & Supervised Learning
- Formally, consider a deep neural network x = Φ(I) mapping data I (e.g. images) to feature vectors x. The dataset consists of N data points I1, …, IN with corresponding labels y1, …, yN ∈ {1, …, K}, drawn from a space of K possible labels. The representation is followed by a classification head h, which converts the feature vector into class probabilities via the softmax operator: p(y|xi) = softmax(h(Φ(Ii))).
- The model and head parameters are learned by minimizing the average cross-entropy loss E(p | y1, …, yN) = −(1/N) Σi log p(yi | xi).
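As a concrete illustration of this supervised baseline, here is a minimal PyTorch-style sketch; the module sizes, optimizer, and all names are illustrative placeholders rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Feature extractor x = phi(I) followed by a linear classification head h.
# Input size, feature dimension, and K are illustrative placeholders.
phi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(128, 10)  # h: R^128 -> R^K, with K = 10 here
opt = torch.optim.SGD(list(phi.parameters()) + list(head.parameters()), lr=0.1)

def supervised_step(images, labels):
    """One SGD step on the average cross-entropy loss -1/N * sum_i log p(y_i | x_i)."""
    logits = head(phi(images))              # h(phi(I)): one row of K class scores per image
    loss = F.cross_entropy(logits, labels)  # softmax + negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```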
When labels are unavailable, we require a self-labelling mechanism to assign the labels automatically. In semi-supervised learning, self-labelling can work if at least some of the labels are known. However, in the fully unsupervised case, it leads to a degenerate solution: the loss is trivially minimized by assigning all data points to a single (arbitrary) label.
2. Self-Labelling (SeLa)
2.1. Sinkhorn Distance
To address this issue, the above CE loss is rewritten by encoding the labels as posterior distributions q(y|xi), giving an objective E(p, q) (reconstructed below).
- A constraint is added so that the label assignments must partition the data into equally-sized subsets, which avoids the degenerate solution. Formally, the constraints are included in the reconstruction below.
- The constraints mean that each data point xi is assigned to exactly one label and that, overall, the N data points are split uniformly among the K classes.
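For completeness, here is a reconstruction of that constrained objective in the paper's notation (this is the problem referred to as eq. (3) below):

```latex
% Cross-entropy with label posteriors q(y|x_i), subject to an equipartition constraint
\min_{p,\,q}\; E(p, q) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{K} q(y \mid x_i)\,\log p(y \mid x_i),
\qquad \text{s.t.}\quad
q(y \mid x_i) \in \{0, 1\},\quad
\sum_{i=1}^{N} q(y \mid x_i) = \frac{N}{K}\;\; \forall y .
```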
This is an instance of the optimal transport problem, which can be solved relatively efficiently.
- In order to see this more clearly, let P be the K×N matrix of joint probabilities estimated by the model, with entries Pyi = p(y|xi) × (1/N). Likewise, let Q be the K×N matrix of assigned joint probabilities, with entries Qyi = q(y|xi) × (1/N).
Using the notation in Sinkhorn Distance, the matrix Q is relaxed to be an element of the transportation polytope U(r, c), written out below.
- Here 1 denotes a vector of all ones (of the appropriate dimension), so that r and c are the marginal projections of matrix Q onto its rows and columns, respectively.
- Q is required to be a matrix of conditional probability distributions that split the data uniformly, which is captured by the choice of r and c shown below.
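In LaTeX, a reconstruction of the transportation polytope and the uniform marginals SeLa uses, following the paper's notation (1_K and 1_N denote all-ones vectors of length K and N):

```latex
% Transportation polytope (relaxed label matrices) and the uniform marginals
U(r, c) := \left\{ Q \in \mathbb{R}_{+}^{K \times N} \;\middle|\; Q \mathbf{1}_N = r,\;\; Q^{\top} \mathbf{1}_K = c \right\},
\qquad
r = \frac{1}{K}\,\mathbf{1}_K,
\qquad
c = \frac{1}{N}\,\mathbf{1}_N .
```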
- The objective function in eq. (3) is then rewritten, up to a constant shift, as E(p, q) + log N = <Q, −log P>, to be minimized over Q ∈ U(r, c).
- where <·, ·> is the Frobenius dot-product between two matrices and log is applied element-wise.
- This becomes a linear program. In practice, however, the resulting linear program is large, involving millions of data points and thousands of classes.
This issue is solved by adopting a fast version of the Sinkhorn-Knopp algorithm, as in Sinkhorn Distance. This amounts to introducing an entropic regularization term, so the problem becomes minimizing <Q, −log P> + (1/λ) KL(Q ‖ rc^T) over Q ∈ U(r, c).
- where KL is the Kullback-Leibler divergence and rc^T can be interpreted as a K×N probability matrix.
- The advantage of this regularization term is that the minimizer of the above objective can be written explicitly as Q = diag(α) P^λ diag(β), where the exponentiation is element-wise and α, β are scaling vectors chosen so that Q lies in the polytope.
- The vectors α and β can be obtained via a simple matrix scaling iteration, as shown in the sketch below. Choosing λ trades off convergence speed against closeness to the original transport problem.
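A minimal NumPy sketch of this scaling iteration, assuming P holds the (non-negative) probabilities defined above; λ, the iteration count, and the lack of log-space arithmetic are simplifications rather than the authors' exact implementation:

```python
import numpy as np

def sinkhorn_knopp(P, lam=25.0, n_iters=100):
    """Scale the K x N matrix P onto the transportation polytope U(r, c).

    Returns Q = diag(alpha) * P**lam * diag(beta), whose rows sum to 1/K and
    whose columns sum to 1/N, via the matrix-scaling iteration described above.
    lam and n_iters are illustrative defaults; production code typically works
    in log space (or double precision) for numerical stability.
    """
    K, N = P.shape
    r = np.full(K, 1.0 / K)              # target row marginals: equally-sized clusters
    c = np.full(N, 1.0 / N)              # target column marginals: one unit of mass per image
    P_lam = P.astype(np.float64) ** lam  # element-wise exponentiation by lambda
    beta = np.ones(N)
    for _ in range(n_iters):
        alpha = r / (P_lam @ beta)       # rescale rows to match r
        beta = c / (P_lam.T @ alpha)     # rescale columns to match c
    return alpha[:, None] * P_lam * beta[None, :]

# Hard pseudo-labels then come from the argmax over clusters:
# labels = sinkhorn_knopp(P).argmax(axis=0)
```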
2.2. Iterative Processing
Step 1: representation learning. Given the current label assignments Q, the model is updated by minimizing eq. (6) with respect to (the parameters of) h○Φ. This is the same as training the model using the common cross-entropy loss for classification.
Step 2: self-labelling. Given the current model h○Φ, the matrix of probabilities P (and hence log P) is computed. Then Q is found using eq. (8) by iterating the Sinkhorn-Knopp matrix-scaling updates of α and β, as in the sketch in Section 2.1 above; the full alternation is sketched after the next bullet.
- In practice, computing Q converges within 2 minutes on ImageNet when run on a GPU.
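The alternation as a whole might look like the following PyTorch-style sketch; the schedule, hyper-parameters, function names, and the assumption of a non-shuffled data loader are all illustrative simplifications rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def train_sela(model, head, loader, epochs=160, n_label_updates=80, lam=25.0, sk_iters=100):
    """Sketch of the SeLa alternation: cross-entropy training (step 1) interleaved with
    Sinkhorn-Knopp self-labelling (step 2). Assumes `loader` iterates the dataset in a
    fixed order so that batch positions line up with the pseudo-label vector."""
    opt = torch.optim.SGD(list(model.parameters()) + list(head.parameters()), lr=0.05)
    label_epochs = set(torch.linspace(0, epochs - 1, n_label_updates).long().tolist())
    pseudo_labels = None

    for epoch in range(epochs):
        if pseudo_labels is None or epoch in label_epochs:
            # Step 2: recompute Q from the current predictions on the whole dataset.
            with torch.no_grad():
                probs = torch.cat([F.softmax(head(model(x)), dim=1) for x, _ in loader])
                P = probs.t().double().clamp_min(1e-12)  # K x N matrix of p(y | x_i)
                K, N = P.shape                           # (the 1/N factor is absorbed by the scaling vectors)
                P_lam = P ** lam
                beta = torch.ones(N, dtype=torch.float64)
                for _ in range(sk_iters):                # matrix scaling, as in Section 2.1
                    alpha = (1.0 / K) / (P_lam @ beta)
                    beta = (1.0 / N) / (P_lam.t() @ alpha)
                Q = alpha[:, None] * P_lam * beta[None, :]
                pseudo_labels = Q.argmax(dim=0)          # one cluster label per image

        # Step 1: ordinary cross-entropy training against the current pseudo-labels.
        for b, (x, _) in enumerate(loader):
            idx = b * loader.batch_size + torch.arange(x.shape[0])
            loss = F.cross_entropy(head(model(x)), pseudo_labels[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
```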
3. Results
3.1. Ablation Studies
- “SeLa [K×T]” denotes SeLa with K clusters and T clustering heads.
Table 2: Moving from 1k to 3k clusters improves the results, but larger numbers decrease the quality slightly.
Table 3: Increasing the number of heads from T = 1 to T = 10 yields a large performance gain: +2% for AlexNet and +10% for ResNet.
Table 1: The number of times the self-labelling algorithm (step 2) is run during training (#opts) is varied, from zero to once per step-1 epoch. Self-labelling is essential, and the best value is around 80 (for 160 step-1 epochs in total).
Table 4: SeLa works well across all architectures.
Table 5: Given the labels assigned by applying SeLa to AlexNet, we can re-train AlexNet from scratch using a shorter 90-epoch schedule and achieve the same final accuracy.
3.2. Small Datasets
- SeLa is used with the settings [128×10] for CIFAR-10, [512×10] for CIFAR-100 and [128×1] for SVHN.
SeLa outperforms the best previous method by 5.8% for CIFAR-10, by 9.5% for CIFAR-100 and by 0.8% for SVHN.
3.3. Large Datasets
Across both datasets, SeLa outperforms DeepCluster and Local Aggregation at every layer.
SeLa combined with RotNet obtains even better results.
With a top-1 accuracy of 61.5%, SeLa outperforms all other methods that use the same level of data augmentation, including Local Aggregation, CPCv1 and MoCo.
SeLa even outperforms larger architectures such as BigBiGAN’s RevNet-50×4 and comes close to the performance of models using AutoAugment-style transformations.
3.4. Fine-Tuning on PASCAL VOC
As in the linear probe experiments, SeLa is better than the current state of the art in detection and classification, both when fine-tuning only the last fully connected layers and when fine-tuning the whole network.
SeLa not only learns useful feature representations but also performs well when fine-tuned on actual downstream tasks.