Review — DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features
DeepCluster, K-Means Clustering to Generate Pseudo-Labels, a Pretext Task for Self-Supervised Learning
In this story, Deep Clustering for Unsupervised Learning of Visual Features, DeepCluster, by Facebook AI Research, is reviewed. In this paper:
- DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features, is proposed.
- DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network.
This is a paper in 2018 ECCV with over 900 citations. (Sik-Ho Tsang @ Medium)
Outline
- Notations for Supervised Learning
- DeepCluster as Pretext Task in Self-Supervised Learning
- DeepCluster Analysis
- DeepCluster Performance
1. Notations for Supervised Learning
- Before talking about DeepCluster, let’s define some notations using supervised learning.
- Given a training set X = {x1, x2, ..., xN} of N images, we want to find a parameter θ∗ such that the mapping fθ∗ produces good general-purpose features.
- These parameters are traditionally learned with supervision, i.e. each image xn is associated with a label yn in {0, 1}^k. This label represents the image’s membership to one of k possible predefined classes.
- A parametrized classifier gW predicts the correct labels on top of the features f(xn).
- Therefore, the loss function is (Eq. (1)): min over θ and W of (1/N) Σn ℓ(gW(fθ(xn)), yn), where ℓ is the multinomial logistic loss.
- This cost function is minimized using mini-batch stochastic gradient descent and backpropagation to compute the gradient.
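As a minimal PyTorch sketch of what Eq. (1) computes on one mini-batch (the tiny feature extractor, classifier head, and dummy batch below are illustrative placeholders, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

# f_theta: convnet feature extractor (placeholder); g_W: linear classifier head.
feature_extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                                  nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1),
                                  nn.Flatten())          # f_theta(x) in R^d, here d = 8
classifier = nn.Linear(8, 1000)                          # g_W over k = 1000 classes

criterion = nn.CrossEntropyLoss()                        # multinomial logistic loss
optimizer = torch.optim.SGD(list(feature_extractor.parameters()) +
                            list(classifier.parameters()), lr=0.05, momentum=0.9)

images = torch.randn(16, 3, 32, 32)                      # a dummy mini-batch
labels = torch.randint(0, 1000, (16,))                   # ground-truth labels y_n

logits = classifier(feature_extractor(images))
loss = criterion(logits, labels)                         # Eq. (1) on this mini-batch
optimizer.zero_grad()
loss.backward()                                          # backpropagation
optimizer.step()                                         # mini-batch SGD update
```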
2. DeepCluster as Pretext Task in Self-Supervised Learning
2.1. DeepCluster Procedures
- The idea of this work is to exploit the weak signal already present in the features of a randomly initialized convnet, which perform well above chance, to bootstrap the discriminative power of the convnet.
- We cluster the output of the convnet and use the subsequent cluster assignments as “pseudo-labels” to optimize Eq. (1). This deep clustering (DeepCluster) approach iteratively learns the features and groups them.
- A standard clustering algorithm, k-means, is used.
- k-means takes a set of vectors as input, in our case the features f(xn) produced by the convnet, and clusters them into k distinct groups based on a geometric criterion.
- More precisely, it jointly learns a d×k centroid matrix C and the cluster assignments yn of each image n by solving the following problem (Eq. (2)): min over C ∈ R^(d×k) of (1/N) Σn min over yn ∈ {0,1}^k with ynᵀ1k = 1 of ||fθ(xn) − C yn||².
Overall, DeepCluster alternates between clustering the features to produce pseudo-labels using Eq. (2) and updating the parameters of the convnet by predicting these pseudo-labels using Eq. (1).
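A minimal sketch of one such alternation, assuming a small PyTorch model and scikit-learn's k-means (the official implementation uses Faiss for clustering and an AlexNet/VGG backbone; all names and sizes below are illustrative):

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_step(convnet, images, k=10, lr=0.05):
    """One DeepCluster alternation: cluster the features, then train on the pseudo-labels."""
    # Step 1: compute features f(x_n) for all images (no gradients needed here).
    convnet.eval()
    with torch.no_grad():
        feats = convnet(images)                                    # shape (N, d)

    # Step 2: k-means on the features (Eq. (2)) -> pseudo-labels y_n.
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=k, n_init=10).fit_predict(feats.numpy()))

    # Step 3: put a fresh classification head g_W on top and update convnet + head
    # by predicting the pseudo-labels with the multinomial logistic loss (Eq. (1)).
    head = nn.Linear(feats.shape[1], k)
    optimizer = torch.optim.SGD(list(convnet.parameters()) + list(head.parameters()),
                                lr=lr, momentum=0.9)
    convnet.train()
    loss = nn.functional.cross_entropy(head(convnet(images)), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a tiny stand-in "convnet" and a random batch standing in for the dataset.
convnet = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
images = torch.randn(200, 3, 32, 32)
for epoch in range(3):
    print(deepcluster_step(convnet, images, k=10))
```

In the paper, Step 1 and Step 2 are run once per epoch over the whole dataset, while Step 3 is run with mini-batch SGD over that epoch.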
2.2. Avoiding Trivial Solutions
2.2.1. Empty Cluster
- An optimal decision boundary is to assign all of the inputs to a single cluster, i.e. a trivial solution. This issue is caused by the absence of mechanisms that prevent empty clusters.
- More precisely, when a cluster becomes empty, a non-empty cluster is randomly selected and its centroid, with a small random perturbation, is used as the new centroid for the empty cluster. The points belonging to the non-empty cluster are then reassigned to the two resulting clusters.
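A small NumPy sketch of this re-seeding trick (the function and its arguments are illustrative helpers; DeepCluster applies this inside the k-means optimization):

```python
import numpy as np

def fix_empty_clusters(features, centroids, assignments, rng=np.random.default_rng(0)):
    """If a cluster is empty, re-seed it from a perturbed non-empty centroid."""
    k = centroids.shape[0]
    for j in range(k):
        if np.sum(assignments == j) == 0:                          # cluster j is empty
            # Pick a random non-empty cluster as the donor.
            donor = rng.choice(np.flatnonzero(np.bincount(assignments, minlength=k) > 0))
            # New centroid: donor centroid plus a small random perturbation.
            centroids[j] = centroids[donor] + 1e-4 * rng.standard_normal(centroids.shape[1])
            # Reassign the donor's points to whichever of the two centroids is closer.
            idx = np.flatnonzero(assignments == donor)
            d_donor = np.linalg.norm(features[idx] - centroids[donor], axis=1)
            d_new = np.linalg.norm(features[idx] - centroids[j], axis=1)
            assignments[idx[d_new < d_donor]] = j
    return centroids, assignments
```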
2.2.2. Trivial Parametrization
- If the vast majority of images is assigned to a few clusters, the parameters θ will exclusively discriminate between them.
- A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels.
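One simple way to obtain this uniform sampling over pseudo-labels is PyTorch's WeightedRandomSampler, weighting each image by the inverse of its cluster size (a sketch with toy data that mirrors the idea, not the paper's exact code):

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy pseudo-labels: cluster 0 is heavily over-represented.
pseudo_labels = torch.tensor([0] * 90 + [1] * 7 + [2] * 3)
images = torch.randn(100, 3, 32, 32)

# Weight each image by 1 / (size of its cluster) so every cluster is sampled equally often.
cluster_sizes = torch.bincount(pseudo_labels).float()
weights = 1.0 / cluster_sizes[pseudo_labels]

sampler = WeightedRandomSampler(weights, num_samples=len(pseudo_labels), replacement=True)
loader = DataLoader(TensorDataset(images, pseudo_labels), batch_size=16, sampler=sampler)
```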
3. DeepCluster Analysis
3.1. Normalized Mutual Information (NMI)
- Normalized Mutual Information (NMI) is used to measure the performance: NMI(A; B) = I(A; B) / √(H(A) H(B)), where I denotes the mutual information and H the entropy. (A small example of computing NMI is given after this list.)
- If the two assignments A and B are independent, the NMI is equal to 0. If one of them is deterministically predictable from the other, the NMI is equal to 1.
- (a): The dependence between the clusters and the labels increases over time, showing that the learnt features progressively capture information related to object classes.
- (b): The NMI between the cluster assignments at consecutive epochs keeps increasing, meaning that there are fewer and fewer reassignments and the clusters are stabilizing over time.
- (c): The best performance is obtained with k = 10,000. Given that ImageNet has 1,000 classes, some amount of over-segmentation is apparently beneficial.
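As referenced above, here is a small example of computing the NMI between two assignments with scikit-learn (toy labels only):

```python
from sklearn.metrics import normalized_mutual_info_score

clusters_t  = [0, 0, 1, 1, 2, 2]   # assignments at epoch t
clusters_t1 = [1, 1, 0, 0, 2, 2]   # assignments at epoch t-1 (same partition, relabelled)
independent = [0, 1, 0, 1, 0, 1]   # an unrelated assignment

print(normalized_mutual_info_score(clusters_t, clusters_t1))  # 1.0: deterministically related
print(normalized_mutual_info_score(clusters_t, independent))  # 0.0: independent
```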
3.2. Visualizations
- As expected, deeper layers in the network seem to capture larger textural structures.
- The filters on the top row contain information about structures that highly correlate with object classes. The filters on the bottom row seem to trigger on style, like drawings or abstract shapes.
4. DeepCluster Performance
4.1. Linear Classification on Activations on ImageNet & Places
4.1.1. ImageNet
- A linear classifier is trained on top of different frozen convolutional layers (a minimal sketch of this kind of linear probe is given after this list).
- On ImageNet, DeepCluster outperforms the previous state of the art from conv2 to conv5 by 1–6%, with the largest improvement observed at the conv3 layer.
Finally, the difference in performance between DeepCluster and a supervised AlexNet grows significantly at higher layers: at layers conv2–conv3 the difference is only around 4%, but it rises to 12.3% at conv5, marking where AlexNet probably stores most of the class-level information.
- If an MLP is trained on the last layer, DeepCluster outperforms the state of the art by 8%.
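A minimal sketch of such a linear probe, with a tiny stand-in backbone frozen in place of the pre-trained DeepCluster convnet (all layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pre-trained feature extractor (e.g. conv1-conv3 of AlexNet).
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in backbone.parameters():
    p.requires_grad = False                                        # freeze the pre-trained features

probe = nn.Linear(16, 1000)                                        # only this layer is trained
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images, labels = torch.randn(32, 3, 64, 64), torch.randint(0, 1000, (32,))
with torch.no_grad():
    feats = backbone(images)                                       # frozen activations
loss = criterion(probe(feats), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```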
4.1.2. Places
- DeepCluster yields conv3–4 features that are comparable to those trained with ImageNet labels.
This suggests that when the target task is sufficiently far from the domain covered by ImageNet, labels are less important.
4.2. Pascal VOC
DeepCluster outperforms previous unsupervised methods, such as Context Prediction [13], Context Encoders [46], Colorization [71], Split-Brain Auto [72], Jigsaw Puzzles [42], on all three tasks, in every setting.
- The improvement with fine-tuning over the state of the art is the largest on semantic segmentation (7.5%).
- On detection, DeepCluster performs only slightly better than previously published methods. Interestingly, a fine-tuned random network performs comparably to many unsupervised methods, but performs poorly if only fc6–8 are learned.
4.3. YFCC100M
- In YFCC100M, object classes are severely unbalanced, leading to a data distribution less favorable to DeepCluster.
- This experiment validates that DeepCluster is robust to a change of image distribution, leading to state-of-the-art general-purpose visual features even if this distribution is not favorable to its design.
4.4. AlexNet vs VGGNet
- In the previous experiments, AlexNet is used. Here, a deeper network, VGG-16, is tried.
- Training the VGG-16 with DeepCluster gives a performance above the state of the art, bringing us to only 1.4% below the supervised topline.
4.5. Image Retrieval
- The above table suggests that image retrieval is a task where the pre-training is essential and studying it as a down-stream task could give further insights about the quality of the features produced by unsupervised approaches.
One of the major issues is that k-means clustering takes quite a lot of time.
Reference
[2018 ECCV] [DeepCluster]
Deep Clustering for Unsupervised Learning of Visual Features
Self-Supervised Learning
2008–2010 [Stacked Denoising Autoencoders] 2014 [Exemplar-CNN] 2015 [Context Prediction] 2016 [Context Encoders] [Colorization] [Jigsaw Puzzles] 2017 [L³-Net] [Split-Brain Auto] 2018 [RotNet/Image Rotations] [DeepCluster]