Review — DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features
DeepCluster, K-Means Clustering to Generate Pseudo-Labels, a Pretext Task for Self-Supervised Learning
In this story, Deep Clustering for Unsupervised Learning of Visual Features, DeepCluster, by Facebook AI Research, is reviewed. In this paper:
- DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features, is proposed.
- DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network.
This is a paper in 2018 ECCV with over 900 citations. (Sik-Ho Tsang @ Medium)
Outline
- Notations for Supervised Learning
- DeepCluster as Pretext Task in Self-Supervised Learning
- DeepCluster Analysis
- DeepCluster Performance
1. Notations for Supervised Learning
- Before talking about DeepCluster, let’s define some notations using supervised learning.
- Given a training set X = {x1, x2, ..., xN} of N images, we want to find a parameter θ∗ such that the mapping fθ∗ produces good general-purpose features.
- These parameters are traditionally learned with supervision, i.e. each image xn is associated with a label yn in {0, 1}^k. This label represents the image’s membership to one of k possible predefined classes.
- A parametrized classifier gW predicts the correct labels on top of the features f(xn).
- Therefore, the loss function is (Eq. (1)): min over θ and W of (1/N) Σn ℓ(gW(fθ(xn)), yn), where ℓ is the multinomial logistic loss.
- This cost function is minimized using mini-batch stochastic gradient descent and backpropagation to compute the gradient.
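As a minimal PyTorch sketch of what Eq. (1) computes on one mini-batch (the tiny feature extractor, classifier head, and dummy batch below are illustrative placeholders, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

# f_theta: convnet feature extractor (placeholder); g_W: linear classifier head.
feature_extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                                  nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1),
                                  nn.Flatten())          # f_theta(x) in R^d, here d = 8
classifier = nn.Linear(8, 1000)                          # g_W over k = 1000 classes

criterion = nn.CrossEntropyLoss()                        # multinomial logistic loss
optimizer = torch.optim.SGD(list(feature_extractor.parameters()) +
                            list(classifier.parameters()), lr=0.05, momentum=0.9)

images = torch.randn(16, 3, 32, 32)                      # a dummy mini-batch
labels = torch.randint(0, 1000, (16,))                   # ground-truth labels y_n

logits = classifier(feature_extractor(images))
loss = criterion(logits, labels)                         # Eq. (1) on this mini-batch
optimizer.zero_grad()
loss.backward()                                          # backpropagation
optimizer.step()                                         # mini-batch SGD update
```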
2. DeepCluster as Pretext Task in Self-Supervised Learning
2.1. DeepCluster Procedures
- The idea of this work is to exploit the weak signal already present in the features of a randomly initialized convnet, which perform well above chance, to bootstrap the discriminative power of the convnet.
- We cluster the output of the convnet and use the subsequent cluster assignments as “pseudo-labels” to optimize Eq. (1). This deep clustering (DeepCluster) approach iteratively learns the features and groups them.
- A standard clustering algorithm, k-means, is used.
- k-means takes a set of vectors as input, in our case the features f(xn) produced by the convnet, and clusters them into k distinct groups based on a geometric criterion.
- More precisely, it jointly learns a d×k centroid matrix C and the cluster assignments yn of each image n by solving the following problem (Eq. (2)): min over C ∈ R^(d×k) of (1/N) Σn min over yn ∈ {0,1}^k with ynᵀ1k = 1 of ||fθ(xn) − C yn||².
Overall, DeepCluster alternates between clustering the features to produce pseudo-labels using Eq. (2) and updating the parameters of the convnet by predicting these pseudo-labels using Eq. (1).
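A minimal sketch of one such alternation, assuming a small PyTorch model and scikit-learn's k-means (the official implementation uses Faiss for clustering and an AlexNet/VGG backbone; all names and sizes below are illustrative):

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_step(convnet, images, k=10, lr=0.05):
    """One DeepCluster alternation: cluster the features, then train on the pseudo-labels."""
    # Step 1: compute features f(x_n) for all images (no gradients needed here).
    convnet.eval()
    with torch.no_grad():
        feats = convnet(images)                                    # shape (N, d)

    # Step 2: k-means on the features (Eq. (2)) -> pseudo-labels y_n.
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=k, n_init=10).fit_predict(feats.numpy()))

    # Step 3: put a fresh classification head g_W on top and update convnet + head
    # by predicting the pseudo-labels with the multinomial logistic loss (Eq. (1)).
    head = nn.Linear(feats.shape[1], k)
    optimizer = torch.optim.SGD(list(convnet.parameters()) + list(head.parameters()),
                                lr=lr, momentum=0.9)
    convnet.train()
    loss = nn.functional.cross_entropy(head(convnet(images)), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a tiny stand-in "convnet" and a random batch standing in for the dataset.
convnet = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
images = torch.randn(200, 3, 32, 32)
for epoch in range(3):
    print(deepcluster_step(convnet, images, k=10))
```

In the paper, Step 1 and Step 2 are run once per epoch over the whole dataset, while Step 3 is run with mini-batch SGD over that epoch.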
2.2. Avoiding Trivial Solutions
2.2.1. Empty Cluster
- An optimal decision boundary is to assign all of the inputs to a single cluster, i.e. a trivial solution. This issue is caused by the absence of mechanisms that prevent empty clusters.
- More precisely, when a cluster becomes empty, a non-empty cluster is randomly selected and its centroid, with a small random perturbation, is used as the new centroid for the empty cluster. The points belonging to the non-empty cluster are then reassigned to the two resulting clusters.
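A small NumPy sketch of this re-seeding trick (the function and its arguments are illustrative helpers; DeepCluster applies this inside the k-means optimization):

```python
import numpy as np

def fix_empty_clusters(features, centroids, assignments, rng=np.random.default_rng(0)):
    """If a cluster is empty, re-seed it from a perturbed non-empty centroid."""
    k = centroids.shape[0]
    for j in range(k):
        if np.sum(assignments == j) == 0:                          # cluster j is empty
            # Pick a random non-empty cluster as the donor.
            donor = rng.choice(np.flatnonzero(np.bincount(assignments, minlength=k) > 0))
            # New centroid: donor centroid plus a small random perturbation.
            centroids[j] = centroids[donor] + 1e-4 * rng.standard_normal(centroids.shape[1])
            # Reassign the donor's points to whichever of the two centroids is closer.
            idx = np.flatnonzero(assignments == donor)
            d_donor = np.linalg.norm(features[idx] - centroids[donor], axis=1)
            d_new = np.linalg.norm(features[idx] - centroids[j], axis=1)
            assignments[idx[d_new < d_donor]] = j
    return centroids, assignments
```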
2.2.2. Trivial Parametrization
- If the vast majority of images is assigned to a few clusters, the parameters θ will exclusively discriminate between them.
- A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels.
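One simple way to obtain this uniform sampling over pseudo-labels is PyTorch's WeightedRandomSampler, weighting each image by the inverse of its cluster size (a sketch with toy data that mirrors the idea, not the paper's exact code):

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Toy pseudo-labels: cluster 0 is heavily over-represented.
pseudo_labels = torch.tensor([0] * 90 + [1] * 7 + [2] * 3)
images = torch.randn(100, 3, 32, 32)

# Weight each image by 1 / (size of its cluster) so every cluster is sampled equally often.
cluster_sizes = torch.bincount(pseudo_labels).float()
weights = 1.0 / cluster_sizes[pseudo_labels]

sampler = WeightedRandomSampler(weights, num_samples=len(pseudo_labels), replacement=True)
loader = DataLoader(TensorDataset(images, pseudo_labels), batch_size=16, sampler=sampler)
```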
3. DeepCluster Analysis
3.1. Normalized Mutual Information (NMI)
- Normalized Mutual Information (NMI) is used to measure the performance: NMI(A; B) = I(A; B) / √(H(A) H(B)), where I denotes the mutual information and H the entropy. (A small example of computing NMI is given after this list.)
- If the two assignments A and B are independent, the NMI is equal to 0. If one of them is deterministically predictable from the other, the NMI is equal to 1.
- (a): The dependence between the clusters and the labels increases over time, showing that the learnt features progressively capture information related to object classes.
- (b): The NMI between the cluster assignments at consecutive epochs keeps increasing, meaning that there are fewer and fewer reassignments and the clusters are stabilizing over time.
- (c): The best performance is obtained with k = 10,000. Given that ImageNet has 1,000 classes, some amount of over-segmentation is apparently beneficial.
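As referenced above, here is a small example of computing the NMI between two assignments with scikit-learn (toy labels only):

```python
from sklearn.metrics import normalized_mutual_info_score

clusters_t  = [0, 0, 1, 1, 2, 2]   # assignments at epoch t
clusters_t1 = [1, 1, 0, 0, 2, 2]   # assignments at epoch t-1 (same partition, relabelled)
independent = [0, 1, 0, 1, 0, 1]   # an unrelated assignment

print(normalized_mutual_info_score(clusters_t, clusters_t1))  # 1.0: deterministically related
print(normalized_mutual_info_score(clusters_t, independent))  # 0.0: independent
```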
3.2. Visualizations
- As expected, deeper layers in the network seem to capture larger textural structures.
- The filters on the top row contain information about structures that highly correlate with object classes. The filters on the bottom row seem to trigger on style, like drawings or abstract shapes.
4. DeepCluster Performance
4.1. Linear Classification on Activations on ImageNet & Places
4.1.1. ImageNet
- A linear classifier is trained on top of different frozen convolutional layers (a minimal sketch of this kind of linear probe is given after this list).
- On ImageNet, DeepCluster outperforms the previous state of the art from conv2 to conv5 by 1–6%, with the largest improvement observed at the conv3 layer.
Finally, the difference in performance between DeepCluster and a supervised AlexNet grows significantly at higher layers: at layers conv2–conv3 the difference is only around 4%, but it rises to 12.3% at conv5, marking where AlexNet probably stores most of the class-level information.
- If an MLP is trained on the last layer, DeepCluster outperforms the state of the art by 8%.
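A minimal sketch of such a linear probe, with a tiny stand-in backbone frozen in place of the pre-trained DeepCluster convnet (all layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pre-trained feature extractor (e.g. conv1-conv3 of AlexNet).
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in backbone.parameters():
    p.requires_grad = False                                        # freeze the pre-trained features

probe = nn.Linear(16, 1000)                                        # only this layer is trained
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images, labels = torch.randn(32, 3, 64, 64), torch.randint(0, 1000, (32,))
with torch.no_grad():
    feats = backbone(images)                                       # frozen activations
loss = criterion(probe(feats), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```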
4.1.2. Places
- DeepCluster yields conv3–4 features that are comparable to those trained with ImageNet labels.
This suggests that when the target task is sufficiently far from the domain covered by ImageNet, labels are less important.
4.2. Pascal VOC
DeepCluster outperforms previous unsupervised methods, such as Context Prediction [13], Context Encoders [46], Colorization [71], Split-Brain Auto [72], Jigsaw Puzzles [42], on all three tasks, in every setting.
- The improvement with fine-tuning over the state of the art is the largest on semantic segmentation (7.5%).
- On detection, DeepCluster performs only slightly better than previously published methods. Interestingly, a fine-tuned random network performs comparably to many unsupervised methods, but performs poorly if only fc6–8 are learned.
4.3. YFCC100M
- In YFCC100M, object classes are severely unbalanced, leading to a data distribution less favorable to DeepCluster.
- This experiment validates that DeepCluster is robust to a change of image distribution, leading to state-of-the-art general-purpose visual features even if this distribution is not favorable to its design.
4.4. AlexNet vs VGGNet
- In the previous experiments, AlexNet is used. Here, a deeper network, VGG-16, is tried.
- Training the VGG-16 with DeepCluster gives a performance above the state of the art, bringing us to only 1.4% below the supervised topline.
4.5. Image Retrieval
- The above table suggests that image retrieval is a task where the pre-training is essential and studying it as a down-stream task could give further insights about the quality of the features produced by unsupervised approaches.
One of the major issues is that k-means clustering takes quite a lot of time.
Reference
[2018 ECCV] [DeepCluster]
Deep Clustering for Unsupervised Learning of Visual Features
Self-Supervised Learning
2008–2010 [Stacked Denoising Autoencoders] 2014 [Exemplar-CNN] 2015 [Context Prediction] 2016 [Context Encoders] [Colorization] [Jigsaw Puzzles] 2017 [L³-Net] [Split-Brain Auto] 2018 [RotNet/Image Rotations] [DeepCluster]