Brief Review — DeeperCluster: Unsupervised Pre-Training of Image Features on Non-Curated Data
DeeperCluster, DeepCluster With RotNet
Unsupervised Pre-Training of Image Features on Non-Curated Data
DeeperCluster, by Facebook AI Research and Univ. Grenoble Alpes, Inria
2019 ICCV, Over 260 Citations (Sik-Ho Tsang @ Medium)
- DeeperCluster is proposed, which leverages self-supervision and clustering to capture complementary statistics from large-scale, non-curated data.
- This is a paper by the authors of DeepCluster.
Outline
- Preliminaries
- DeeperCluster
- Results
1. Preliminaries
1.1. Self-Supervision Signals, e.g.: RotNet
- A set of N images {x1, …, xN} is given and a pseudo-label yn in Y is assigned to each input xn. In this case, the pseudo-label is the image rotation in {0°, 90°, 180°, 270°}.
Given these pseudo-labels, the parameters of the convnet θ are learnt jointly with a linear classifier V to predict pseudo-labels by solving the problem:
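(A reconstructed sketch of this objective; fθ(xn) denotes the convnet features of image xn.)
$$\min_{\theta, V} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n, V f_\theta(x_n)\big)$$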
- where l is a loss function. The pseudo-labels yn are fixed during the optimization.
1.2. Clustering-Based Approaches, e.g.: DeepCluster
- Clustering-based approaches for deep networks typically build target classes by clustering visual features produced by convnets.
We have a latent pseudo-label zn in Z for each image n as well as a corresponding linear classifier W. These clustering-based methods alternate between learning the parameters θ and W, and updating the pseudo-labels zn. Between two reassignments, the pseudo-labels zn are fixed, and the parameters and classifier are optimized by solving:
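(A reconstructed sketch, reusing the fθ notation from above.)
$$\min_{\theta, W} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(z_n, W f_\theta(x_n)\big)$$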
- Then, the pseudo-labels zn can be reassigned by minimizing an auxiliary loss function.
- In DeepCluster, the latent targets are obtained by clustering the activations with k-means. More precisely, the targets zn are updated by solving the following optimization problem:
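(A reconstructed sketch of this k-means step; d denotes the feature dimension.)
$$\min_{C \in \mathbb{R}^{d \times k}} \; \frac{1}{N} \sum_{n=1}^{N} \; \min_{z_n \in \{0,1\}^k,\; z_n^\top \mathbf{1} = 1} \big\| C z_n - f_\theta(x_n) \big\|_2^2$$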
- where C is the matrix where each column corresponds to a centroid, k is the number of centroids, and zn is a binary vector with a single non-zero entry. This approach assumes that the number of clusters k is known a priori; in practice, we set it by validation on a downstream task. The latent targets are updated every T epochs.
2. DeeperCluster
2.1. Combining Self-Supervision and Clustering
- In this case, the inputs x1, …, xN are rotated images, each associated with a target label yn encoding its rotation angle and a cluster assignment zn.
- Y is the set of possible rotation angles and Z is the set of possible cluster assignments.
The Cartesian product space Y×Z is used, which can potentially capture richer interactions between the two tasks:
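(A reconstructed sketch; a single classifier W now predicts the joint rotation-and-cluster label yn ⊗ zn over Y×Z.)
$$\min_{\theta, W} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n \otimes z_n, W f_\theta(x_n)\big)$$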
Yet, its complexity becomes large if a large number of clusters is used or if the self-supervised task has a large output space.
2.2. Scaling Up to a Large Number of Targets
The target labels are partitioned into a 2-level hierarchy where we first predict a super-class and then a sub-class among its associated target labels.
The parameters θ and the linear classifiers (V, W1, …, WS) are jointly learned by minimizing the following loss function:
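(A reconstructed sketch, assuming yn now denotes the super-class of image n and zn its sub-class within that super-class.)
$$\min_{\theta, V, W_1, \ldots, W_S} \; \frac{1}{N} \sum_{n=1}^{N} \Big[ \ell\big(y_n, V f_\theta(x_n)\big) + \sum_{s=1}^{S} \mathbf{1}_{\{y_n = s\}} \, \ell\big(z_n, W_s f_\theta(x_n)\big) \Big]$$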
- where l is the negative log-softmax function.
- We can see that it is a form of multi-task learning.
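As a concrete illustration, here is a minimal PyTorch sketch of this two-level loss. All names are hypothetical: super_head plays the role of V, sub_heads the roles of W1, …, WS, and the feature/label sizes are made up for the example; this is only a sketch of the structure, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only for illustration.
feat_dim, num_super, num_sub = 512, 8, 1000

super_head = nn.Linear(feat_dim, num_super)  # plays the role of V (super-class classifier)
sub_heads = nn.ModuleList(
    [nn.Linear(feat_dim, num_sub) for _ in range(num_super)]
)  # play the roles of W_1, ..., W_S (one sub-class classifier per super-class)

def hierarchical_loss(features, super_labels, sub_labels):
    """Predict the super-class for every image, and the sub-class only with
    the classifier attached to that image's own super-class."""
    loss = F.cross_entropy(super_head(features), super_labels)
    sub_loss = features.new_zeros(())
    for s in range(num_super):
        mask = super_labels == s
        if mask.any():
            sub_loss = sub_loss + F.cross_entropy(
                sub_heads[s](features[mask]), sub_labels[mask], reduction="sum"
            )
    return loss + sub_loss / features.size(0)

# Toy usage with random features and random pseudo-labels.
feats = torch.randn(32, feat_dim)
y_super = torch.randint(0, num_super, (32,))
z_sub = torch.randint(0, num_sub, (32,))
print(hierarchical_loss(feats, y_super, z_sub))
```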
2.3. Model Architecture
- VGG-16 with Batch Norm is used, and it is trained on the 96M images from YFCC100M.
3. Results
3.1. PASCAL VOC
The gap with a supervised network is still significant when the convolutional layers are frozen (6% for detection and 10% for classification), but drops to less than 5% for both tasks with fine-tuning.
3.2. ImageNet & Places
DeeperCluster matches the performance of a supervised network for all layers on Places205.
On ImageNet, it also matches supervised features up to the 4th convolutional block.
3.3. Clustering Visualizations
Some clustering visualizations are shown above.