# Brief Review — DeeperCluster: Unsupervised Pre-Training of Image Features on Non-Curated Data

## DeeperCluster, DeepCluster With RotNet

---

Unsupervised Pre-Training of Image Features on Non-Curated Data, by Facebook AI Research, and Univ. Grenoble Alpes, Inria

DeeperCluster, 2019 ICCV, Over 260 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning, 1993…2022: [BEiT] [BEiT V2] [Masked Autoencoders (MAE)] [DiT] [SimMIM]

==== My Other Paper Readings Are Also Over Here ====

**DeeperCluster** is proposed, which **leverages self-supervision and clustering** to capture complementary statistics from large-scale data for self-supervised learning.

- This is a paper by the authors of DeepCluster.

# Outline

1. **Preliminaries**
2. **DeeperCluster**
3. **Results**

# 1. Preliminaries

## 1.1. Self-Supervision Signals, e.g.: **RotNet**

- A set of *N* images {*x*1, …, *xN*} is given and a **pseudo-label** *yn* in *Y* is assigned to each input *xn*. In this case, this pseudo-label is the **image rotation {0°, 90°, 180°, 270°}**.

Given these pseudo-labels, the parameters *θ* of the convnet are learnt jointly with a linear classifier *V* to predict the pseudo-labels, by solving the problem:

min over (*θ*, *V*) of (1/*N*) Σ*n* *l*(*yn*, *V fθ*(*xn*))

- where *l* is a loss function. **The pseudo-labels** *yn* **are fixed** during the optimization.
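As a concrete sketch of this pretext task (a minimal NumPy example; `make_rotation_batch` is a hypothetical helper, not code from the paper):

```python
import numpy as np

def make_rotation_batch(images):
    """RotNet-style pretext data: every image is rotated by
    0/90/180/270 degrees, and the pseudo-label y_n is the
    rotation index in {0, 1, 2, 3}."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):            # k quarter-turns = k * 90 degrees
            rotated.append(np.rot90(img, k=k))
            labels.append(k)
    return np.stack(rotated), np.array(labels)

# Two toy 8x8 "images"; a convnet f_theta and classifier V would then be
# trained to predict `labels` from `batch` with a standard classification loss.
imgs = [np.arange(64, dtype=np.float32).reshape(8, 8) for _ in range(2)]
batch, labels = make_rotation_batch(imgs)
print(batch.shape, labels.tolist())  # (8, 8, 8) [0, 1, 2, 3, 0, 1, 2, 3]
```

Each source image yields four training samples, so the pretext dataset is four times the original size.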

## 1.2. Clustering-Based Approaches, e.g.: DeepCluster

- Clustering-based approaches for deep networks typically **build target classes by clustering visual features** produced by convnets.

We have a latent pseudo-label *zn* in *Z* for each image *n*, as well as a corresponding linear classifier *W*. These clustering-based methods alternate between learning the parameters *θ* and *W*, and updating the pseudo-labels *zn*. Between two reassignments, the pseudo-labels *zn* are fixed, and the parameters and classifier are optimized by solving:

min over (*θ*, *W*) of (1/*N*) Σ*n* *l*(*zn*, *W fθ*(*xn*))

- Then, the pseudo-labels *zn* can be reassigned by minimizing an auxiliary loss function.
- In DeepCluster, latent targets are obtained by clustering the activations with k-means. More precisely, **the targets** *zn* **are updated by solving the following optimization problem**:

min over *C* of (1/*N*) Σ*n* min over *zn* ||*fθ*(*xn*) − *C zn*||²

- where *C* is the **matrix** where **each column corresponds to a centroid**, *k* is the **number of centroids**, and *zn* is a **binary vector with a single non-zero entry**. This approach assumes that **the number of clusters** *k* **is known a priori**; in practice, it is set by validation on a downstream task. **The latent targets are updated every** *T* **epochs.**
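The two alternating steps above can be sketched in NumPy (a minimal illustration of the k-means assignment and centroid update, not the paper's distributed implementation; the helper names are hypothetical):

```python
import numpy as np

def assign_pseudo_labels(features, centroids):
    """Assignment step: for each feature f_n, pick the one-hot z_n
    minimizing ||f_n - C z_n||^2, i.e. the index of the nearest
    centroid (here each row of `centroids` is one centroid)."""
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def update_centroids(features, labels, k):
    """Centroid update: each centroid becomes the mean of its
    assigned features (assumes no cluster is left empty)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(k)])

# Toy features drawn around two well-separated points.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0.0, 0.1, (5, 2)),
                        rng.normal(5.0, 0.1, (5, 2))])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
z = assign_pseudo_labels(feats, C)   # pseudo-labels z_n
C = update_centroids(feats, z, k=2)  # refit centroids
```

In DeepCluster the `features` would be convnet activations, and the resulting `z` serves as the classification targets for the next *T* epochs.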

# 2. DeeperCluster

## 2.1. Combining Self-Supervision and Clustering

- In this case, the **inputs** *x*1, …, *xN* are **rotated images, each associated with a target label** *yn* **encoding its rotation angle** and a **cluster assignment** *zn*. *Y* is **the set of possible rotation angles** and *Z* is **the set of possible cluster assignments.**

The Cartesian product space *Y* × *Z* is used, which can potentially capture richer interactions between the two tasks:

min over (*θ*, *W*) of (1/*N*) Σ*n* *l*(*yn* ⊗ *zn*, *W fθ*(*xn*))

Yet, its complexity is large if a large number of clusters is used, or if the self-supervised task has a large output space.
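A tiny sketch of why the product space blows up (the cluster count here is a hypothetical number for illustration, and this flat index encoding is one common choice, not necessarily the paper's):

```python
def product_label(rotation_idx, cluster_idx, n_clusters):
    """Flatten a (y, z) pair from Y x Z into a single class index."""
    return rotation_idx * n_clusters + cluster_idx

# With 4 rotations and, say, 10k clusters (hypothetical numbers),
# a single flat classifier would need 40k output classes.
n_rotations, n_clusters = 4, 10_000
n_classes = n_rotations * n_clusters
print(n_classes)  # 40000
```

The output layer (and its softmax) grows linearly in |*Y*| × |*Z*|, which motivates the hierarchical factorization in the next section.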

## 2.2. Scaling Up to a Large Number of Targets

The target labels are partitioned into a **2-level hierarchy** where we **first predict a super-class and then a sub-class** among its associated target labels. The parameters *θ* of the convnet and of the linear classifiers (*V*, *W*1, …, *WS*) are jointly learned by minimizing the following loss function:

min over (*θ*, *V*, *W*1, …, *WS*) of (1/*N*) Σ*n* [ *l*(*yn*, *V fθ*(*xn*)) + Σ*s* 1[*yn* = *s*] *l*(*zn*, *Ws fθ*(*xn*)) ]

- where *l* is the **negative log-softmax function.**
- We can see that it is a form of **multi-task learning.**
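The 2-level loss can be sketched as follows (a minimal NumPy illustration under the assumption of one sub-class head per super-class; shapes and helper names are hypothetical):

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D logits vector."""
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def hierarchical_loss(feats, V, Ws, y_super, z_sub):
    """2-level loss: a negative log-softmax term for the super-class
    head V, plus, per sample, the sub-class head W_s belonging to
    that sample's super-class s = y_n."""
    total = 0.0
    for f, y, z in zip(feats, y_super, z_sub):
        total += -log_softmax(V @ f)[y]       # super-class term
        total += -log_softmax(Ws[y] @ f)[z]   # sub-class term, head W_y
    return total / len(feats)

rng = np.random.default_rng(0)
d, S, k = 6, 2, 3                  # feature dim, super-classes, sub-classes per head
feats = rng.normal(size=(4, d))    # stand-ins for convnet features f_theta(x_n)
V = rng.normal(size=(S, d))        # super-class classifier
Ws = rng.normal(size=(S, k, d))    # one linear head per super-class
loss = hierarchical_loss(feats, V, Ws, y_super=[0, 1, 0, 1], z_sub=[2, 0, 1, 2])
```

Only the sub-class head matching a sample's super-class receives gradient from that sample, which is what keeps the hierarchy cheaper than one flat classifier over *Y* × *Z*.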

## 2.3. Model Architecture

**VGG-16 with Batch Norm** is used, and it is **trained on the 96M images from YFCC100M.**

# 3. Results

## 3.1. PASCAL VOC

The gap with a supervised network is still important when freezing the convolutions (6% for detection and 10% for classification), but drops to less than 5% for both tasks with finetuning.

## 3.2. ImageNet & Places

DeeperCluster matches the performance of a supervised network for all layers on Places205. On ImageNet, it also matches supervised features up to the 4th convolutional block.

## 3.3. Clustering Visualizations

Some clustering visualizations are shown above.