# Brief Review — Local Aggregation for Unsupervised Learning of Visual Embeddings

## Local Aggregation (LA), Self-Supervised Learning By Iterative Clustering & Contrastive Learning

Local Aggregation for Unsupervised Learning of Visual Embeddings, by Stanford University

Local Aggregation (LA), 2019 ICCV, Over 300 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning, Clustering, Image Classification, Object Detection

**Local Aggregation (LA)** is proposed, which **trains an embedding function** that causes **similar data instances** to **move together** in the embedding space while allowing **dissimilar instances** to **separate**. This aggregation metric is dynamic, allowing soft clusters of different scales to emerge.

- The embedding function, a deep neural network, is then used for downstream tasks.

# Outline

1. **Local Aggregation (LA)**
2. **Results**

# 1. Local Aggregation (LA)

## 1.1. Overall Framework

- A deep neural network *fθ* with parameters *θ* is used to **embed each image** in *I* = {*x*1, *x*2, …, *xN*} **into a lower** *D***-dimensional space**, i.e. the embedding space, giving features *V* = {*v*1, *v*2, …, *vN*}, where *vi* = *fθ*(*xi*).
- Then, its **close neighbors** *Ci* (blue dots) and **background neighbors** *Bi* (black dots) are **identified**.
- An **iterative process** is designed to **push the current embedding vector (red dot) closer to its close neighbors** and **further from its background neighbors**.

Intuitively, close neighbors are those whose embeddings *vi* should be made similar to, while background neighbors are used to set the distance scale with respect to which the judgement of closeness should be measured.

- **Using** *L*(*Ci*, *Bi*|*θ*, *xi*), which characterizes the relative level of closeness within *Ci* compared to that in *Bi*, the level of local aggregation near each input *xi* is defined.
- The parameters of *fθ* are tuned based on *L*(*Ci*, *Bi*|*θ*, *xi*).

## 1.2. Clustering

- *k*-means clustering is used to **cluster all embedded points** *V* into *m* groups *G* = {*G*1, *G*2, …, *Gm*}. *g*(*vi*) denotes the **cluster label of** *vi*.
- In the **simplest version** of this procedure, *Ci* is defined to be the **set** *Gg*(*vi*). However, clustering can be a **noisy** and somewhat arbitrary process.

Clustering is performed multiple times to obtain more stable results.

- e.g., another clustering is performed on top of the above results.

- Then, the **results of the multiple clusterings are unioned**.
- Specifically, let {*G^*(*j*)} be the **clusters for** *H* distinct clusterings, where *G^*(*j*) = {*G^*(*j*)1, *G^*(*j*)2, …, *G^*(*j*)*m^*(*j*)} with *j* ∈ {1, 2, …, *H*}, and {*g^*(*j*)} defined accordingly. Then *Ci* is the union of *vi*'s cluster-mates across all *H* clusterings, i.e. *Ci* = ∪*j* *G^*(*j*)*g^*(*j*)(*vi*).

- The number *m* of clusters and the number *H* of clusterings are **hyperparameters** of the algorithm.
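The union rule above can be sketched in a few lines of numpy. This is an illustrative helper (`close_neighbors` is not from the paper's code), assuming the *H* cluster-label arrays have already been produced by *k*-means:

```python
import numpy as np

def close_neighbors(cluster_labels, i):
    """Close-neighbor set C_i: union over H clusterings of the samples
    sharing i's cluster. cluster_labels: (H, N) array of k-means labels."""
    C = set()
    for g in cluster_labels:                       # one clustering at a time
        C |= set(np.flatnonzero(g == g[i]).tolist())
    return C

# Toy example: N = 5 points, H = 2 clusterings.
labels = np.array([[0, 0, 1, 1, 2],
                   [0, 1, 1, 0, 0]])
print(sorted(close_neighbors(labels, 0)))  # -> [0, 1, 3, 4]
```

Point 0 shares a cluster with point 1 in the first clustering and with points 3 and 4 in the second, so all of them enter *C*0.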

## 1.3. Local Aggregation Metric

- Following Instance Discrimination, the **probability** that an arbitrary **feature** *v* is recognized as the *i*-th image is defined to be:

*P*(*i*|*v*) = exp(*vi*ᵀ*v*/*τ*) / Σ*j* exp(*vj*ᵀ*v*/*τ*)

- Both {*vi*} and *v* are projected onto the **L2-unit sphere** in the *D*-dimensional embedding space (i.e. normalized such that ||*v*||² = 1).
- Given an **image set** *A*, the **probability** of **feature** *v* being recognized as an image in *A* is:

*P*(*A*|*v*) = Σ*i*∈*A* *P*(*i*|*v*)
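These two probabilities can be sketched numerically as follows. The temperature *τ* = 0.07 is the value used by Instance Discrimination, and the function names are illustrative, not the paper's API:

```python
import numpy as np

def p_instance(memory, v, tau=0.07):
    """P(i|v): softmax over similarities to all N stored embeddings."""
    logits = memory @ v / tau
    logits -= logits.max()                 # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def p_set(memory, v, A, tau=0.07):
    """P(A|v) = sum over i in A of P(i|v)."""
    return p_instance(memory, v, tau)[list(A)].sum()

rng = np.random.default_rng(0)
memory = rng.normal(size=(10, 4))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)  # L2-unit sphere
v = memory[3]                              # query with a stored embedding
print(p_instance(memory, v).argmax())      # -> 3 (most similar to itself)
```

Since the probabilities form a softmax, summing *P*(*i*|*v*) over every stored instance gives 1, so *P*(*A*|*v*) for the full set is exactly 1.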

- Finally, *L*(*Ci*, *Bi*|*θ*, *xi*) is formulated as the **negative log-likelihood of** *vi* being recognized as a **close neighbor (i.e. in** *Ci***)**, given that *vi* is recognized as a **background neighbor (i.e. in** *Bi***)**:

*L*(*Ci*, *Bi*|*θ*, *xi*) = −log(*P*(*Ci* ∩ *Bi*|*vi*) / *P*(*Bi*|*vi*))

- The loss to be minimized is then the expectation of *L* over the inputs, together with an L2 weight-decay term *λ*||*θ*||².

- Intuitively, **background neighbors** *Bi* are an **unbiased** sample of nearby points that (dynamically) set the scale at which "closeness" should be judged.
- Thus, the above loss has a similar function to the one used in contrastive learning.
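Putting the pieces together, the per-sample objective can be sketched as below. This is a minimal illustration, not the paper's implementation; the intersection *Ci* ∩ *Bi* follows the definition above:

```python
import numpy as np

def la_loss(memory, v, C, B, tau=0.07):
    """L(C_i, B_i | theta, x_i) = -log P(C_i & B_i | v) / P(B_i | v)."""
    p = np.exp(memory @ v / tau)
    p /= p.sum()                           # P(i|v) for every stored instance
    num = p[list(C & B)].sum()             # P(C_i intersect B_i | v)
    den = p[list(B)].sum()                 # P(B_i | v)
    return -np.log(num / den)

# Four unit vectors; index 1 is close to index 0, index 3 is opposite.
memory = np.array([[1.0, 0.0], [0.9, 0.436], [0.0, 1.0], [-1.0, 0.0]])
memory /= np.linalg.norm(memory, axis=1, keepdims=True)
v, B = memory[0], {0, 1, 2, 3}
print(la_loss(memory, v, {0, 1}, B) < la_loss(memory, v, {0, 3}, B))  # -> True
```

The comparison shows the intuition: the loss is smaller when the close-neighbor set contains points that are actually near *v* in the embedding space.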

## 1.4. Memory Bank

- The computations involve all the embedded features *V*, which soon becomes intractable for large datasets.
- **Following** **Instance Discrimination**, **a running average for** *V* **is maintained**, which is called the memory bank.
- The memory bank is initialized with random *D*-dimensional unit vectors and its values are **updated by mixing ¯***vi* and *vi* during training:

¯*vi* ← (1 − *t*)·¯*vi* + *t*·*vi*

- where *t* ∈ [0, 1] is a fixed mixing hyperparameter.
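A minimal sketch of the update, assuming *t* weights the fresh embedding and that the mixed vector is re-projected onto the unit sphere (as Instance Discrimination does); the helper name is illustrative:

```python
import numpy as np

def update_memory(memory, idx, v_new, t=0.5):
    """Mix the stored running average with the fresh embedding v_i,
    then renormalize back onto the L2-unit sphere."""
    mixed = (1.0 - t) * memory[idx] + t * v_new
    memory[idx] = mixed / np.linalg.norm(mixed)
    return memory

# Bank of random D-dimensional unit vectors, as in the initialization above.
rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 3))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)
memory = update_memory(memory, 2, np.array([1.0, 0.0, 0.0]), t=0.5)
print(round(float(np.linalg.norm(memory[2])), 6))  # -> 1.0
```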

# 2. Results

## 2.1. ImageNet

LA significantly outperforms other methods with all architectures, especially deeper ones.

- Using KNN classifiers, LA outperforms the IR task by a large margin with all architectures.
- There is a consistent performance increase for the LA method both from overall deeper architectures, and from earlier layers to deeper layers within an architecture.
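The KNN evaluation can be sketched as a similarity-weighted vote in embedding space. The actual protocol (following Instance Discrimination) uses exponentially weighted votes over many neighbors; this simplified version uses raw cosine similarity, and all names are illustrative:

```python
import numpy as np

def knn_predict(train_emb, train_labels, query, k=3):
    """Predict a label by a similarity-weighted vote of the k nearest
    training embeddings (cosine similarity on L2-normalized features)."""
    sims = train_emb @ query
    top = np.argsort(-sims)[:k]
    votes = {}
    for i in top:
        votes[train_labels[i]] = votes.get(train_labels[i], 0.0) + sims[i]
    return max(votes, key=votes.get)

# Two classes on the unit circle: class 0 near angle 0, class 1 near pi/2.
angles = np.array([0.0, 0.1, 0.2, 1.4, 1.5, 1.6])
train_emb = np.stack([np.cos(angles), np.sin(angles)], axis=1)
train_labels = [0, 0, 0, 1, 1, 1]
query = np.array([np.cos(0.05), np.sin(0.05)])
print(knn_predict(train_emb, train_labels, query))  # -> 0
```

Because the embedding is frozen, this kind of classifier measures how well categories are already separated in the learned space.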

## 2.2. Places

The results indicate a strong generalization ability of the visual representations learned via the LA method.

## 2.3. PASCAL VOC 2007

- Faster R-CNN pipeline is used.

The LA method achieves state-of-the-art unsupervised transfer learning performance on the PASCAL detection task.

## 2.4. Analysis

- The LA optimization objective seeks to minimize the distances between
*vi*and*Ci*while maximizing those between*vi*and*Bi*.

The local density of the LA embedding is much higher than that created by the IR method, while the background density is only slightly higher.

The **successful** examples show that the LA-trained model **robustly groups images belonging to the same category regardless of backgrounds and viewpoints**. Interestingly, the network also shows substantial ability to recognize high-level visual context.

- This is even more obvious in the **failure** cases, where it can be seen that the network **coherently groups images according to salient characteristics**. Failure is mainly due to the inherently **ill-posed nature of the ImageNet category labelling**.

LA successfully clusters images with trombones regardless of background, number of trombones, or viewpoint.

## References

[2019 ICCV] [Local Aggregation (LA)] Local Aggregation for Unsupervised Learning of Visual Embeddings

[YouTube] https://www.youtube.com/watch?v=nTHZkf8QYzY

## Unsupervised/Self-Supervised Learning

**1993** … **2019** [Local Aggregation (LA)] … **2021** [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR] [W-MSE] [SimSiam+AL] [BYOL+LP] **2022** [BEiT] [BEiT V2]