Brief Review — Local Aggregation for Unsupervised Learning of Visual Embeddings
Local Aggregation (LA), Self-Supervised Learning By Iterative Clustering & Contrastive Learning
Local Aggregation for Unsupervised Learning of Visual Embeddings,
Local Aggregation (LA), by Stanford University
2019 ICCV, Over 300 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Clustering, Image Classification, Object Detection
- Local Aggregation (LA) is proposed, which trains an embedding function so that similar data instances move together in the embedding space while dissimilar instances separate. The aggregation metric is dynamic, allowing soft clusters of different scales to emerge.
- The embedding function, which is a deep neural network, is then used for downstream tasks.
Outline
- Local Aggregation (LA)
- Results
1. Local Aggregation (LA)
1.1. Overall Framework
- A deep neural network fθ with parameters θ is used to embed each image xi of the dataset I={x1, x2, …, xN} into a lower, D-dimensional embedding space, giving the features V={v1, v2, …, vN}, i.e. vi=fθ(xi).
- Then, its close neighbors Ci (blue dots) and background neighbors Bi (black dots) are identified.
- An iterative process is designed to seek to push the current embedding vector (red dot) closer to its close neighbors and further from its background neighbors.
Intuitively, close neighbors are those whose embeddings should be made similar to vi, while background neighbors are used to set the distance scale with respect to which the judgement of closeness should be measured.
- Using Bi and Ci, the level of local aggregation L(Ci, Bi|θ, xi) near each input xi is defined, which characterizes the relative level of closeness within Ci, compared to that in Bi.
- The parameters θ of fθ are then tuned by optimizing L(Ci, Bi|θ, xi).
1.2. Clustering
- k-means clustering is used to cluster all embedded points V to m groups G={G1, G2, …, Gm}.
- g(vi) denotes the cluster label of vi.
- In the simplest version of this procedure, Ci is defined to be the set Gg(vi) containing vi. However, because clustering can be a noisy and somewhat arbitrary process, the clustering is performed multiple times to obtain a more stable result.
- e.g.: another, independent clustering is performed in addition to the one above.
- Then, the results of the multiple clusterings are combined by taking their union (a small code sketch of this step is given at the end of this subsection).
- Specifically, let {G^(j)} be the clusters of H distinct clusterings, where G^(j) = {G^(j)_1, G^(j)_2, …, G^(j)_m^(j)} with j ∈ {1, 2, …, H}, and let {g^(j)} be defined accordingly. Then, Ci is:
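Written out (a reconstruction from the description above, using the notation of this bullet point):

Ci = ∪_{j=1,…,H} G^(j)_{g^(j)(vi)}

i.e. Ci contains every example that falls into the same cluster as vi in at least one of the H clusterings.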
- The number m of clusters and number H of clusterings are hyperparameters of the algorithm.
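As a minimal sketch of this step (assuming scikit-learn's KMeans and NumPy; this is illustrative only and not the authors' implementation, which may use a different k-means backend):

```python
import numpy as np
from sklearn.cluster import KMeans

def close_neighbors(V, m=10, H=3, seed=0):
    """For every embedded point v_i, return C_i: the indices that share a
    cluster with v_i in at least one of H independent k-means clusterings."""
    N = V.shape[0]
    labels = np.stack([
        KMeans(n_clusters=m, n_init=1, random_state=seed + j).fit_predict(V)
        for j in range(H)
    ])                                                      # shape (H, N): cluster id per clustering
    C = []
    for i in range(N):
        same = (labels == labels[:, i:i + 1]).any(axis=0)   # same cluster as i in any clustering
        C.append(np.flatnonzero(same))
    return C

# Toy usage: 100 random unit vectors in a 16-D embedding space.
V = np.random.randn(100, 16)
V /= np.linalg.norm(V, axis=1, keepdims=True)
C = close_neighbors(V, m=5, H=3)
print(len(C[0]), "close neighbors of point 0 (including itself)")
```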
1.3. Local Aggregation Metric
- Following Instance Discrimination, the probability that an arbitrary feature v is recognized as the i-th image is defined to be:
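Written out (the non-parametric softmax of Instance Discrimination, with τ a temperature hyperparameter):

P(i|v) = exp(viᵀv / τ) / Σ_{j=1,…,N} exp(vjᵀv / τ)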
- Both {vi} and v are projected onto the L2-unit sphere in the D-dimensional embedding space (i.e. normalized such that ||v||² = 1).
- Given an image set A, the probability of feature v being recognized as an image in A is:
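Presumably, this is simply the sum of the per-image probabilities over the set A:

P(A|v) = Σ_{i∈A} P(i|v)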
- Finally, L(Ci, Bi|θ, xi) is formulated as the negative log-likelihood of vi being recognized as a close neighbor (i.e. in Ci), given that vi is recognized as a background neighbor (i.e. in Bi):
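Using the definition of P(A|v) above, this conditional negative log-likelihood reads (a reconstruction from the description; Ci ∩ Bi denotes the close neighbors that are also background neighbors):

L(Ci, Bi|θ, xi) = −log( P(Ci ∩ Bi | vi) / P(Bi | vi) )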
- The loss to be minimized is then:
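Presumably, this is the per-example terms averaged over the dataset, together with a standard weight-decay regularizer on θ (the exact weighting is an assumption here):

L(θ) = (1/N) Σ_{i=1,…,N} L(Ci, Bi|θ, xi) + λ||θ||²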
- Intuitively, background neighbors Bi are an unbiased sample of nearby points that (dynamically) set the scale at which “closeness” should be judged.
- Thus, the above loss serves a similar function to the loss used in contrastive learning; a small sketch of the per-example computation is given below.
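Putting the pieces of this subsection together, a minimal NumPy sketch of the per-example loss (assuming a memory bank of unit-norm embeddings and given index sets Ci and Bi; not the authors' code):

```python
import numpy as np

def la_loss_single(v_i, bank, C_i, B_i, tau=0.07):
    """Local aggregation loss for one example.
    v_i  : (D,) unit-norm embedding of image i
    bank : (N, D) memory bank of unit-norm embeddings
    C_i  : indices of close neighbors of i
    B_i  : indices of background neighbors of i
    tau  : softmax temperature (value here is illustrative)
    """
    logits = bank @ v_i / tau                 # similarity of v_i to every stored embedding
    p = np.exp(logits - logits.max())         # unnormalized softmax (shifted for stability)
    p = p / p.sum()                           # P(j | v_i) for every j
    close_and_bg = np.intersect1d(C_i, B_i)   # close neighbors that are also background neighbors
    # L = -log P(Ci ∩ Bi | v_i) / P(Bi | v_i)
    return -np.log(p[close_and_bg].sum() / p[B_i].sum())

# Toy usage with random unit vectors.
rng = np.random.default_rng(0)
bank = rng.standard_normal((100, 16))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
loss = la_loss_single(bank[0], bank, C_i=np.arange(10), B_i=np.arange(30))
print(float(loss))
```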
1.4. Memory Bank
- The computation of the loss involves all the embedded features V, which soon becomes intractable for large datasets.
- Following Instance Discrimination, a running average for V is maintained, which is called the memory bank.
- The memory bank is initialized with random D-dimensional unit vectors and then its values are updated by mixing v̄i and vi during training:
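The update rule is presumably the usual moving-average mixing (the direction of t is a convention; swapping t and 1 − t is equivalent up to renaming):

v̄i ← (1 − t) · v̄i + t · vi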
- where t ∈ [0, 1] is a fixed mixing hyperparameter.
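A minimal sketch of this update (NumPy; whether the stored vectors are re-normalized after mixing is an assumption here, made to keep every entry on the L2-unit sphere as above):

```python
import numpy as np

def update_memory_bank(bank, idx, v_new, t=0.5):
    """Mix stored embeddings bank[idx] with freshly computed v_new and
    re-normalize each entry back onto the L2-unit sphere."""
    mixed = (1.0 - t) * bank[idx] + t * v_new
    bank[idx] = mixed / np.linalg.norm(mixed, axis=1, keepdims=True)
    return bank
```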
2. Results
2.1. ImageNet
LA significantly outperforms other methods with all architectures, especially in deeper architectures.
- Using KNN classifiers, LA outperforms the Instance Recognition (IR, i.e. Instance Discrimination) baseline by a large margin with all architectures.
- There is a consistent performance increase for the LA method both from overall deeper architectures, and from earlier layers to deeper layers within an architecture.
2.2. Places
The results indicate the strong generalization ability of the visual representations learned via the LA method.
2.3. PASCAL VOC 2007
- Faster R-CNN pipeline is used.
The LA method achieves state-of-the-art unsupervised transfer learning for the PASCAL detection task.
2.4. Analysis
- The LA optimization objective seeks to minimize the distances between vi and Ci while maximizing those between vi and Bi.
The local density of the LA embedding is much higher than that created by the IR method, while the background density is only slightly higher.
The successful examples show that the LA-trained model robustly groups images belonging to the same category regardless of backgrounds and viewpoints. Interestingly, however, the network shows substantial ability to recognize high-level visual context.
- This is even more obvious for the failure cases, where it can be seen that the network coherently groups images according to salient characteristics. Failure is mainly due to the inherently ill-posed nature of the ImageNet category labelling.
LA successfully clusters images with trombones regardless of background, number of trombones, or viewpoint.
References
[2019 ICCV] [Local Aggregation (LA)]
Local Aggregation for Unsupervised Learning of Visual Embeddings
[YouTube] https://www.youtube.com/watch?v=nTHZkf8QYzY
1.2. Unsupervised/Self-Supervised Learning
1993 … 2019 [Local Aggregation (LA)] … 2021 [MoCo v3] [SimSiam] [DINO] [Exemplar-v1, Exemplar-v2] [MICLe] [Barlow Twins] [MoCo-CXR] [W-MSE] [SimSiam+AL] [BYOL+LP] 2022 [BEiT] [BEiT V2]