# Review — DrLIM: Dimensionality Reduction by Learning an Invariant Mapping

## DrLIM: Contrastive Learning for Dimensionality Reduction

In this story, **Dimensionality Reduction by Learning an Invariant Mapping** (DrLIM), by New York University, is reviewed. This is a paper by Prof. LeCun. The method was originally proposed for face recognition in 2005 CVPR. In this paper:

- A method called **Dimensionality Reduction by Learning an Invariant Mapping (DrLIM)** is used to **map the data evenly to the output manifold.**
- The learning **relies solely on neighborhood relationships** and does not require any distance measure in the input space.

This is a paper in **2006 CVPR** with over **2800 citations**. The paper is an early example of **contrastive learning**, and contrastive learning is a key concept to know for **self-supervised learning**.

# Outline

1. **The Contrastive Loss Function in General Form**
2. **The Contrastive Loss Function in Exact Form**
3. **Spring Model Analogy**
4. **Network Architecture for GW**
5. **Experimental Results**

# 1. The Contrastive Loss Function in General Form

- Consider the set $I$ of high-dimensional training vectors $\vec{X}_i$. For each $\vec{X}_i$, there is a set $S_{\vec{X}_i}$ of training vectors that are deemed similar to $\vec{X}_i$.
- Let $Y$ be a binary label assigned to a pair $(\vec{X}_1, \vec{X}_2)$: $Y = 0$ if $\vec{X}_1$ and $\vec{X}_2$ are deemed similar, and $Y = 1$ if they are deemed dissimilar.
- Define **the parameterized distance function to be learned**, $D_W$, between $\vec{X}_1$ and $\vec{X}_2$ as the Euclidean distance between the outputs of $G_W$:

$$D_W(\vec{X}_1, \vec{X}_2) = \lVert G_W(\vec{X}_1) - G_W(\vec{X}_2) \rVert_2$$

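- where $G_W$ is a convolutional neural network (CNN) for MNIST and a **2-layer fully connected network for airplane images in the NORB dataset**.
- Then the loss function in its **most general form** is:

$$\mathcal{L}(W) = \sum_{i=1}^{P} L(W, (Y, \vec{X}_1, \vec{X}_2)^i)$$

$$L(W, (Y, \vec{X}_1, \vec{X}_2)^i) = (1 - Y)\, L_S(D_W^i) + Y\, L_D(D_W^i)$$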

- where $(Y, \vec{X}_1, \vec{X}_2)^i$ is the $i$-th labeled sample pair, $L_S$ is the **partial loss function** for a pair of **similar** points, $L_D$ is the **partial loss function** for a pair of **dissimilar** points, and $P$ is the **number of training pairs** (which may be as large as the square of the number of samples).

$L_S$ and $L_D$ must be designed such that minimizing the total loss with respect to $W$ results in **low values of $D_W$ for similar pairs** and **high values of $D_W$ for dissimilar pairs**.
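As a rough illustration of the general form, the sketch below sums a similar-pair term and a dissimilar-pair term over labeled pairs. The linear map `W`, the toy pairs, and the placeholder partial losses (`d` and `-d`) are assumptions chosen only to make the example small; they are not the losses used in the paper.

```python
import numpy as np

def distance(g, x1, x2):
    """D_W: Euclidean distance between the mapped points g(x1) and g(x2)."""
    return float(np.linalg.norm(g(x1) - g(x2)))

def general_loss(g, pairs, loss_similar, loss_dissimilar):
    """Sum over labeled pairs (y, x1, x2); y = 0 means similar, y = 1 dissimilar."""
    total = 0.0
    for y, x1, x2 in pairs:
        d = distance(g, x1, x2)
        total += (1 - y) * loss_similar(d) + y * loss_dissimilar(d)
    return total

# Hypothetical stand-in for G_W: a fixed linear map from 4-D inputs to 2-D outputs.
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
g = lambda x: W @ x

pairs = [
    (0, np.array([0.0, 0.0, 1.0, 1.0]), np.array([3.0, 4.0, 1.0, 1.0])),  # similar, D_W = 5
    (1, np.array([0.0, 0.0, 0.0, 0.0]), np.array([1.0, 0.0, 9.0, 9.0])),  # dissimilar, D_W = 1
]

# Placeholder partial losses: increasing in D_W for similar pairs,
# decreasing in D_W for dissimilar pairs.
loss = general_loss(g, pairs, loss_similar=lambda d: d, loss_dissimilar=lambda d: -d)
print(loss)  # 5.0 - 1.0 = 4.0
```

Any pair of partial losses with the same monotonic behavior would fit this template; the exact choice is what the next section specifies.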

# 2. The Contrastive Loss Function in Exact Form

## 2.1. Exact Loss Function

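- The **exact loss function** is:

$$L(W, Y, \vec{X}_1, \vec{X}_2) = (1 - Y)\,\frac{1}{2}(D_W)^2 + Y\,\frac{1}{2}\{\max(0, m - D_W)\}^2$$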

- where $m > 0$ is a **margin**. The margin defines a radius around $G_W(\vec{X})$: dissimilar pairs contribute to the loss only if their distance is within this radius.

The contrastive term involving dissimilar pairs, $L_D$, is crucial. Simply minimizing $D_W(\vec{X}_1, \vec{X}_2)$ over the set of all similar pairs will usually **lead to a collapsed solution**, since $G_W$ could then map every input to the same point.

- Most energy-based models require the use of an explicit **contrastive term** in the loss function.
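The exact loss described in this section can be sketched in a few lines of NumPy. The function name `contrastive_loss` and the margin value of 1.0 are assumptions for illustration; the source does not state the margin used.

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Per-pair contrastive loss.

    d: distances D_W (non-negative), y: labels (0 = similar, 1 = dissimilar).
    """
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    similar_term = 0.5 * d ** 2                                # grows with distance
    dissimilar_term = 0.5 * np.maximum(0.0, margin - d) ** 2   # zero beyond the margin
    return (1.0 - y) * similar_term + y * dissimilar_term

d = np.array([0.5, 0.5, 2.0])
y = np.array([0, 1, 1])
losses = contrastive_loss(d, y, margin=1.0)
print(losses)  # [0.125 0.125 0.   ]
```

Note that if every output collapsed to one point (all distances zero), the similar term alone would be zero, while the dissimilar term would be at its maximum of `0.5 * margin ** 2` per pair. That is the role of the contrastive term.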

## 2.2. Training

- To train, prior knowledge is used to pair the samples as similar samples and dissimilar samples.
- **All the pairs are combined to form the labeled training set.**
- The above overall loss function is used to update $W$ for each sample pair.
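The per-pair update can be sketched with a deliberately simple model. Here $G_W$ is assumed to be a linear map `W @ x` so that the gradient can be written by hand; the paper uses neural networks instead. The toy data, learning rate, number of epochs, and margin are all illustrative assumptions.

```python
import numpy as np

def pair_loss_and_grad(W, x1, x2, y, margin=1.0, eps=1e-12):
    """Contrastive loss and its gradient w.r.t. W for a linear map G_W(x) = W @ x."""
    delta = x1 - x2
    diff = W @ delta                 # G_W(x1) - G_W(x2)
    d = np.linalg.norm(diff)         # D_W
    outer = np.outer(diff, delta)    # equals D_W * dD_W/dW
    if y == 0:                       # similar pair: pull together
        return 0.5 * d ** 2, outer
    gap = max(0.0, margin - d)       # dissimilar pair: push apart only inside the margin
    return 0.5 * gap ** 2, -(gap / (d + eps)) * outer

def total_loss(W, pairs, margin=1.0):
    return sum(pair_loss_and_grad(W, x1, x2, y, margin)[0] for x1, x2, y in pairs)

def train(W, pairs, lr=0.05, epochs=200, margin=1.0, seed=0):
    """Plain stochastic gradient descent, one update per labeled pair."""
    rng = np.random.default_rng(seed)
    W = W.copy()
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):
            x1, x2, y = pairs[idx]
            _, grad = pair_loss_and_grad(W, x1, x2, y, margin)
            W -= lr * grad
    return W

# Toy data: two groups that differ in the first two coordinates.
a1, a2 = np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.2])
b1, b2 = np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.2])
pairs = [(a1, a2, 0), (b1, b2, 0),
         (a1, b1, 1), (a1, b2, 1), (a2, b1, 1), (a2, b2, 1)]

W0 = 0.1 * np.random.default_rng(0).standard_normal((2, 3))
W = train(W0, pairs)
print(total_loss(W0, pairs), total_loss(W, pairs))  # loss before and after training
```

With a real network, the hand-written gradient would be replaced by backpropagation through both branches, but the loop structure (one labeled pair, one update) stays the same.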

# 3. Spring Model Analogy

- The outputs of $G_W$ can be thought of as masses **attracting** and **repelling** each other with springs. Consider the equation of a spring:

$$F = -KX$$

- where $F$ is the force, $K$ is the spring constant, and $X$ is the displacement of the spring from its rest length.

## 3.1. Attracting Force

- A spring is attract-only if its rest length is equal to zero.
- **Attractive force is applied to similar pairs (blue & black points).**
- The **loss function** $L_S(W, \vec{X}_1, \vec{X}_2)$ associated with **similar pairs** is:

$$L_S(W, \vec{X}_1, \vec{X}_2) = \frac{1}{2}(D_W)^2$$

## 3.2. Repelling Force

- A spring is said to be $m$-repulse-only if its rest length is equal to $m$.
- Thus two points that are connected with an $m$-repulse-only spring will be **pushed apart if $X$ is less than $m$**.
- However, this spring has **a special property: if the spring is stretched by a length $X > m$**, then **no attractive force brings it back** to rest length.
- The **partial loss function** $L_D$ is:

$$L_D(W, \vec{X}_1, \vec{X}_2) = \frac{1}{2}(\max\{0, m - D_W\})^2$$

- The force is maximum when $D_W = 0$ and absent when $D_W = m$.
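The spring behavior can be read off from the gradients of the two partial losses. The following is a short derivation sketch that applies the chain rule to $L_S = \frac{1}{2}(D_W)^2$ and $L_D = \frac{1}{2}(\max\{0, m - D_W\})^2$:

$$\frac{\partial L_S}{\partial W} = D_W \frac{\partial D_W}{\partial W}$$

$$\frac{\partial L_D}{\partial W} = \begin{cases} -(m - D_W)\,\dfrac{\partial D_W}{\partial W}, & D_W < m \\ 0, & D_W \ge m \end{cases}$$

Gradient descent moves against these gradients. The similar-pair term therefore always shrinks $D_W$, with a pull proportional to $D_W$ itself, like a spring with zero rest length. The dissimilar-pair term grows $D_W$ with a push proportional to $m - D_W$, which is largest at $D_W = 0$ and vanishes once $D_W \ge m$.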

## 3.3. Equilibrium

Each point is connected by attract-only springs to similar points, and by $m$-repulse-only springs to dissimilar points.

# 4. Network Architecture for GW

- First of all, a **Siamese network** is used: two copies of $G_W$ that share the same parameters $W$ process the two inputs of a pair.

## 4.1. Network Architecture for MNIST

- To be brief, there are **2 convolutions**. In between, there is **a subsampling layer**. At the end, there is **a fully connected layer**.
- The network is similar to LeNet.

## 4.2. Network Architecture for Airplane Images in NORB Dataset

- A **2-layer fully connected network** is used.
- The number of **hidden** and **output units** used was **20** and **3** respectively.
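A minimal sketch of such a network, used in a Siamese fashion, might look as follows. Only the 20 hidden and 3 output units come from the text; the class name `TwoLayerNet`, the input size of 64, the `tanh` activation, and the random initialization are assumptions for illustration.

```python
import numpy as np

class TwoLayerNet:
    """Two fully connected layers: input -> 20 hidden units -> 3 outputs."""

    def __init__(self, in_dim, hidden=20, out_dim=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((hidden, in_dim)) / np.sqrt(in_dim)
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((out_dim, hidden)) / np.sqrt(hidden)
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        h = np.tanh(self.W1 @ x + self.b1)   # tanh is an assumed activation
        return self.W2 @ h + self.b2

def siamese_distance(net, x1, x2):
    """Apply the same network (shared parameters) to both inputs of a pair."""
    return float(np.linalg.norm(net(x1) - net(x2)))

in_dim = 64                                   # hypothetical flattened image size
net = TwoLayerNet(in_dim)
rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal(in_dim), rng.standard_normal(in_dim)
print(net(x1).shape)                          # (3,)
d = siamese_distance(net, x1, x2)
```

Because both branches call the same `net` object, there is only one set of parameters to update, which is what makes the architecture Siamese.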

# 5. Experimental Results

## 5.1. MNIST

- The training set is built from 3000 images of the handwritten **digit 4** and 3000 images of the handwritten **digit 9** chosen randomly from the MNIST dataset.
- Approximately 1000 images of each digit comprised the test set.
- Each sample $\vec{X}_i$ was paired with its 5 nearest neighbors.
- All other possible pairs were labeled dissimilar, producing **30,000 similar pairs** and on the order of **18 million dissimilar pairs**.
- In the mapping of the test set to a 2D manifold, **the lighter-colored blue dots** are **9’s** and **the darker-colored red dots** are **4’s**.

The overall organization is primarily determined by the slant angle of the samples. The samples are spread rather uniformly in the populated region.
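The pairing step and the pair counts quoted above can be checked with a small sketch. The function `knn_similar_pairs` and the synthetic data are assumptions for illustration, and nearest neighbors are taken under Euclidean distance in input space. Pairs are counted as directed `(i, j)` entries, which is the convention under which 6000 samples with 5 neighbors each give 30,000 pairs.

```python
import numpy as np

def knn_similar_pairs(X, k=5):
    """Return directed (i, j) pairs where j is one of the k nearest neighbors of i."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(dist2, np.inf)                     # a sample is not its own neighbor
    nn = np.argsort(dist2, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(X)) for j in nn[i]]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))    # small synthetic stand-in for image vectors
pairs = knn_similar_pairs(X, k=5)
print(len(pairs))                     # 200 * 5 = 1000

n = 6000                              # 3000 fours + 3000 nines
print(n * 5)                          # 30000 similar pairs
print(n * (n - 1) // 2)               # 17997000 possible pairs, i.e. about 18 million
```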

## 5.2. MNIST Distorted by Adding Samples that have been Horizontally Translated

## 5.2.1. DrLIM

- In the distorted set, 3000 images of 4’s and 3000 images of 9’s are **horizontally translated by -6, -3, 3, and 6 pixels** and combined with the originals, producing a total of 30,000 samples.
- The 2000 samples in the test set were distorted in the same way.

The output points are **clustered according to the translated position** of the input sample. Within each cluster, however, the samples are well organized and evenly distributed.
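The construction of the distorted set can be sketched as below. Filling the vacated columns with zeros and treating positive shifts as moves to the right are assumptions; the source only gives the shift amounts.

```python
import numpy as np

def translate_horizontally(img, shift):
    """Shift a 2-D image by `shift` pixels along the x axis, filling with zeros."""
    out = np.zeros_like(img)
    if shift > 0:
        out[:, shift:] = img[:, :-shift]     # move content to the right
    elif shift < 0:
        out[:, :shift] = img[:, -shift:]     # move content to the left
    else:
        out[:] = img
    return out

def augment_with_translations(images, shifts=(-6, -3, 3, 6)):
    """Return the originals followed by one translated copy of the set per shift."""
    shifted = [np.stack([translate_horizontally(im, s) for im in images])
               for s in shifts]
    return np.concatenate([images] + shifted, axis=0)

images = np.random.default_rng(0).random((10, 28, 28))   # stand-in for MNIST digits
augmented = augment_with_translations(images)
print(augmented.shape)   # (50, 28, 28)
```

Each image yields 5 samples (the original plus 4 translations), so 6000 training images give the 30,000 samples mentioned above.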

## 5.2.2. LLE

- For comparison, the LLE algorithm was used.

There is **no global coherence** in the embedding.

## 5.2.3. DrLIM Considering Translation

- Each sample is paired with (a) its **5 nearest neighbors**, (b) its **4 translations**, and (c) the **4 translations of each of its 5 nearest neighbors**.
- As desired, there is no organization on the basis of translation; in fact, translated versions of a given character are all **tightly packed in small regions on the manifold.**

Similar characters are **mapped to nearby areas**, regardless of their shift.
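One way to picture the pairing rules (a), (b), and (c) is to identify each sample by a `(base image, shift)` index. The sketch below is only an illustration: it assumes the nearest neighbors are given per base image and that a neighbor at the same shift counts as rule (a). The function `similar_set` and the neighbor indices are hypothetical.

```python
def similar_set(base, shift, neighbors, n_shifts=5):
    """Similar samples of (base, shift) under rules (a), (b) and (c).

    Samples are identified as (base image index, shift index); `neighbors`
    maps a base index to the indices of its nearest neighbors (assumed given).
    """
    group = [base] + list(neighbors[base])
    everyone = {(b, t) for b in group for t in range(n_shifts)}
    return everyone - {(base, shift)}     # a sample is not paired with itself

neighbors = {0: [3, 7, 8, 11, 42]}        # hypothetical 5 nearest neighbors of image 0
s = similar_set(0, 2, neighbors)
print(len(s))                             # 5 + 4 + 5 * 4 = 29
```

Under these assumptions, each sample has 29 similar partners, which is how the prior knowledge about translation enters the training set.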

## 5.3. Airplane Images in NORB Dataset

Clearly, the 3D embedding by LLE is **NOT invariant to lighting**, and the organization of azimuth and elevation does not reflect the real topology of the neighborhood graph.

- The manifold learned by DrLIM is **roughly cylindrical** with a systematic organization: **the camera azimuth varies along the circumference** in the viewing half-sphere, and **the camera elevation varies along the height** in the viewing sphere.
- The mapping is **invariant to the lighting condition**, thanks to the prior knowledge built into the neighborhood relationships.

I read this paper because I would like to learn about **contrastive learning**.

## Reference

[2006 CVPR] [DrLIM] Dimensionality Reduction by Learning an Invariant Mapping

## Data Visualization

**2002** [SNE] **2006** [Autoencoder] [DrLIM] **2007** [UNI-SNE] **2008** [t-SNE]