# Review — Learning Representations by Maximizing Mutual Information Across Views

## Augmented Multiscale DIM (AMDIM), Extends **Deep InfoMax (DIM)**

Learning Representations by Maximizing Mutual Information Across Views, AMDIM, by Microsoft Research. 2019 NeurIPS, Over 900 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning, Image Classification

- Self-supervised learning is proposed based on **maximizing mutual information between features extracted from multiple views of a shared context**, where the multiple views can be different locations (e.g., camera positions within a scene), different modalities (e.g., tactile, auditory, or visual), or different data augmentations of the same input.
- This work further **extends Deep InfoMax (DIM)**.

# Outline

1. **Augmented Multiscale DIM (AMDIM)**
2. **Results**

# 1. Augmented Multiscale DIM (AMDIM)

## 1.1. Local DIM

- **AMDIM extends Local DIM.**
- Local DIM **maximizes mutual information** between **global features** *f*₁(*x*), produced by **a convolutional encoder** *f*, and **local features** {*f*₇(*x*)ᵢⱼ : ∀*i*, *j*}, produced by an **intermediate layer in** *f*.

AMDIM refers to the **features that encode the data to condition on (global features)** as **antecedent features**, and to the **features to be predicted (local features)** as **consequent features**.

## 1.2. **Noise Contrastive Estimation (NCE)**

- Local DIM tried several optimization approaches: DV (Donsker-Varadhan), JSD (Jensen-Shannon divergence) and NCE.
- **The best results with Local DIM** were obtained using a **mutual information bound based on Noise Contrastive Estimation (NCE)**, also known as **InfoNCE in CPC**. The **NCE lower bound on** *I*(*f*₁(*x*), *f*₇(*x*)ᵢⱼ) is maximized by minimizing the following loss:
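
Written out in LaTeX, and assuming the standard InfoNCE form that the text describes, the quantity being minimized can be sketched as an expectation of the per-pair loss over positive pairs and negative sets. The symbols follow the definitions in this section; the exact typesetting is a reconstruction, not a copy of the paper's figure.

```latex
\[
\mathbb{E}_{(f_1(x),\, f_7(x)_{ij})}
\Big[\,
  \mathbb{E}_{N_7}
  \big[\,
    \mathcal{L}_\Phi\big(f_1(x),\, f_7(x)_{ij},\, N_7\big)
  \big]
\Big]
\]
```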

- where the **positive sample pair** (*f*₁(*x*), *f*₇(*x*)ᵢⱼ) is drawn from the joint distribution *p*(*f*₁(*x*), *f*₇(*x*)ᵢⱼ). *N*₇ denotes **a set of negative samples**, comprising many “distractor” consequent features drawn independently from the marginal distribution *p*(*f*₇(*x*)ᵢⱼ).
- The **loss** *L*Φ is a **standard log-softmax**, where the normalization is over **a large set of matching scores** *Φ*(*f*₁, *f*₇):
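
The log-softmax described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the usual InfoNCE form, i.e. minus the log of exp(Φ(f₁, f₇)) divided by the sum of exp(Φ(f₁, f̃₇)) over the positive and all negatives in N₇. The helper names `dot` and `nce_loss`, the plain dot-product score, and the toy vectors are hypothetical and chosen only for the example.

```python
import math


def dot(u, v):
    """Plain dot product, used here as a stand-in matching score (assumption)."""
    return sum(a * b for a, b in zip(u, v))


def nce_loss(antecedent, positive, negatives, score=dot):
    """Negative log-softmax of the positive score among positive + negatives."""
    # Score the antecedent against the true consequent and every distractor.
    scores = [score(antecedent, positive)]
    scores += [score(antecedent, neg) for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[0]


# Toy example: one antecedent, one positive and two distractors.
f1 = [1.0, 0.0]
pos = [2.0, 0.0]
negs = [[0.0, 1.0], [-1.0, 0.0]]
loss = nce_loss(f1, pos, negs)
```

The loss shrinks toward zero as the positive score grows relative to the distractor scores, and it is exactly zero when the negative set is empty.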

## 1.3. Efficient NCE Computation

- The MI bound is computed for many positive sample pairs using **large negative sample sets, e.g. |*N*₇| >> 10000**. This is kept efficient by using a **simple dot product** for the **matching score**:
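
In LaTeX, a dot-product matching score of the kind described here can be written as follows. This is a reconstruction from the surrounding definitions, with φ₁ and φ₇ denoting the two embedding functions applied to the antecedent and consequent features.

```latex
\[
\Phi\big(f_1(x),\, f_7(x)_{ij}\big)
  = \phi_1\big(f_1(x)\big)^{\top}\, \phi_7\big(f_7(x)_{ij}\big)
\]
```

Because the score is a dot product, all pairwise scores in a batch reduce to a single matrix multiplication between the embedded antecedents and the embedded consequents, which is what makes very large negative sets affordable.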

- where the **functions** *φ*₁/*φ*₇ non-linearly transform their inputs to some other vector space.
- The above equations are essentially the **contrastive learning** objective commonly used nowadays, such as in SimCLR.
- Some **tricks** are used to **mitigate occasional instability** in the NCE cost.

- The **first trick** is to add a **weighted regularization term** that **penalizes the squared matching scores**.
- The **second trick** is to apply a **soft clipping** non-linearity to the scores.
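
The two tricks can be sketched together as below. This is a hedged illustration, not the authors' implementation: the tanh-based form of the soft clipping, the averaging of the squared scores, and the constants `clip=10.0` and `reg_weight=0.05` are all assumptions made for the example, and the function names are hypothetical.

```python
import math


def soft_clip(score, clip=10.0):
    """Smoothly squash a score into (-clip, clip); tanh form is an assumption."""
    return clip * math.tanh(score / clip)


def stabilized_nce(raw_scores, positive_index=0, reg_weight=0.05, clip=10.0):
    """NCE log-softmax on soft-clipped scores plus a squared-score penalty."""
    # Trick 1: weighted penalty on the squared raw matching scores.
    penalty = reg_weight * sum(s * s for s in raw_scores) / len(raw_scores)
    # Trick 2: soft clipping before the softmax.
    clipped = [soft_clip(s, clip) for s in raw_scores]
    top = max(clipped)
    log_norm = top + math.log(sum(math.exp(s - top) for s in clipped))
    nce = log_norm - clipped[positive_index]
    return nce + penalty


# Toy scores: the positive is at index 0, followed by three distractors.
example_scores = [3.0, 0.5, -1.0, 40.0]
example_loss = stabilized_nce(example_scores)
```

Soft clipping leaves small scores almost unchanged while bounding extreme ones, so a single outlier distractor (40.0 above) cannot dominate the normalization.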

## 1.4. Data Augmentation

- AMDIM extends Local DIM by **maximizing mutual information between features from augmented views of each input**.
- The **NCE** loss is **rewritten** to **include prediction across data augmentation**:
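
Assuming the same NCE form as in Section 1.2, with the antecedent taken from one augmented view and the consequent from the other, the rewritten loss can be sketched in LaTeX as below. Here x¹ and x² denote two independently augmented views of the same input x; the typesetting is a reconstruction from the surrounding definitions.

```latex
\[
\mathbb{E}_{(f_1(x^1),\, f_7(x^2)_{ij})}
\Big[\,
  \mathbb{E}_{N_7}
  \big[\,
    \mathcal{L}_\Phi\big(f_1(x^1),\, f_7(x^2)_{ij},\, N_7\big)
  \big]
\Big]
\]
```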

- where the **sampled augmented images** are *x*¹ ~ *A*(*x*) and *x*² ~ *A*(*x*), and *A* is a stochastic data augmentation: random resized crop, random jitter in color space, random conversion to grayscale, and random horizontal flip.
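
Drawing two views from a stochastic augmentation can be illustrated with a toy, dependency-free sketch. Everything here is an assumption for illustration: the image is a nested list of RGB tuples in [0, 1], the crop is a fixed-size crop without resizing, color jitter is reduced to a brightness scale, and the probabilities and the function name `augment` are hypothetical. A real pipeline would normally use an image library instead.

```python
import random


def augment(image, rng, crop=3):
    """Toy stochastic augmentation A(x) on a nested list of (r, g, b) pixels."""
    height, width = len(image), len(image[0])
    # Random crop (stand-in for a random resized crop; no resizing here).
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    view = [row[left:left + crop] for row in image[top:top + crop]]
    # Color jitter, reduced to a random brightness scale clamped to [0, 1].
    scale = rng.uniform(0.8, 1.2)
    view = [[tuple(min(1.0, c * scale) for c in px) for px in row] for row in view]
    # Random conversion to grayscale (channel average copied to all channels).
    if rng.random() < 0.25:
        view = [[(sum(px) / 3,) * 3 for px in row] for row in view]
    # Random horizontal flip.
    if rng.random() < 0.5:
        view = [row[::-1] for row in view]
    return view


rng = random.Random(0)
image = [[(r / 4, c / 4, 0.5) for c in range(5)] for r in range(5)]
x1 = augment(image, rng)  # first view,  x1 ~ A(x)
x2 = augment(image, rng)  # second view, x2 ~ A(x)
```

Both views come from the same input, so features of `x1` and `x2` form a positive pair, while features from other images serve as negatives.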

## 1.5. Multiscale Mutual Information

- AMDIM **further extends Local DIM** by maximizing mutual information **across multiple feature scales**.
- **A family of *n*-to-*m* infomax costs** is defined:
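
Following the same pattern as the cross-view NCE loss, an *n*-to-*m* cost can be sketched in LaTeX as below. This is a reconstruction under the assumption that the antecedent is a feature from layer *n* of one view at position *ij*, the consequent is a feature from layer *m* of the other view at position *kl*, and the negatives N_m are drawn from layer *m*.

```latex
\[
\mathbb{E}_{(f_n(x^1)_{ij},\, f_m(x^2)_{kl})}
\Big[\,
  \mathbb{E}_{N_m}
  \big[\,
    \mathcal{L}_\Phi\big(f_n(x^1)_{ij},\, f_m(x^2)_{kl},\, N_m\big)
  \big]
\Big]
\]
```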

- where *n* and *m* identify the layers the features come from, named by the spatial size of the feature map (e.g., *f*₇ is the 7×7 feature map).
- **In AMDIM, mutual information is maximized from 1-to-5, 1-to-7, and 5-to-5.**

## 1.6. Encoder

- The encoder is the **standard ResNet**.
- The models are trained using 4–8 standard Tesla V100 GPUs per model.

## 1.7. Mixture-Based Representations

- AMDIM is extended to use mixture-based features.
- For each **antecedent feature** *f*₁, a set of **mixture features** {*f*¹₁, …, *f*ᵏ₁} is computed, where *k* is the number of mixture components.
- Here {*f*¹₁, …, *f*ᵏ₁} = *m*ₖ(*f*₁), where *m*ₖ(·) is a fully-connected network with a single ReLU hidden layer.
- When using mixture features, the following objective is maximized:

- where *αH*(*q*) is the entropy maximization term, and **the optimal distribution** *q* is used as follows:

- where *τ* is a temperature parameter, originating from Distillation, that controls the entropy of *q*.
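
The interplay between the mixture weights *q*, the entropy term *αH*(*q*) and the temperature *τ* can be sketched as follows. This is an illustrative sketch under stated assumptions: each mixture component is assumed to have a scalar score, the objective is taken to be the *q*-weighted sum of scores plus *α* times the entropy of *q*, and *q* is taken to be a softmax of *τ* times the scores. The function names and the toy scores are hypothetical.

```python
import math


def mixture_weights(scores, tau=1.0):
    """Softmax over component scores; tau scales the scores (assumed form)."""
    top = max(scores)
    exps = [math.exp(tau * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(q):
    """Shannon entropy H(q) in nats."""
    return -sum(p * math.log(p) for p in q if p > 0.0)


def mixture_objective(scores, q, alpha):
    """q-weighted sum of component scores plus the entropy bonus alpha * H(q)."""
    return sum(p * s for p, s in zip(q, scores)) + alpha * entropy(q)


# Toy scores for k = 3 mixture components.
component_scores = [1.0, 2.0, 3.0]
alpha = 0.5
q_opt = mixture_weights(component_scores, tau=1.0 / alpha)
q_uniform = [1.0 / 3.0] * 3
best = mixture_objective(component_scores, q_opt, alpha)
baseline = mixture_objective(component_scores, q_uniform, alpha)
```

For an entropy-regularized objective of this form, the maximizing *q* is a softmax of the scores divided by *α*, which is why the sketch sets `tau = 1 / alpha`. A larger `tau` concentrates *q* on the best component and lowers its entropy, while `tau = 0` gives a uniform *q*.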

# 2. Results

**(a)** AMDIM large models are trained for 150 epochs on 8 NVIDIA Tesla V100 GPUs. The small model is trained using a shorter 50-epoch schedule; it achieves 62.7% accuracy in 2 days on 4 GPUs.

- (a): AMDIM **outperforms** prior and concurrent methods by a **large margin**.
- (b): AMDIM features performed **on par with classic fully-supervised models**.
- (c): The **strongest** results used the **Fast AutoAugment** augmentation policy.

The techniques above, such as the NCE calculation and the generation of multiple augmented views, are now commonly used in self-supervised learning.

## Reference

[2019 NeurIPS] [AMDIM]

Learning Representations by Maximizing Mutual Information Across Views
