# Brief Review — Learning Deep Representations by Mutual Information Estimation and Maximization

**Deep InfoMax (DIM) with Global & Local Objectives**

Learning Deep Representations by Mutual Information Estimation and Maximization, Deep InfoMax (DIM), by Microsoft Research, 2019 ICLR, Over 1800 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning, Contrastive Learning, Image Classification

- Self-supervised representation learning is based on **maximizing mutual information between features extracted from multiple views of a shared context.**
- While multiple views could be produced by observing the context from different locations (e.g., camera positions within a scene) or via different modalities (e.g., tactile, auditory, or visual), **an ImageNet image could provide a context from which one produces multiple views by repeatedly applying data augmentation.**
- This is a paper from Prof. Bengio's research group.

# Outline

1. **Deep InfoMax (DIM)**
2. **Results**

# 1. Deep InfoMax (DIM)

- Let *X* and *Y* be the **domain, e.g. images,** and **range of a continuous, e.g. feature vector,** and (almost everywhere) **differentiable parametric function,** *Eψ*: *X* → *Y* with parameters *ψ*.
- The **encoder** should be **trained such that the mutual information is maximized**: Find the set of parameters, *ψ*, such that the **mutual information**, *I*(*X*; *Eψ*(*X*)), is **maximized**. Depending on the end-goal, this maximization can be done over the **complete input,** *X*, or some structured **or “local” subset**.
- To maximize MI, a discriminator is trained to classify whether a pairing of an input with a feature vector is real or fake.

## 1.1. Global DIM: DIM(G)

- In **Deep InfoMax (DIM)** with a **global MI(*X*; *Y*) objective**, both the **high-level feature vector**, *Y*, and the **lower-level *M*×*M* feature map** are passed through a **discriminator** to **get the score**.
- **Fake samples** are drawn by **combining the same feature vector with an *M*×*M* feature map from another image.**
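The pairing described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `make_global_pairs` and the cyclic shift used to pick "another image" are assumptions made here for clarity, and strings stand in for real tensors.

```python
def make_global_pairs(feature_vectors, feature_maps):
    """Pair each global feature vector with a feature map.

    Real pairs use the feature map of the same image; fake pairs use the
    feature map of the next image in the batch (a simple cyclic shift,
    chosen here only for illustration).
    """
    n = len(feature_vectors)
    if n != len(feature_maps) or n < 2:
        raise ValueError("need at least two images, one map per vector")
    real = [(feature_vectors[i], feature_maps[i]) for i in range(n)]
    fake = [(feature_vectors[i], feature_maps[(i + 1) % n]) for i in range(n)]
    return real, fake


# Toy batch: string stand-ins for the vector and M x M map of three images.
ys = ["y0", "y1", "y2"]
maps = ["m0", "m1", "m2"]
real_pairs, fake_pairs = make_global_pairs(ys, maps)
```

Each real and fake pair would then be scored by the discriminator; the MI estimators in the following subsections differ only in how they combine those scores.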

## 1.1.1. Donsker-Varadhan (DV)

- One of the approaches follows **Mutual Information Neural Estimation (MINE)** (Belghazi et al., 2018), which uses a **lower-bound to the MI** based on the **Donsker-Varadhan representation** (**DV**, Donsker & Varadhan, 1983) of the KL-divergence:
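For reference, the DV lower bound from the paper can be written as follows (reconstructed here up to notation), with $\mathbb{J}$ the joint distribution of $(X, Y)$ and $\mathbb{M}$ the product of its marginals:

$$
I(X;Y) := D_{KL}(\mathbb{J}\,\|\,\mathbb{M}) \;\geq\; \hat{I}^{(DV)}_{\omega}(X;Y) := \mathbb{E}_{\mathbb{J}}\left[T_\omega(x,y)\right] - \log \mathbb{E}_{\mathbb{M}}\left[e^{T_\omega(x,y)}\right]
$$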

- where *Tω*: *X* × *Y* → ℝ is a **discriminator** function modeled by a neural network with parameters *ω*.
- At a high level, the encoder *Eψ* is optimized by simultaneously estimating and maximizing *I*(*X*; *Eψ*(*X*)):
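In the paper's notation (reconstructed here, so the exact symbols are approximate), this joint optimization over the discriminator and the encoder is:

$$
(\hat{\omega}, \hat{\psi})_G = \arg\max_{\omega,\psi} \hat{I}_\omega(X; E_\psi(X))
$$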

- where the subscript *G* denotes **“global”.**
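As a rough sketch of how the DV estimate is computed from discriminator scores, the function below takes scores on real (joint) pairs and on fake (product-of-marginals) pairs. The name `dv_lower_bound` and the toy score values are assumptions for illustration; a real implementation would compute the scores with a neural network and backpropagate through them.

```python
import math


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan estimate: E_J[T] - log E_M[exp(T)]."""
    joint_term = sum(joint_scores) / len(joint_scores)
    # log-mean-exp with the max subtracted for numerical stability
    m = max(marginal_scores)
    mean_exp = sum(math.exp(t - m) for t in marginal_scores) / len(marginal_scores)
    return joint_term - (m + math.log(mean_exp))


# Hypothetical discriminator scores: high on real pairs, low on fake pairs.
real_scores = [2.0, 2.0, 2.0]
fake_scores = [0.0, 0.0, 0.0]
estimate = dv_lower_bound(real_scores, fake_scores)
```

With these toy scores the estimate is 2.0; if the discriminator gave the same score to real and fake pairs, the estimate would drop to zero.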

## 1.1.2. Jensen-Shannon Divergence (JSD)

- **Jensen-Shannon MI estimator** (following the formulation of Nowozin et al., 2016):
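Reconstructed from the paper up to notation, the estimator reads:

$$
\hat{I}^{(JSD)}_{\omega,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\left[-\mathrm{sp}\left(-T_\omega(x, E_\psi(x))\right)\right] - \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\left[\mathrm{sp}\left(T_\omega(x', E_\psi(x))\right)\right]
$$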

- where *x* is an input sample, *x*′ is an input sampled from *P̃* = *P*, and sp(*z*) = log(1 + *e*^*z*) is the softplus function.
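A minimal sketch of the JSD-based estimate, using the softplus definition above. The function names and the toy scores are assumptions for illustration; the inputs are discriminator scores on real pairs (*x* with its own feature) and on fake pairs (*x*′ with the feature of *x*).

```python
import math


def softplus(z):
    """sp(z) = log(1 + e^z), computed in a numerically stable form."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_mi_estimate(joint_scores, marginal_scores):
    """E_P[-sp(-T(x, E(x)))] - E_{P x P~}[sp(T(x', E(x)))]."""
    pos = sum(-softplus(-t) for t in joint_scores) / len(joint_scores)
    neg = sum(softplus(t) for t in marginal_scores) / len(marginal_scores)
    return pos - neg


# A discriminator that cannot tell pairs apart vs. one that separates them.
uninformed = jsd_mi_estimate([0.0, 0.0], [0.0, 0.0])
confident = jsd_mi_estimate([10.0, 10.0], [-10.0, -10.0])
```

Because both terms are bounded, this estimate never exceeds zero: it equals −2·log 2 when all scores are zero and approaches zero as real and fake pairs are separated more confidently.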

## 1.1.3. Noise-Contrastive Estimation (NCE)

- **Similar to NCE or InfoNCE in CPC**, this loss can also be used with DIM by maximizing:
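Reconstructed from the paper up to notation, the InfoNCE-based objective is:

$$
\hat{I}^{(infoNCE)}_{\omega,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\left[T_\omega(x, E_\psi(x)) - \mathbb{E}_{\tilde{\mathbb{P}}}\left[\log \sum_{x'} e^{T_\omega(x', E_\psi(x))}\right]\right]
$$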

- It is found that using **InfoNCE often outperforms JSD** on downstream tasks.
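The InfoNCE objective can be sketched from a square matrix of discriminator scores. This is an illustration under assumptions, not the authors' code: the name `info_nce` is made up here, and using the other images in the same batch as the negatives is a common implementation choice rather than something this review states.

```python
import math


def info_nce(scores):
    """scores[i][j] = T(x_j, E(x_i)); the diagonal holds the real pairs."""
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # log-sum-exp over all candidates (the positive and the negatives)
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += row[i] - log_sum_exp
    return total / n


# Uninformative scores vs. scores that single out the matching image.
flat = info_nce([[0.0, 0.0], [0.0, 0.0]])
peaked = info_nce([[10.0, 0.0], [0.0, 10.0]])
```

With flat scores the value is −log *N* for a batch of *N* images; it rises toward zero as the real pair's score dominates its row.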

## 1.2. Local DIM: DIM(L)

- First, **the image is encoded to a feature map** *Cψ*(*x*) that reflects some structural aspect of the data, e.g. spatial locality, and this feature map is further summarized into a **global feature vector.**
- Then, this feature vector is **concatenated with the lower-level feature map at every location**.
- A **score** is produced for **each local-global pair** through an additional function.
- The MI estimator is applied to the global/local pairs, maximizing the average estimated MI:
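Reconstructed from the paper up to notation, with $C_\psi^{(i)}(X)$ denoting the feature map at location $i$ of the *M*×*M* grid, the local objective averages the MI estimate over all $M^2$ locations:

$$
(\hat{\omega}, \hat{\psi})_L = \arg\max_{\omega,\psi} \frac{1}{M^2}\sum_{i=1}^{M^2} \hat{I}_{\omega,\psi}\left(C_\psi^{(i)}(X); E_\psi(X)\right)
$$

Any of the estimators from the previous subsections (DV, JSD, or InfoNCE) can play the role of $\hat{I}_{\omega,\psi}$ here.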

## 1.3. Prior Matching

- **“Real” samples are drawn from a prior** while **“fake” samples from the encoder output** are sent to a discriminator.
- The discriminator is trained to distinguish between (classify) these sets of samples. The encoder is trained to “fool” the discriminator.

- With prior matching, **the encoder can generate features that are close to the prior.**
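The adversarial game above can be sketched as a single value that the discriminator tries to maximize and the encoder tries to minimize, in the style of a standard GAN objective. The function name and the toy probabilities are assumptions for illustration; the inputs are discriminator outputs in (0, 1) on samples from the prior and on encoder outputs.

```python
import math


def prior_matching_value(d_on_prior, d_on_encoded):
    """E_V[log D(y)] + E_P[log(1 - D(E(x)))] for D outputs in (0, 1)."""
    real = sum(math.log(d) for d in d_on_prior) / len(d_on_prior)
    fake = sum(math.log(1.0 - d) for d in d_on_encoded) / len(d_on_encoded)
    return real + fake


# Discriminator fooled by the encoder vs. one that separates the two sets.
confused = prior_matching_value([0.5, 0.5], [0.5, 0.5])
sharp = prior_matching_value([0.9, 0.9], [0.1, 0.1])
```

An encoder whose outputs resemble the prior pushes the discriminator toward the confused case, where the value is −2·log 2.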

## 1.4. Complete Objective

- **All three objectives (global and local MI maximization and prior matching)** can be **used together** as the complete objective for Deep InfoMax (DIM):
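Reconstructed from the paper up to notation, the combined objective weights the global term, the local term, and the prior-matching term with hyperparameters $\alpha$, $\beta$, and $\gamma$. Here $\omega_1$ and $\omega_2$ are the parameters of the global and local discriminators, and $\hat{D}_\phi(\mathbb{V}\,\|\,\mathbb{U}_{\psi,\mathbb{P}})$ is the prior-matching term between the prior $\mathbb{V}$ and the distribution of encoder outputs $\mathbb{U}_{\psi,\mathbb{P}}$:

$$
\arg\max_{\omega_1,\omega_2,\psi}\left(\alpha\,\hat{I}_{\omega_1,\psi}(X; E_\psi(X)) + \frac{\beta}{M^2}\sum_{i=1}^{M^2}\hat{I}_{\omega_2,\psi}\left(C_\psi^{(i)}(X); E_\psi(X)\right)\right) + \arg\min_{\psi}\arg\max_{\phi}\,\gamma\,\hat{D}_\phi(\mathbb{V}\,\|\,\mathbb{U}_{\psi,\mathbb{P}})
$$

Setting $\beta = 0$ leaves only the global MI term, and setting $\alpha = 0$ leaves only the local one; whether these settings correspond exactly to the DIM(G) and DIM(L) variants as evaluated is an assumption here.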

# 2. Results

In general, DIM with the local objective, DIM(L), outperformed all models presented here by a significant margin on all datasets.

DIM(L) is competitive with CPC using InfoNCE.

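This is an early paper on self-supervised learning, and it explores many design choices: global and local objectives, three MI estimators (DV, JSD, and InfoNCE), and prior matching.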

## Reference

[2019 ICLR] [Deep InfoMax (DIM)]

Learning Deep Representations by Mutual Information Estimation and Maximization
