Brief Review — Learning Deep Representations by Mutual Information Estimation and Maximization

Deep InfoMax (DIM) with Global & Local Objectives

5 min read · Dec 27, 2022

Learning Deep Representations by Mutual Information Estimation and Maximization,
Deep InfoMax (DIM), by Microsoft Research
2019 ICLR, Over 1800 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Contrastive Learning, Image Classification

Self-supervised representation learning is based on maximizing mutual information between features extracted from multiple views of a shared context.
Multiple views could be produced by observing a scene from different locations (e.g., camera positions) or via different modalities (e.g., tactile, auditory, or visual). Likewise, an ImageNet image can serve as a context from which multiple views are produced by repeatedly applying data augmentation.
This is a paper from Prof. Bengio’s research group.

Outline

Deep InfoMax (DIM)
Results

1. Deep InfoMax (DIM)

**The base encoder model in the context of image data.**

Let X and Y be the domain (e.g. images) and range (e.g. feature vectors) of a continuous and (almost everywhere) differentiable parametric function, Eψ: X→Y, with parameters ψ.
The encoder should be trained such that the mutual information is maximized: Find the set of parameters, ψ, such that the mutual information, I(X; Eψ(X)), is maximized. Depending on the end-goal, this maximization can be done over the complete input, X, or some structured or “local” subset.
To maximize MI, a discriminator is trained to tell “real” pairs, where the features come from the same input, apart from “fake” pairs, where the features come from different inputs.

1.1. Global DIM: DIM(G)

**Deep InfoMax (DIM) with a global MI(X; Y) objective.**

In the global MI(X; Y) objective, both the high-level feature vector Y and the lower-level M×M feature map are passed through a discriminator to get a score.
Fake samples are drawn by combining the same feature vector with an M×M feature map from another image.
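
The pairing described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: `make_global_pairs` is a hypothetical helper, and drawing the “other image” by rotating the batch by one position is an assumption made for simplicity (any pairing with a different image would do).

```python
def make_global_pairs(feature_maps, global_vectors):
    """Build real and fake (feature map, global vector) pairs for one batch.

    feature_maps[i] and global_vectors[i] are assumed to come from image i.
    """
    n = len(global_vectors)
    if n < 2 or len(feature_maps) != n:
        raise ValueError("need at least two images and matching list lengths")
    # Real pair: feature map and global vector from the same image.
    real = [(feature_maps[i], global_vectors[i]) for i in range(n)]
    # Fake pair: same global vector, feature map taken from the next image
    # in the batch (assumption: a simple rotation stands in for sampling).
    fake = [(feature_maps[(i + 1) % n], global_vectors[i]) for i in range(n)]
    return real, fake


real, fake = make_global_pairs(["map0", "map1", "map2"], ["y0", "y1", "y2"])
print(real[0], fake[0])
```

The discriminator would then be asked to score the `real` pairs high and the `fake` pairs low.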

1.1.1. Donsker-Varadhan (DV)

One of the approaches follows Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018), which uses a lower-bound to the MI based on the Donsker-Varadhan representation (DV, Donsker & Varadhan, 1983) of the KL-divergence:
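
In LaTeX form, as recalled from the DIM paper, with J the joint distribution of (X, Y) and M the product of their marginals (both symbols are taken from the paper and are not defined elsewhere in this review), the bound reads:

```latex
\mathcal{I}(X;Y) := \mathcal{D}_{KL}(\mathbb{J}\,\|\,\mathbb{M})
\;\ge\; \widehat{\mathcal{I}}^{(DV)}_{\omega}(X;Y)
:= \mathbb{E}_{\mathbb{J}}\big[T_{\omega}(x,y)\big]
 - \log \mathbb{E}_{\mathbb{M}}\big[e^{T_{\omega}(x,y)}\big]
```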

where Tω: X×Y→ℝ is a discriminator function modeled by a neural network with parameters ω.
At a high level, Eψ is optimized by simultaneously estimating and maximizing I(X; Eψ(X)):
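
Written in LaTeX, following the paper's notation as best recalled (the hats denote the optimized parameters):

```latex
(\hat{\omega}, \hat{\psi})_G = \arg\max_{\omega,\psi}\;
\widehat{\mathcal{I}}_{\omega}\big(X; E_{\psi}(X)\big)
```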

where the subscript G denotes “global”.
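
As a rough numerical illustration of the DV estimate, here is a minimal sketch in plain Python. It assumes the discriminator scores have already been computed; `dv_lower_bound`, `joint_scores` and `marginal_scores` are hypothetical names, not from the paper or any library.

```python
import math


def dv_lower_bound(joint_scores, marginal_scores):
    """DV estimate: mean of T over real pairs minus log-mean-exp of T over fake pairs."""
    joint_term = sum(joint_scores) / len(joint_scores)
    # Compute log(mean(exp(s))) stably by factoring out the largest score.
    m = max(marginal_scores)
    mean_exp = sum(math.exp(s - m) for s in marginal_scores) / len(marginal_scores)
    return joint_term - (m + math.log(mean_exp))


# Scores on real pairs (samples from the joint) and on fake pairs
# (samples from the product of marginals); the values are made up.
print(dv_lower_bound([2.0, 2.0], [0.0, 0.0]))
```

In training, both the discriminator and the encoder would be updated to push this quantity up.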

1.1.2. Jensen-Shannon Divergence (JSD)

Jensen-Shannon MI estimator (following the formulation of Nowozin et al., 2016):
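
In LaTeX form, as recalled from the DIM paper (P is the data distribution and T_{ψ,ω} is the discriminator applied on top of the encoder; both symbols are taken from the paper):

```latex
\widehat{\mathcal{I}}^{(JSD)}_{\omega,\psi}\big(X; E_{\psi}(X)\big) :=
\mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}\big(-T_{\psi,\omega}(x, E_{\psi}(x))\big)\big]
- \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\big[\mathrm{sp}\big(T_{\psi,\omega}(x', E_{\psi}(x))\big)\big]
```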

where x is an input sample, x’ is an input sampled from P̃ = P, and sp(z) = log(1+e^z) is the softplus function.
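
A minimal sketch of this estimator in plain Python, assuming the discriminator scores on real and fake pairs are already available. The function names are hypothetical, and the softplus is written in a numerically stable form rather than literally as log(1+e^z).

```python
import math


def softplus(z):
    """sp(z) = log(1 + e^z), rearranged to avoid overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_mi_estimate(joint_scores, marginal_scores):
    """JSD-based estimate: E_joint[-sp(-T)] - E_marginal[sp(T)]."""
    pos = sum(-softplus(-t) for t in joint_scores) / len(joint_scores)
    neg = sum(softplus(t) for t in marginal_scores) / len(marginal_scores)
    return pos - neg


# An uninformative discriminator (all scores zero) gives -2*log(2);
# a well-separated one moves the estimate toward zero.
print(jsd_mi_estimate([0.0], [0.0]))
print(jsd_mi_estimate([5.0], [-5.0]))
```

Unlike the DV bound, each term here is bounded, which is one reason this form tends to be easier to optimize.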

1.1.3. Noise-Contrastive Estimation (NCE)

Similar to NCE or InfoNCE in CPC, this loss can also be used with DIM by maximizing:
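
In LaTeX form, as recalled from the DIM paper (same symbols as in the JSD estimator; the inner sum runs over the candidate inputs x'):

```latex
\widehat{\mathcal{I}}^{(infoNCE)}_{\omega,\psi}\big(X; E_{\psi}(X)\big) :=
\mathbb{E}_{\mathbb{P}}\Big[ T_{\psi,\omega}(x, E_{\psi}(x))
- \mathbb{E}_{\tilde{\mathbb{P}}}\Big[ \log \sum_{x'} e^{T_{\psi,\omega}(x', E_{\psi}(x))} \Big] \Big]
```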

It is found that using InfoNCE often outperforms JSD on downstream tasks.
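
A minimal sketch of the InfoNCE-style estimate over one batch, in plain Python. It assumes a precomputed square score matrix where `scores[i][j]` is the discriminator score for input j paired with the global feature of image i, so the diagonal holds the positive pairs. Including the positive pair in the log-sum-exp is also an assumption (a common convention); all names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce_estimate(scores):
    """Mean over rows of (positive score - log-sum-exp over all candidates)."""
    n = len(scores)
    return sum(scores[i][i] - log_sum_exp(scores[i]) for i in range(n)) / n


# With uninformative scores the estimate is -log(N); with well-separated
# scores it approaches 0, its maximum under this convention.
print(info_nce_estimate([[0.0, 0.0], [0.0, 0.0]]))
print(info_nce_estimate([[10.0, -10.0], [-10.0, 10.0]]))
```

This is the same computation as a softmax cross-entropy over the candidates, with the matching image as the target class.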

1.2. Local DIM: DIM(L)

**Maximizing mutual information between local features and global features.**

First, the image is encoded to a feature map Cψ(x) that reflects some structural aspect of the data, e.g. spatial locality, and this feature map is further summarized into a global feature vector.
Then, this feature vector is concatenated with the lower-level feature map at every location.
A score is produced for each local-global pair through an additional function.
The MI estimator is applied to the global/local pairs, maximizing the average estimated MI:
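
In LaTeX form, as recalled from the DIM paper, where C_ψ^{(i)} denotes the local feature at location i of the M×M map and the subscript L denotes “local”:

```latex
(\hat{\omega}, \hat{\psi})_L = \arg\max_{\omega,\psi}\;
\frac{1}{M^2} \sum_{i=1}^{M^2}
\widehat{\mathcal{I}}_{\omega,\psi}\big(C^{(i)}_{\psi}(X); E_{\psi}(X)\big)
```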

1.3. Prior Matching

**Matching the output of the encoder to a prior.**

“Real” samples are drawn from a prior while “fake” samples from the encoder output are sent to a discriminator.
The discriminator is trained to distinguish between (classify) these sets of samples. The encoder is trained to “fool” the discriminator.

With prior matching, the encoder is pushed to produce features whose distribution is close to the prior.
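
The adversarial game described above has the usual GAN-style form. A LaTeX rendering, as recalled from the DIM paper, with V the prior, U_{ψ,P} the distribution of encoder outputs, and D_φ the prior-matching discriminator (all three symbols are taken from the paper, not from this review):

```latex
(\hat{\omega}, \hat{\psi})_P = \arg\min_{\psi} \arg\max_{\phi}\;
\widehat{\mathcal{D}}_{\phi}(\mathbb{V}\,\|\,\mathbb{U}_{\psi,\mathbb{P}})
= \mathbb{E}_{\mathbb{V}}\big[\log D_{\phi}(y)\big]
+ \mathbb{E}_{\mathbb{P}}\big[\log\big(1 - D_{\phi}(E_{\psi}(x))\big)\big]
```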

1.4. Complete Objective

All three objectives — global and local MI maximization and prior matching — can be used together, as the complete objective for Deep InfoMax (DIM):
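
In LaTeX form, as recalled from the DIM paper, with α, β and γ as weighting hyperparameters for the global, local and prior-matching terms, and ω1, ω2 and φ as the parameters of the three discriminators (symbols taken from the paper):

```latex
\arg\max_{\omega_1,\omega_2,\psi} \Big(
\alpha\, \widehat{\mathcal{I}}_{\omega_1,\psi}\big(X; E_{\psi}(X)\big)
+ \frac{\beta}{M^2} \sum_{i=1}^{M^2}
\widehat{\mathcal{I}}_{\omega_2,\psi}\big(X^{(i)}; E_{\psi}(X)\big) \Big)
+ \arg\min_{\psi} \arg\max_{\phi}\;
\gamma\, \widehat{\mathcal{D}}_{\phi}(\mathbb{V}\,\|\,\mathbb{U}_{\psi,\mathbb{P}})
```

Setting β = 0 leaves the global-only variant, DIM(G), and setting α = 0 leaves the local-only variant, DIM(L).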

2. Results

**Classification accuracy (top 1) results on CIFAR10 and CIFAR100.**

**Classification accuracy (top 1) results on Tiny ImageNet and STL-10.**

In general, DIM with the local objective, DIM(L), outperformed all models presented here by a significant margin on all datasets.
Among DV, JSD & InfoNCE, InfoNCE tends to perform best.