Brief Review — Learning Deep Representations by Mutual Information Estimation and Maximization

Deep InfoMax (DIM) with Global & Local Objectives

Sik-Ho Tsang
Dec 27, 2022


Learning Deep Representations by Mutual Information Estimation and Maximization,
Deep InfoMax (DIM), by Microsoft Research
2019 ICLR, Over 1800 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Contrastive Learning, Image Classification

  • Self-supervised representation learning is based on maximizing mutual information between features extracted from multiple views of a shared context.
  • Multiple views could be produced by observing a scene from different locations (e.g., camera positions) or via different modalities (e.g., tactile, auditory, or visual). Alternatively, a single ImageNet image could provide a context from which multiple views are produced by repeatedly applying data augmentation.
  • This is a paper from Prof. Bengio's research group.


  1. Deep InfoMax (DIM)
  2. Results

1. Deep InfoMax (DIM)

The base encoder model in the context of image data.
  • Let X and Y be the domain (e.g., images) and range (e.g., feature vectors) of a continuous and (almost everywhere) differentiable parametric function E_ψ: X → Y with parameters ψ.
  • The encoder should be trained such that the mutual information is maximized: find the set of parameters ψ such that the mutual information I(X; E_ψ(X)) is maximized. Depending on the end goal, this maximization can be done over the complete input X or over some structured or “local” subset.
  • To maximize MI, a discriminator is trained to classify whether an input/feature-vector pair is real (both parts come from the same image) or fake (the parts come from different images).

1.1. Global DIM: DIM(G)

Deep InfoMax (DIM) with a global MI(X; Y) objective.
  • The above shows Deep InfoMax (DIM) with a global MI(X; Y) objective.
  • Here, both the high-level feature vector Y and the lower-level M×M feature map are passed through a discriminator to get the score.
  • Fake samples are drawn by combining the same feature vector with an M×M feature map from another image.
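
To make the real/fake pairing concrete, here is a minimal NumPy sketch. The function name `make_global_pairs`, the array shapes, and the use of a random roll to pick "another image" are my own assumptions for illustration, not the authors' code.

```python
import numpy as np

def make_global_pairs(feature_maps, global_vecs, rng):
    """Build real and fake discriminator inputs for one batch.

    feature_maps: (B, M, M, C) lower-level feature maps, one per image.
    global_vecs:  (B, D) high-level feature vectors Y, one per image.
    """
    batch = feature_maps.shape[0]
    flat_maps = feature_maps.reshape(batch, -1)
    # Real pair: feature map and feature vector come from the same image.
    real = np.concatenate([flat_maps, global_vecs], axis=1)
    # Fake pair: same feature vector, feature map taken from another image.
    # A roll by 1..B-1 positions guarantees no map stays with its own image.
    shift = int(rng.integers(1, batch))
    shuffled_maps = np.roll(flat_maps, shift, axis=0)
    fake = np.concatenate([shuffled_maps, global_vecs], axis=1)
    return real, fake

rng = np.random.default_rng(0)
maps = rng.normal(size=(4, 3, 3, 2))   # B=4 images, M=3, C=2 (toy sizes)
vecs = rng.normal(size=(4, 5))         # D=5
real, fake = make_global_pairs(maps, vecs, rng)
```

Both `real` and `fake` would then be fed to the discriminator, which should score the real pairs higher.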

1.1.1. Donsker-Varadhan (DV)

  • One of the approaches follows Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018), which uses a lower bound on the MI based on the Donsker-Varadhan representation (DV, Donsker & Varadhan, 1983) of the KL-divergence:
      I(X; Y) := D_KL(J || M) ≥ Î_ω^(DV)(X; Y) := E_J[T_ω(x, y)] − log E_M[exp(T_ω(x, y))]
  • where J is the joint distribution, M is the product of marginals, and T_ω: X×Y → ℝ is a discriminator function modeled by a neural network with parameters ω.
  • At a high level, E_ψ is optimized by simultaneously estimating and maximizing I(X; E_ψ(X)):
      (ω̂, ψ̂)_G = argmax_{ω,ψ} Î_ω(X; E_ψ(X))
  • where the subscript G denotes “global”.
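
As a rough illustration of the DV bound (the mean discriminator score on joint pairs minus the log of the mean exponentiated score on product-of-marginals pairs), here is a small NumPy sketch. It assumes the discriminator scores are already arranged in a square matrix whose diagonal holds the same-image pairs; that layout and the name `dv_lower_bound` are my assumptions.

```python
import numpy as np

def dv_lower_bound(scores):
    """Donsker-Varadhan style estimate from a (B, B) score matrix.

    scores[i, j] is the discriminator score for input i paired with the
    feature vector of image j. Diagonal entries are treated as samples from
    the joint, off-diagonal entries as samples from the product of marginals.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    is_joint = np.eye(n, dtype=bool)
    joint_term = scores[is_joint].mean()
    marginal_scores = scores[~is_joint]
    # log of the mean of exp(score), with the max subtracted for stability.
    shift = marginal_scores.max()
    log_mean_exp = shift + np.log(np.mean(np.exp(marginal_scores - shift)))
    return joint_term - log_mean_exp

dv_uninformative = dv_lower_bound(np.zeros((4, 4)))   # scores carry no signal
dv_informative = dv_lower_bound(3.0 * np.eye(4))      # same-image pairs score higher
```

When the scores do not separate real from fake pairs the estimate is zero, and it grows as same-image pairs receive higher scores.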

1.1.2. Jensen-Shannon Divergence (JSD)

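  • Jensen-Shannon MI estimator (following the formulation of Nowozin et al., 2016):
      Î_{ω,ψ}^(JSD)(X; E_ψ(X)) := E_P[−sp(−T_{ψ,ω}(x, E_ψ(x)))] − E_{P×P̃}[sp(T_{ψ,ω}(x′, E_ψ(x)))]
  • where x is an input sample, x′ is an input sampled from P̃ = P, and sp(z) = log(1 + e^z) is the softplus function.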
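
The JSD-style estimator can be sketched in the same way: apply −sp(−score) to the same-image pairs and sp(score) to the mismatched pairs. As before, the square score-matrix layout and the function names are assumptions made for illustration only.

```python
import numpy as np

def softplus(z):
    # sp(z) = log(1 + e^z), computed stably.
    return np.logaddexp(0.0, z)

def jsd_mi_estimate(scores):
    """Jensen-Shannon style estimate from a (B, B) score matrix.

    Diagonal entries score (x, E(x)) pairs; off-diagonal entries score
    (x', E(x)) pairs where x' is a different image.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    is_positive = np.eye(n, dtype=bool)
    positive_term = np.mean(-softplus(-scores[is_positive]))
    negative_term = np.mean(softplus(scores[~is_positive]))
    return positive_term - negative_term

jsd_zero = jsd_mi_estimate(np.zeros((4, 4)))
# Well separated: +20 on the diagonal, -20 elsewhere.
jsd_separated = jsd_mi_estimate(20.0 * (2.0 * np.eye(4) - 1.0))
```

With uninformative scores this gives −2·log 2, and it approaches 0 from below as the discriminator separates the two kinds of pairs, so the value is not itself an MI in nats.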

1.1.3. Noise-Contrastive Estimation (NCE)

  • The third option is the InfoNCE bound from noise-contrastive estimation. Using InfoNCE is often found to outperform JSD on downstream tasks.
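
For comparison with the other two estimators, here is a hedged sketch of an InfoNCE-style estimate on the same kind of score matrix. Including the positive pair in the normalizing sum, as done here, is a common implementation choice and an assumption on my part, as is the name `info_nce_estimate`.

```python
import numpy as np

def info_nce_estimate(scores):
    """InfoNCE style estimate from a (B, B) score matrix.

    Column j holds the scores of every input against the feature vector of
    image j. The positive is the diagonal entry; the whole column (positive
    included, an assumption) forms the softmax normalizer.
    """
    scores = np.asarray(scores, dtype=float)
    col_max = scores.max(axis=0, keepdims=True)
    log_norm = col_max + np.log(np.sum(np.exp(scores - col_max), axis=0, keepdims=True))
    log_softmax = scores - log_norm
    return np.mean(np.diag(log_softmax))

nce_zero = info_nce_estimate(np.zeros((4, 4)))
nce_separated = info_nce_estimate(20.0 * np.eye(4))
```

With B candidates and uninformative scores the value is −log B. It rises toward 0 as the positive dominates each column, which hints at why this estimator relies on having many negative samples.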

1.2. Local DIM: DIM(L)

Maximizing mutual information between local features and global features.
  • First, the image is encoded to a feature map C_ψ(x) that reflects some structural aspect of the data, e.g., spatial locality, and this feature map is further summarized into a global feature vector E_ψ(x).
  • Then, this feature vector is concatenated with the lower-level feature map at every location.
  • A score is produced for each local-global pair through an additional function.
  • The MI estimator is applied to the global/local pairs, maximizing the average estimated MI:
      (ω̂, ψ̂)_L = argmax_{ω,ψ} (1/M²) Σ_{i=1}^{M²} Î_{ω,ψ}(C_ψ^(i)(X); E_ψ(X))
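
Below is a minimal sketch of the local objective: score every local feature against every global vector, estimate MI per location, then average over the M×M locations. To keep it short I score pairs with a plain dot product instead of the concatenate-then-score network described above, and I use the JSD-style estimator; both are simplifying assumptions, and `local_dim_jsd` is a hypothetical name.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def local_dim_jsd(local_feats, global_vecs):
    """Average a JSD style estimate over all local positions.

    local_feats: (B, L, D) with L = M*M local feature vectors per image.
    global_vecs: (B, D), one global feature vector per image.
    """
    batch = local_feats.shape[0]
    # scores[a, i, b] = dot(local feature i of image a, global vector of image b)
    scores = np.einsum("aid,bd->aib", local_feats, global_vecs)
    idx = np.arange(batch)
    pos = scores[idx, :, idx]                        # (B, L): same-image pairs
    pos_term = np.mean(-softplus(-pos), axis=0)      # (L,): one value per location
    # Sum softplus over all pairs, then remove the same-image pairs.
    neg_sum = softplus(scores).sum(axis=(0, 2)) - softplus(pos).sum(axis=0)
    neg_term = neg_sum / (batch * (batch - 1))       # (L,)
    return np.mean(pos_term - neg_term)              # average over locations

# Toy batch: 2 images, M=2 (4 locations), D=2.
global_toy = 5.0 * np.array([[1.0, -1.0], [-1.0, 1.0]])
local_toy = np.repeat(global_toy[:, None, :], 4, axis=1)
local_aligned = local_dim_jsd(local_toy, global_toy)
local_zero = local_dim_jsd(np.zeros((2, 4, 2)), np.zeros((2, 2)))
```

In the toy batch every local feature agrees with its own image's global vector, so the estimate is near its maximum of 0, while all-zero features give the uninformative value −2·log 2.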

1.3. Prior Matching

Matching the output of the encoder to a prior.
  • “Real” samples drawn from a prior and “fake” samples drawn from the encoder output are sent to a discriminator.
  • The discriminator is trained to distinguish between (classify) these sets of samples. The encoder is trained to “fool” the discriminator.
  • With prior matching, the encoder produces features whose distribution is close to the prior.
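
The real/fake game above can be written as a GAN-style value: the mean of log D on prior samples plus the mean of log(1 − D) on encoder outputs. The sketch below assumes the discriminator outputs are already probabilities in [0, 1]; the function name and the `eps` guard are my own additions.

```python
import numpy as np

def prior_matching_value(d_on_prior, d_on_encoded, eps=1e-12):
    """GAN style value for prior matching.

    d_on_prior:   discriminator probabilities for samples drawn from the prior.
    d_on_encoded: discriminator probabilities for encoder outputs.
    The discriminator tries to increase this value; the encoder tries to
    decrease it by making its outputs look like prior samples.
    """
    d_on_prior = np.asarray(d_on_prior, dtype=float)
    d_on_encoded = np.asarray(d_on_encoded, dtype=float)
    real_term = np.mean(np.log(d_on_prior + eps))
    fake_term = np.mean(np.log(1.0 - d_on_encoded + eps))
    return real_term + fake_term

fooled = prior_matching_value([0.5, 0.5], [0.5, 0.5])        # discriminator cannot tell
confident = prior_matching_value([0.99, 0.98], [0.01, 0.03])  # discriminator separates them
```

A fully fooled discriminator gives −2·log 2, which is the value the encoder pushes toward.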

1.4. Complete Objective

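  • All three objectives (global MI maximization, local MI maximization, and prior matching) can be used together as the complete objective for Deep InfoMax (DIM):
      argmax_{ω₁,ω₂,ψ} ( α Î_{ω₁,ψ}(X; E_ψ(X)) + (β/M²) Σ_{i=1}^{M²} Î_{ω₂,ψ}(C_ψ^(i)(X); E_ψ(X)) ) + argmin_ψ argmax_φ γ D̂_φ(V || U_{ψ,P})
  • where ω₁ and ω₂ are the discriminator parameters for the global and local objectives, D̂_φ(V || U_{ψ,P}) is the prior-matching term that compares the prior V with the encoder's output distribution U_{ψ,P} using a discriminator with parameters φ, and α, β, and γ are hyperparameters.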
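
One way to read the combined objective from the encoder's side is as a single scalar to minimize: the negated, weighted MI estimates plus the weighted prior-matching value. This collapses the min-max game into one number and leaves out the discriminator updates, so treat it as a simplification. The weight names `alpha`, `beta`, `gamma` and all numbers below are made up for illustration.

```python
def dim_encoder_loss(global_mi, local_mi, prior_value, alpha=1.0, beta=1.0, gamma=1.0):
    """Scalar the encoder would minimize in this simplified view.

    global_mi:   estimated MI between the whole input and the feature vector.
    local_mi:    estimated MI averaged over the M*M local positions.
    prior_value: prior-matching value (lower means closer to the prior).
    """
    return -(alpha * global_mi + beta * local_mi) + gamma * prior_value

# Illustrative settings only: switch one MI term off to get a global-only
# or a local-only variant.
global_only = dim_encoder_loss(0.8, 0.5, -1.2, alpha=1.0, beta=0.0, gamma=1.0)
local_only = dim_encoder_loss(0.8, 0.5, -1.2, alpha=0.0, beta=1.0, gamma=0.1)
```

Setting one of the MI weights to zero recovers a global-only or local-only variant, in the spirit of the DIM(G) and DIM(L) settings.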

2. Results

Classification accuracy (top 1) results on CIFAR10 and CIFAR100.
Classification accuracy (top 1) results on Tiny ImageNet and STL-10.

In general, DIM with the local objective, DIM(L), outperformed all models presented here by a significant margin on all datasets.

Among DV, JSD & InfoNCE, InfoNCE tends to perform best.

Comparisons of DIM with Contrastive Predictive Coding (CPC).

DIM(L) is competitive with CPC using InfoNCE.

This is an early paper on self-supervised learning, and it tries many things: three MI estimators (DV, JSD, and InfoNCE), global and local objectives, and prior matching.


