Brief Review — Learning Deep Representations by Mutual Information Estimation and Maximization
Learning Deep Representations by Mutual Information Estimation and Maximization,
Deep InfoMax (DIM), by Microsoft Research
2019 ICLR, Over 1800 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Contrastive Learning, Image Classification
- Self-supervised representation learning is based on maximizing mutual information between features extracted from multiple views of a shared context.
- While multiple views of a context could be produced by observing it from different locations (e.g., camera positions within a scene) or via different modalities (e.g., tactile, auditory, or visual), an ImageNet image can provide a context from which multiple views are produced by repeatedly applying data augmentation.
- This is a paper from Prof. Bengio's research group.
- Deep InfoMax (DIM)
1. Deep InfoMax (DIM)
- Let X and Y be the domain (e.g., images) and range (e.g., feature vectors) of a continuous, (almost everywhere) differentiable parametric function, Eψ: X→Y, with parameters ψ.
- The encoder should be trained such that the mutual information is maximized: Find the set of parameters, ψ, such that the mutual information, I(X; Eψ(X)), is maximized. Depending on the end-goal, this maximization can be done over the complete input, X, or some structured or “local” subset.
- To maximize MI, a discriminator is trained to classify whether an input/feature pair is drawn from the joint distribution (“real”) or from the product of marginals (“fake”).
1.1. Global DIM: DIM(G)
- The above shows Deep InfoMax (DIM) with a global MI(X; Y) objective.
- Here, both the high-level feature vector Y and the lower-level M×M feature map are passed through a discriminator to get a score.
- Fake samples are drawn by combining the same feature vector with an M×M feature map from another image.
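The real/fake pairing for the global objective can be sketched in numpy as follows. This is a minimal sketch, assuming batched arrays; `make_global_pairs` is a hypothetical helper, and the shapes are illustrative.

```python
import numpy as np

def make_global_pairs(feature_vecs, feature_maps):
    """Pair each global feature vector with its own M x M feature map
    ("real") and with a feature map from another image in the batch ("fake").

    feature_vecs: (B, D) global feature vectors E_psi(x)
    feature_maps: (B, M, M, C) lower-level feature maps
    """
    # Roll the batch by one so every vector is matched with a map
    # belonging to a different image (a simple negative-sampling scheme).
    fake_maps = np.roll(feature_maps, shift=1, axis=0)
    real_pairs = (feature_vecs, feature_maps)
    fake_pairs = (feature_vecs, fake_maps)
    return real_pairs, fake_pairs
```

Both sets of pairs are then scored by the discriminator; only the pairing logic is shown here.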
1.1.1. Donsker-Varadhan (DV)
- One of the approaches follows Mutual Information Neural Estimation (MINE) (Belghazi et al., 2018), which uses a lower bound on the MI based on the Donsker-Varadhan representation (DV, Donsker & Varadhan, 1983) of the KL-divergence:
- I(X; Y) ≥ Î^(DV)_ω(X; Y) := E_P[Tω(x, y)] − log E_{P×~P}[e^(Tω(x, y))],
- where Tω: X×Y → ℝ is a discriminator function modeled by a neural network with parameters ω.
- At a high level, the encoder Eψ is optimized by simultaneously estimating and maximizing I(X; Eψ(X)):
- (ω̂, ψ̂)_G = argmax_{ω,ψ} Î_ω(X; Eψ(X)),
- where the subscript G denotes “global”.
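Given discriminator scores on joint (paired) and marginal (mismatched) samples, the DV lower bound can be computed as in this minimal numpy sketch (function name and shapes are assumptions):

```python
import numpy as np

def dv_mi_lower_bound(scores_joint, scores_marginal):
    """Donsker-Varadhan lower bound on MI (as used in MINE):
        I_hat = E_P[T(x, y)] - log E_{P x ~P}[exp(T(x', y))]
    scores_joint:    T applied to matched (x, E(x)) pairs, shape (N,)
    scores_marginal: T applied to mismatched pairs, shape (N,)
    """
    first_term = scores_joint.mean()
    # log-mean-exp, stabilized by subtracting the max score
    m = scores_marginal.max()
    second_term = m + np.log(np.mean(np.exp(scores_marginal - m)))
    return first_term - second_term
```

In training, gradients of this scalar would flow back into both the discriminator and the encoder; here only the bound itself is computed.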
1.1.2. Jensen-Shannon Divergence (JSD)
- Jensen-Shannon MI estimator (following the formulation of Nowozin et al., 2016):
- Î^(JSD)_{ω,ψ}(X; Eψ(X)) := E_P[−sp(−Tψ,ω(x, Eψ(x)))] − E_{P×~P}[sp(Tψ,ω(x′, Eψ(x)))],
- where x is an input sample, x′ is an input sampled from ~P = P, and sp(z) = log(1+e^z) is the softplus function.
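The JSD estimator above can be sketched directly from discriminator scores. A minimal numpy version, assuming the scores are precomputed:

```python
import numpy as np

def softplus(z):
    # numerically stable sp(z) = log(1 + exp(z))
    return np.logaddexp(0.0, z)

def jsd_mi_estimate(scores_joint, scores_marginal):
    """Jensen-Shannon MI estimator:
        I_hat = E_P[-sp(-T(x, E(x)))] - E_{P x ~P}[sp(T(x', E(x)))]
    scores_joint:    scores on matched pairs, shape (N,)
    scores_marginal: scores on mismatched pairs, shape (N,)
    """
    return -softplus(-scores_joint).mean() - softplus(scores_marginal).mean()
```

Unlike the DV bound, each term is a plain expectation, which tends to give lower-variance gradients in practice.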
1.1.3. Noise-Contrastive Estimation (NCE)
- InfoNCE (Oord et al., 2018) adapts Noise-Contrastive Estimation to MI estimation:
- Î^(infoNCE)_{ω,ψ}(X; Eψ(X)) := E_P[Tψ,ω(x, Eψ(x)) − E_{~P}[log Σ_{x′} e^(Tψ,ω(x′, Eψ(x)))]].
- It is found that using InfoNCE often outperforms JSD on downstream tasks.
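A common way to compute an InfoNCE-style bound is from an N×N score matrix over a batch, with positives on the diagonal. A minimal numpy sketch, with illustrative names and shapes:

```python
import numpy as np

def infonce_lower_bound(score_matrix):
    """InfoNCE lower bound from an (N, N) score matrix where entry (i, j)
    is T(x_j, E(x_i)); the diagonal holds the positive (matched) pairs.
        I_hat = mean_i [ T(x_i, E(x_i)) - logsumexp_j T(x_j, E(x_i)) ] + log N
    """
    n = score_matrix.shape[0]
    pos = np.diag(score_matrix)
    # row-wise log-sum-exp, stabilized by subtracting the row max
    m = score_matrix.max(axis=1, keepdims=True)
    lse = m.squeeze(1) + np.log(np.exp(score_matrix - m).sum(axis=1))
    return (pos - lse).mean() + np.log(n)
```

Note the bound is capped at log N, so large batches are needed to estimate large MI values — one reason InfoNCE benefits from many negative samples.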
1.2. Local DIM: DIM(L)
- First, the image is encoded to a feature map CΨ(x) that reflects some structural aspect of the data, e.g. spatial locality, and this feature map is further summarized into a global feature vector.
- Then, this feature vector is concatenated with the lower-level feature map at every location.
- A score is produced for each local-global pair through an additional function.
- The MI estimator is applied to global/local pairs, maximizing the average estimated MI:
- (ω̂, ψ̂)_L = argmax_{ω,ψ} (1/M²) Σᵢ Î_{ω,ψ}(C^(i)_ψ(X); Eψ(X)).
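Averaging a pairwise MI estimator over all M×M local/global pairs can be sketched as follows; `local_dim_objective` is a hypothetical helper, and any pairwise lower bound (e.g., a DV- or JSD-style bound) can be plugged in as `estimator`:

```python
import numpy as np

def local_dim_objective(score_maps_joint, score_maps_marginal, estimator):
    """Average an MI lower-bound estimator over all M x M spatial locations.

    score_maps_joint:    (B, M, M) scores for matched local/global pairs
    score_maps_marginal: (B, M, M) scores for mismatched pairs
    estimator: callable taking two 1-D score arrays, returning a scalar bound
    """
    m = score_maps_joint.shape[1]
    total = 0.0
    for i in range(m):
        for j in range(m):
            total += estimator(score_maps_joint[:, i, j],
                               score_maps_marginal[:, i, j])
    return total / (m * m)
```

In practice the per-location estimates are computed in one vectorized pass; the explicit loop here just mirrors the sum in the objective.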
1.3. Prior Matching
- “Real” samples are drawn from a prior while “fake” samples from the encoder output are sent to a discriminator.
- The discriminator is trained to distinguish between (classify) these sets of samples. The encoder is trained to “fool” the discriminator.
- With prior matching, the encoder is encouraged to generate features that are close to the prior.
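The adversarial prior-matching step can be sketched with GAN-style losses computed from discriminator outputs in (0, 1); the function name and the non-saturating encoder loss are illustrative assumptions:

```python
import numpy as np

def prior_matching_losses(d_prior, d_encoded):
    """GAN-style prior matching from discriminator probabilities.

    d_prior:   D(z) for samples z drawn from the prior ("real"), shape (N,)
    d_encoded: D(E(x)) for encoder outputs ("fake"), shape (N,)
    """
    eps = 1e-12  # avoid log(0)
    # Discriminator: distinguish prior samples from encoder outputs.
    disc_loss = -(np.log(d_prior + eps).mean()
                  + np.log(1.0 - d_encoded + eps).mean())
    # Encoder: fool the discriminator (non-saturating formulation).
    enc_loss = -np.log(d_encoded + eps).mean()
    return disc_loss, enc_loss
```

The two losses are minimized by alternating updates, exactly as in a standard GAN, with the encoder playing the generator's role.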
1.4. Complete Objective
- All three objectives — global and local MI maximization and prior matching — can be used together, as the complete objective for Deep InfoMax (DIM):
- argmax_{ω1,ω2,ψ} (α Î_{ω1,ψ}(X; Eψ(X)) + (β/M²) Σᵢ Î_{ω2,ψ}(C^(i)_ψ(X); Eψ(X))) + argmin_ψ argmax_φ γ D̂_φ(V || U_{ψ,P}),
- where α, β, and γ are trade-off hyperparameters (e.g., DIM(L) sets α = 0 and β = 1).
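From the encoder's perspective, the combined objective reduces to one weighted loss. A minimal sketch, with illustrative weight values (the paper tunes α, β, γ per variant):

```python
def dim_encoder_loss(global_mi_est, local_mi_est, prior_loss,
                     alpha, beta, gamma):
    """Encoder loss for DIM: maximize the global and local MI estimates
    (hence the negation) while minimizing the prior-matching loss.

    global_mi_est: scalar global MI lower-bound estimate
    local_mi_est:  scalar average local MI lower-bound estimate
    prior_loss:    scalar encoder-side prior-matching loss
    alpha, beta, gamma: trade-off weights
    """
    return -(alpha * global_mi_est + beta * local_mi_est) + gamma * prior_loss
```

Setting α = 0 recovers a purely local variant and β = 0 a purely global one, matching how the paper ablates DIM(L) and DIM(G).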
This is an early paper on self-supervised learning, and many ideas are explored in it.
[2019 ICLR] [Deep InfoMax (DIM)]
Learning Deep Representations by Mutual Information Estimation and Maximization