Review — Learning Representations by Maximizing Mutual Information Across Views

Augmented Multiscale DIM (AMDIM), an Extension of Deep InfoMax (DIM)

Sik-Ho Tsang
5 min read · Jan 4, 2023

Learning Representations by Maximizing Mutual Information Across Views, AMDIM, by Microsoft Research
2019 NeurIPS, Over 900 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification

  • A self-supervised learning approach is proposed that maximizes mutual information between features extracted from multiple views of a shared context, where the views can be different locations (e.g., camera positions within a scene), different modalities (e.g., tactile, auditory, or visual), or different data augmentations of the same image.
  • This work further extends Deep InfoMax (DIM).

Outline

  1. Augmented Multiscale DIM (AMDIM)
  2. Results

1. Augmented Multiscale DIM (AMDIM)

(a): Local DIM with predictions across views generated by data augmentation. (b): Augmented Multiscale DIM (AMDIM), with multiscale infomax across views generated by data augmentation. (c)-top: Efficient NCE with minibatches of n_a images comprising one antecedent and n_c consequents per image. (c)-bottom: ImageNet encoder architecture.

1.1. Local DIM

  • AMDIM extends Local DIM.
  • Local DIM maximizes mutual information between global features f_1(x), produced by a convolutional encoder f, and local features {f_7(x)_ij : ∀i, j}, produced by an intermediate layer in f. The subscripts denote the spatial size of the feature map the features come from: f_1(x) is the 1×1 global feature vector, and f_7(x)_ij are the entries of a 7×7 local feature map.

The proposed AMDIM refers to the features that encode the data to condition on (global features) as antecedent features, and the features to be predicted (local features) as consequent features.
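A minimal sketch of how global (antecedent) and local (consequent) features could be pulled from a convolutional encoder is shown below; the ToyEncoder class, layer choices, and feature dimension are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch: extract a 7x7 local feature map (consequents) and a 1x1 global
# feature (antecedent) from a small convolutional encoder. Purely illustrative.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.local_net = nn.Sequential(            # produces a 7x7 local feature map
            nn.Conv2d(3, dim, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),
        )
        self.global_net = nn.Sequential(           # collapses it to a 1x1 global feature
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        f7 = self.local_net(x)       # consequent features f_7(x)_ij, shape (N, D, 7, 7)
        f1 = self.global_net(f7)     # antecedent feature f_1(x), shape (N, D, 1, 1)
        return f1.flatten(1), f7
```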

1.2. Noise Contrastive Estimation (NCE)

  • The positive sample pair (f_1(x), f_7(x)_ij) is drawn from the joint distribution p(f_1(x), f_7(x)_ij). N_7 denotes a set of negative samples, comprising many “distractor” consequent features drawn independently from the marginal distribution p(f_7(x)_ij).
  • The loss is a standard log-softmax, where the normalization is over a large set of matching scores Φ(f_1, f_7); a reconstruction of its form is given below.
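The paper's exact equation is not reproduced here; the reconstruction below gives the standard InfoNCE log-softmax form implied by the definitions above, using the symbols f_1, f_7, Φ, and N_7 from the text.

```latex
% Reconstruction of the NCE objective implied above (not copied verbatim from the paper).
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp\big(\Phi(f_1(x), f_7(x)_{ij})\big)}
                {\exp\big(\Phi(f_1(x), f_7(x)_{ij})\big)
                 + \sum_{\tilde{f}_7 \in N_7} \exp\big(\Phi(f_1(x), \tilde{f}_7)\big)}
    \right]
```

Minimizing this loss maximizes a lower bound on the mutual information between antecedent and consequent features.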

1.3. Efficient NCE Computation

  • To compute the MI bound for many positive sample pairs with large negative sample sets, e.g. |N_7| ≫ 10,000, a simple dot product is used for the matching score: Φ(f_1, f_7) = φ_1(f_1)⊤φ_7(f_7).
  • where the functions φ_1/φ_7 non-linearly transform their inputs to some other vector space.
  • These equations are essentially the contrastive loss commonly used nowadays, such as in SimCLR.
  • Two tricks are used to mitigate occasional instability in the NCE cost; both appear in the sketch after this list.
  1. The first trick is to add a weighted regularization term that penalizes the squared matching scores.
  2. The second trick is to apply a soft clipping non-linearity to the scores.
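Below is a minimal sketch (not the paper's implementation) of the efficient NCE step with both tricks. It assumes the features have already been projected by φ_1/φ_7 and that each image contributes a single consequent, with the other rows of the minibatch acting as negatives; the paper uses many consequents per image and far larger negative sets, and the clipping constant and regularization weight here are illustrative values.

```python
import torch
import torch.nn.functional as F

def nce_loss(antecedents, consequents, clip_c=20.0, reg_weight=4e-2):
    """antecedents: (N, D) projected features phi_1(f_1); consequents: (N, D) projected
    positives phi_7(f_7). Every other row in the batch acts as a negative."""
    # Dot-product matching scores between all antecedent/consequent pairs: (N, N)
    scores = antecedents @ consequents.t()
    # Trick 2: soft clipping non-linearity keeps scores in a bounded range
    clipped = clip_c * torch.tanh(scores / clip_c)
    # Positive pairs lie on the diagonal; the log-softmax normalizes over each row
    targets = torch.arange(antecedents.size(0), device=antecedents.device)
    loss_nce = F.cross_entropy(clipped, targets)
    # Trick 1: weighted penalty on the squared matching scores
    loss_reg = reg_weight * (scores ** 2).mean()
    return loss_nce + loss_reg
```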

1.4. Data Augmentation

  • AMDIM extends Local DIM by maximizing mutual information between features from augmented views of each input.
  • The NCE infomax objective is rewritten to include prediction across data augmentation:
  • where two augmented images x¹ ~ A(x) and x² ~ A(x) are sampled from A, a stochastic data augmentation pipeline: random resized crop, random jitter in color space, random conversion to grayscale, and a random horizontal flip (a minimal sketch of this pipeline follows).
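A minimal sketch of the stochastic augmentation A using torchvision is given below; the crop size, jitter strengths, and grayscale probability are illustrative assumptions rather than the paper's exact settings.

```python
from torchvision import transforms

# Stochastic augmentation A(x): each call produces a different random view of x.
augment = transforms.Compose([
    transforms.RandomResizedCrop(128),            # random resized crop
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),   # random jitter in color space
    transforms.RandomGrayscale(p=0.25),           # random conversion to grayscale
    transforms.RandomHorizontalFlip(),            # random horizontal flip
    transforms.ToTensor(),
])

# Two augmented views of the same PIL image x:
# x1, x2 = augment(x), augment(x)
```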

1.5. Multiscale Mutual Information

  • AMDIM further extends Local DIM by maximizing mutual information across multiple feature scales.
  • A family of n-to-m infomax costs is defined:
  • where n and m index feature maps of different spatial scales (n×n and m×m) taken from different layers of the encoder. In AMDIM, mutual information is maximized 1-to-5, 1-to-7, and 5-to-5, as sketched after this list.
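The sketch below combines the three scale pairs into one objective, reusing the hypothetical nce_loss helper from the sketch in Section 1.3; f1, f5, and f7 denote features from 1×1, 5×5, and 7×7 feature maps, and for simplicity one consequent location is assumed to have been sampled per image, so every tensor is (N, D).

```python
def amdim_multiscale_loss(view_a, view_b):
    """view_a / view_b: dicts of features {'f1', 'f5', 'f7'}, each of shape (N, D),
    computed from two augmented views of the same minibatch of images."""
    total = 0.0
    # 1-to-5, 1-to-7, and 5-to-5 infomax costs, predicted in both directions across views
    for ant, con in [('f1', 'f5'), ('f1', 'f7'), ('f5', 'f5')]:
        total = total + nce_loss(view_a[ant], view_b[con])
        total = total + nce_loss(view_b[ant], view_a[con])
    return total
```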

1.6. Encoder

  • The encoder is based on the standard ResNet, with modifications mainly to control the receptive fields of the local features.
  • The models are trained using 4–8 standard Tesla V100 GPUs per model.

1.7. Mixture-Based Representations

  • AMDIM is extended to use mixture-based features.
  • For each antecedent feature f_1, a set of mixture features {f_1^1, …, f_1^k} is computed, where k is the number of mixture components.
  • And {f_1^1, …, f_1^k} = m_k(f_1), where m_k(·) is a fully-connected network with a single ReLU hidden layer.
  • When using mixture features, the following objective is maximized:
  • where αH(q) is an entropy maximization term and the optimal distribution q takes the following form:
  • where τ is a temperature parameter, borrowed from knowledge distillation, that controls the entropy of q (see the sketch after this list).
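As a rough illustration, the sketch below shows a hypothetical mixture head: m_k is a fully-connected network with a single ReLU hidden layer mapping an antecedent feature f_1 to k mixture features, and q is a temperature-controlled softmax over the components' matching scores. The class name, hidden width, and default τ are assumptions, not values from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class MixtureHead(nn.Module):
    def __init__(self, dim, k):
        super().__init__()
        self.k = k
        # fully-connected network with a single ReLU hidden layer
        self.m_k = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, k * dim))

    def forward(self, f1):
        # f1: (N, D) antecedent features -> (N, k, D) mixture features {f_1^1, ..., f_1^k}
        return self.m_k(f1).view(f1.size(0), self.k, -1)

def mixture_weights(scores, tau=0.1):
    # scores: (N, k) matching scores of each mixture component against a consequent;
    # tau controls the entropy of q (larger tau -> higher entropy)
    return F.softmax(scores / tau, dim=-1)
```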

2. Results

(a): ImageNet and ImageNet→Places205 transfer tasks using linear evaluation. (b): CIFAR10 and CIFAR100, using linear and MLP evaluation. (c): Single ablations on STL10 and ImageNet.
  • (a): Large AMDIM models are trained for 150 epochs on 8 NVIDIA Tesla V100 GPUs. The small model is trained with a shorter 50-epoch schedule and achieves 62.7% accuracy in 2 days on 4 GPUs.

(a): AMDIM outperforms prior and concurrent methods by a large margin.
(b): AMDIM features performed on par with classic fully-supervised models.
(c): The strongest results used the Fast AutoAugment augmentation policy.

The components above, such as the NCE computation and the generation of multiple augmented views, are now commonly used throughout self-supervised learning.

Reference

[2019 NeurIPS] [AMDIM]
Learning Representations by Maximizing Mutual Information Across Views
