Review — InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Besides Using Latent Vector z, Latent Code c is also Input to GAN, for Learning Disentangled Representations

Sep 8, 2021

In this story, Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (InfoGAN), by OpenAI, is reviewed. In this paper:

InfoGAN is designed to maximize the mutual information between a small subset of the latent variables and the observation.
A lower bound of the mutual information objective is derived that can be optimized efficiently.
By doing so, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, and disentangles visual concepts such as hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset.

This is a paper in 2016 NIPS with over 3000 citations. (Sik-Ho Tsang @ Medium)

Outline

InfoGAN Concept
InfoGAN Framework
Experimental Results

1. InfoGAN Concept

1.1. Minimax Game Using Mutual Information

In information theory, the mutual information between X and Y, I(X; Y), measures the “amount of information” that knowing the random variable Y reveals about the other random variable X.

The mutual information can be expressed as the difference of two entropy terms:
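Written out in LaTeX, with H(X) denoting the entropy of X and H(X|Y) the conditional entropy of X given Y (standard information-theoretic notation, assumed here), the identity is:

```latex
I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
```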

I(X; Y) is the reduction of uncertainty in X when Y is observed.
If X and Y are independent, then I(X; Y) = 0, because knowing one variable reveals nothing about the other.

By contrast, if X and Y are related by a deterministic, invertible function, then maximal mutual information is attained.
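The two extreme cases can be checked numerically. The following is a minimal sketch for discrete variables only; the helper name `mutual_information` and the 4-state toy distributions are illustrative assumptions, not something taken from the paper.

```python
import math


def mutual_information(joint):
    """I(X; Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # terms with zero probability contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


k = 4  # hypothetical number of states for the toy example

# Independent case: the joint is the product of two uniform marginals.
independent = [[1.0 / (k * k)] * k for _ in range(k)]

# Deterministic, invertible case: Y = X with X uniform.
identity = [[1.0 / k if i == j else 0.0 for j in range(k)] for i in range(k)]

mi_independent = mutual_information(independent)
mi_identity = mutual_information(identity)
```

Here `mi_independent` comes out as zero, while `mi_identity` equals the entropy of X (log 4 in nats), which is the largest value I(X; Y) can take for a 4-state X.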

The uncertainty-reduction interpretation makes it easy to formulate a cost (similar mutual-information-inspired objectives have been considered before in the context of clustering [23–25]):

Given any x~P_G(x), we want P_G(c|x) to have a small entropy. In other words, the information in the latent code c should not be lost in the generation process.

In a regular GAN, the minimax game is:
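For reference, the standard GAN value function can be written in LaTeX as follows, where D is the discriminator, G the generator, P_data the data distribution and z the input noise (symbol names follow the usual GAN notation and are assumed here):

```latex
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim \text{noise}}[\log(1 - D(G(z)))]
```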

Now, the following information-regularized minimax game is solved:
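In LaTeX, with V(D, G) denoting the regular GAN value function and λ a weighting hyperparameter, the regularized game takes the form:

```latex
\min_G \max_D V_I(D, G) = V(D, G) - \lambda I(c; G(z, c))
```

The generator is thus rewarded for keeping the mutual information between the code c and the generated sample G(z, c) high.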

1.2. Variational Mutual Information Maximization

In practice, the mutual information term I(c; G(z, c)) is hard to maximize directly as it requires access to the posterior P(c|x).
But a lower bound of it can be obtained by defining an auxiliary distribution Q(c|x) to approximate P(c|x). A variational lower bound, LI(G, Q), of the mutual information is:
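Written in LaTeX, with H(c) the entropy of the latent code prior (treated as a constant when the prior is fixed), the bound reads:

```latex
L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}[\log Q(c \mid x)] + H(c) \le I(c; G(z, c))
```

The expectation can be estimated by sampling c from the prior, generating x = G(z, c), and evaluating log Q(c|x), so the true posterior P(c|x) is never needed.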

LI(G, Q) is added to GAN’s objectives with no change to GAN’s training procedure, which results in Information Maximizing Generative Adversarial Networks (InfoGAN).
InfoGAN is defined as the following minimax game with a variational regularization of mutual information LI(G, Q) and a hyperparameter λ:
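In LaTeX, with V(D, G) the regular GAN value function and L_I(G, Q) the variational lower bound on the mutual information, the game is:

```latex
\min_{G, Q} \max_D V_{\text{InfoGAN}}(D, G, Q) = V(D, G) - \lambda L_I(G, Q)
```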

2. InfoGAN Framework

**InfoGAN Framework** (Figure from "InfoGAN 简介与代码实战" [InfoGAN: Introduction and Hands-On Code])

The auxiliary distribution Q is parametrized as a neural network.
In most experiments, Q and D share all convolutional layers and there is one final fully connected layer to output parameters for the conditional distribution Q(c|x), which means InfoGAN only adds a negligible computation cost to GAN.
It is observed that LI(G, Q) always converges faster than normal GAN objectives and hence InfoGAN essentially comes for free with GAN.
For categorical latent code ci, we use the natural choice of softmax nonlinearity to represent Q(ci|x).
For continuous latent code cj, there are more options depending on what the true posterior P(cj|x) is. In the experiments, simply treating Q(cj|x) as a factored Gaussian is sufficient.
The experiments are based on existing techniques introduced by DCGAN.
Simply setting λ to 1 is sufficient for discrete latent codes. When the latent code contains continuous variables, a smaller λ is typically used.
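To make the two parametrizations of Q concrete, here is a minimal pure-Python sketch of how log Q(c|x) could be evaluated from the outputs of the Q head and turned into a regularization term. All function names, and the choice of having the head output a mean and a log standard deviation per continuous code, are assumptions for illustration and not the authors' implementation. The constant H(c) is omitted because it does not affect the gradients.

```python
import math


def categorical_log_q(logits, code_index):
    """log Q(c_i | x) for a categorical code, as a log-softmax of raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[code_index] - log_norm


def gaussian_log_q(mean, log_std, code_value):
    """log Q(c_j | x) for one continuous code under a Gaussian with diagonal covariance."""
    var = math.exp(2.0 * log_std)
    return -0.5 * math.log(2.0 * math.pi * var) - (code_value - mean) ** 2 / (2.0 * var)


def info_regularizer(cat_logits, cat_index, gauss_params, cont_codes, lam=1.0):
    """Single-sample estimate of -lambda * E[log Q(c | x)].

    cat_logits   : raw Q-head outputs for the categorical code
    cat_index    : index of the category that was fed to the generator
    gauss_params : list of (mean, log_std) pairs, one per continuous code
    cont_codes   : continuous code values that were fed to the generator
    """
    log_q = categorical_log_q(cat_logits, cat_index)
    for (mean, log_std), value in zip(gauss_params, cont_codes):
        log_q += gaussian_log_q(mean, log_std, value)  # factored: log-probs add up
    return -lam * log_q


# A Q head that has learned nothing outputs equal logits for all 10 categories.
uninformed = info_regularizer([0.0] * 10, 3, [], [])
# A Q head that confidently recovers category 3 yields a much smaller penalty.
confident = info_regularizer([0.0, 0.0, 0.0, 12.0] + [0.0] * 6, 3, [], [])
```

In this sketch the term would be added to the losses of both the generator and Q, so minimizing it pushes the generator to produce images from which the code can be recovered.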

3. Experimental Results

3.1 Mutual Information Maximization

**Lower bound LI over training iterations**

InfoGAN is trained on MNIST dataset with a uniform categorical distribution on latent codes c~Cat(K=10, p=0.1).
In the above figure, the lower bound LI(G, Q) is quickly maximized to H(c) ≈ 2.30, which means the derived bound is tight and maximal mutual information is achieved.
On the other hand, the generator of regular GAN is not explicitly encouraged to maximize the mutual information with the latent codes. Hence there is little mutual information between latent codes and generated images in regular GAN.
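The value 2.30 is simply the entropy, in nats, of a uniform categorical distribution over 10 classes. A short check (the helper name `categorical_entropy` is an illustrative assumption):

```python
import math


def categorical_entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# c ~ Cat(K=10, p=0.1): ten equally likely categories.
h_c = categorical_entropy([0.1] * 10)  # equals ln(10), about 2.30
```

Since the mutual information I(c; G(z, c)) can never exceed H(c), a lower bound that reaches this value cannot be improved further.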

3.2. Disentangled Representation

3.2.1. MNIST

(a): The discrete code c1 captures drastic change in shape. Changing categorical code c1 switches between digits most of the time.
Although InfoGAN is trained without any labels, c1 can be used as a classifier that achieves a 5% error rate in classifying MNIST digits by matching each category in c1 to a digit type. In the second row of (a), we can observe a digit 7 being classified as a 9.
(b): For regular GAN, changing the categorical code c1 has no clear meaning.
(c)-(d): Two continuous codes c2 and c3 are added to capture variations that are continuous in nature: c2, c3~Unif(-1, 1).

Particularly, continuous codes c2, c3 capture continuous variations in style: c2 models rotation of digits and c3 controls the width.
Images are plotted with code values from -2 to 2, covering a wide region that the network was never trained on, and meaningful generalization is still obtained.
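The traversal used for such plots can be sketched as follows: build one generator input, then sweep a single continuous code while everything else stays fixed. This is only an illustration of the input construction. The noise dimension of 62, the uniform noise distribution, the input layout `[z, c1, c2, c3]` and all names are assumptions, and the trained generator itself is not included.

```python
import random

NOISE_DIM = 62        # assumed size of the noise vector z
NUM_CATEGORIES = 10   # c1 ~ Cat(K=10, p=0.1)


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def sample_generator_input(rng):
    """One training-time input: noise z, one-hot c1, and c2, c3 ~ Unif(-1, 1)."""
    z = [rng.uniform(-1.0, 1.0) for _ in range(NOISE_DIM)]  # noise prior is an assumption
    c1 = one_hot(rng.randrange(NUM_CATEGORIES), NUM_CATEGORIES)
    c2 = rng.uniform(-1.0, 1.0)
    c3 = rng.uniform(-1.0, 1.0)
    return z + c1 + [c2, c3]


def traverse_code(base_input, position, low=-2.0, high=2.0, steps=9):
    """Copies of base_input with the entry at `position` swept from low to high."""
    rows = []
    for k in range(steps):
        row = list(base_input)
        row[position] = low + (high - low) * k / (steps - 1)
        rows.append(row)
    return rows


rng = random.Random(0)
base = sample_generator_input(rng)
C2_POSITION = NOISE_DIM + NUM_CATEGORIES  # index of c2 in the assumed layout
sweep = traverse_code(base, C2_POSITION)  # c2 goes from -2 to 2, beyond Unif(-1, 1)
```

Each row of `sweep` would then be passed through the trained generator to produce one image of the row in the figure.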

3.2.2. 3D Faces & 3D Chairs

**Manipulating Latent Codes on 3D Faces**

In this experiment, five continuous latent codes are used.

InfoGAN learns a disentangled representation that recovers azimuth (pose), elevation, lighting, and face width (wide/narrow).

**Manipulating Latent Codes on 3D Chairs**

In this experiment, four categorical codes and one continuous code are used as latent codes.

InfoGAN is also able to continuously interpolate between similar chair types of different widths using a single continuous code.

3.2.3. Street View House Number (SVHN)

On the Street View House Number (SVHN) dataset, it is significantly more challenging to learn an interpretable representation because the data is noisy, containing images of variable resolution and distracting digits, and it does not have multiple variations of the same object.
Four 10-dimensional categorical variables and two uniform continuous variables are used as latent codes.

InfoGAN can learn a disentangled representation that recovers lighting and plate context.

3.2.4. CelebA

CelebA includes 200,000 celebrity images with large pose variations and background clutter.
The latent variation is modeled as 10 uniform categorical variables, each of dimension 10.
Surprisingly, even in this complicated dataset, InfoGAN can recover azimuth as in 3D images even though in this dataset no single face appears in multiple pose positions.

Moreover, InfoGAN can disentangle other highly semantic variations such as the presence or absence of glasses, hairstyles and emotion, demonstrating that a level of visual understanding is acquired.

DC-IGN [7] was shown to learn highly interpretable graphics codes, but it requires supervision. As a result, it was previously not possible to learn a latent code for a variation that is unlabeled, and hence salient latent factors of variation could not be discovered automatically from data.

By contrast, InfoGAN is able to discover such variation on its own.

(Btw, mutual information maximization is one of the essential elements for self-supervised learning as well.)

Reference

[2016 NIPS] [InfoGAN]
Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Generative Adversarial Network (GAN)

Image Synthesis: 2014 [GAN] [CGAN] 2015 [LAPGAN] 2016 [AAE] [DCGAN] [CoGAN] [VAE-GAN] [InfoGAN] 2017 [SimGAN] [BiGAN] [ALI] [LSGAN] [EBGAN]
Image-to-image Translation: 2017 [Pix2Pix] [UNIT] [CycleGAN] 2018 [MUNIT]