Review — InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Besides Using Latent Vector z, Latent Code c is also Input to GAN, for Learning Disentangled Representations

In this story, InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (InfoGAN), by UC Berkeley and OpenAI, is reviewed. In this paper:

  • A lower bound of the mutual information objective is derived that can be optimized efficiently.
  • By doing so, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, and disentangles visual concepts such as hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset.

Outline

  1. InfoGAN Concept
  2. InfoGAN Framework
  3. Experimental Results

1. InfoGAN Concept

1.1. MinMax Game Using Mutual Information

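  • In information theory, the mutual information I(X; Y) measures how much is learned about X from observing Y: I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X), i.e. the reduction of uncertainty in X when Y is observed.
  • If X and Y are independent, then I(X; Y) = 0, because knowing one variable reveals nothing about the other.
  • This interpretation makes it easy to formulate a cost: the latent code c should remain recoverable from the generated sample G(z, c), so I(c; G(z, c)) should be high. The standard GAN value function V(D, G) is therefore regularized with this term, weighted by a hyperparameter λ:
    min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c))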
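
To make the mutual information used in this section concrete, here is a minimal sketch that computes I(X; Y) in nats for a small discrete joint distribution. The two probability tables are made-up toy values, not data from the paper; they only illustrate that I(X; Y) is 0 for independent variables and equals the entropy when one variable fully determines the other.

```python
import math


def mutual_information(joint):
    """I(X; Y) in nats for a joint probability table joint[x][y]."""
    p_x = [sum(row) for row in joint]        # marginal of X
    p_y = [sum(col) for col in zip(*joint)]  # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0.0:  # terms with zero probability contribute nothing
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


# Toy example 1: X and Y are independent fair coins.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Toy example 2: Y is a copy of X, so knowing Y reveals X completely.
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)  # 0
mi_dependent = mutual_information(dependent)      # ln 2, the entropy of a fair coin
```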

1.2. Variational Mutual Information Maximization

  • In practice, the mutual information term I(c; G(z, c)) is hard to maximize directly as it requires access to the posterior P(c|x).
  • But a lower bound can be obtained by defining an auxiliary distribution Q(c|x) to approximate P(c|x). The variational lower bound LI(G, Q) of the mutual information is:
    LI(G, Q) = E_{c~P(c), x~G(z, c)}[log Q(c|x)] + H(c) ≤ I(c; G(z, c))
  • InfoGAN is defined as the following minimax game with a variational regularization of mutual information LI(G, Q) and a hyperparameter λ:
    min_{G,Q} max_D V_InfoGAN(D, G, Q) = V(D, G) − λ LI(G, Q)
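
The lower-bound property can be checked numerically on a toy problem. The sketch below replaces the generator with a small hand-written table P(x|c) over discrete outputs (purely hypothetical numbers), so that the true posterior P(c|x) and the exact mutual information are computable. It then evaluates LI = E[log Q(c|x)] + H(c) for two choices of Q: the true posterior, where the bound should be tight, and a uniform Q, where it should be loose.

```python
import math

# Toy stand-in for a generator: P(x | c) for 3 codes and 4 discrete outputs.
# These numbers are hypothetical and only serve the illustration.
p_c = [1 / 3, 1 / 3, 1 / 3]
p_x_given_c = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
n_c, n_x = len(p_c), len(p_x_given_c[0])


def entropy(p):
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)


def true_posterior():
    """P(c | x) via Bayes' rule, returned as a table indexed [x][c]."""
    posterior = []
    for x in range(n_x):
        joint = [p_c[c] * p_x_given_c[c][x] for c in range(n_c)]
        total = sum(joint)
        posterior.append([j / total for j in joint])
    return posterior


def lower_bound(q):
    """LI = E_{c ~ P(c), x ~ P(x|c)}[log Q(c|x)] + H(c), with q indexed [x][c]."""
    expected_log_q = sum(
        p_c[c] * p_x_given_c[c][x] * math.log(q[x][c])
        for c in range(n_c)
        for x in range(n_x)
    )
    return expected_log_q + entropy(p_c)


def exact_mutual_information():
    """I(c; x) = H(c) - H(c | x), computed from the true posterior."""
    posterior = true_posterior()
    h_c_given_x = 0.0
    for x in range(n_x):
        p_x = sum(p_c[c] * p_x_given_c[c][x] for c in range(n_c))
        h_c_given_x += p_x * entropy(posterior[x])
    return entropy(p_c) - h_c_given_x


exact_mi = exact_mutual_information()
bound_with_true_posterior = lower_bound(true_posterior())   # equals exact_mi
bound_with_uniform_q = lower_bound([[1 / 3] * n_c] * n_x)   # about 0, below exact_mi
```

In this toy setting the gap between the bound and the exact value is the expected KL divergence between P(c|x) and Q(c|x), which is why a better Q tightens the bound.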

2. InfoGAN Framework

InfoGAN Framework (Figure from "InfoGAN 简介与代码实战", i.e. "InfoGAN: Introduction and Hands-On Code")
  • In most experiments, Q and D share all convolutional layers and there is one final fully connected layer to output parameters for the conditional distribution Q(c|x), which means InfoGAN only adds a negligible computation cost to GAN.
  • It is observed that LI(G, Q) always converges faster than normal GAN objectives and hence InfoGAN essentially comes for free with GAN.
  • For categorical latent code ci, we use the natural choice of softmax nonlinearity to represent Q(ci|x).
  • For continuous latent code cj, there are more options depending on what the true posterior P(cj|x) is. In the experiments, simply treating Q(cj|x) as a factored Gaussian is sufficient.
  • The experiments are based on existing techniques introduced by DCGAN.
  • Simply setting λ to 1 is sufficient for discrete latent codes. When the latent code contains continuous variables, a smaller λ is typically used.
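
The choices of Q described above translate into two simple per-sample loss terms. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the use of separate weights for the categorical and continuous terms, and the default value lam_cont=0.1 are assumptions made for illustration. In a real model the logits, means and log standard deviations would be outputs of the final fully connected layer of Q.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_nll(logits, code_index):
    """-log Q(c_i | x) for a categorical code, with Q given by softmax logits."""
    return -math.log(softmax(logits)[code_index])


def gaussian_nll(mean, log_std, code):
    """-log Q(c_j | x) for continuous codes under a factored (diagonal) Gaussian."""
    nll = 0.0
    for mu, ls, c in zip(mean, log_std, code):
        var = math.exp(2.0 * ls)
        nll += 0.5 * math.log(2.0 * math.pi) + ls + (c - mu) ** 2 / (2.0 * var)
    return nll


def info_loss(logits, code_index, mean, log_std, cont_code,
              lam_cat=1.0, lam_cont=0.1):
    """Weighted -log Q terms for one sample.

    Minimizing this maximizes the E[log Q(c|x)] part of LI; H(c) is a constant.
    lam_cat and lam_cont are hypothetical names and values for the weight λ.
    """
    return (lam_cat * categorical_nll(logits, code_index)
            + lam_cont * gaussian_nll(mean, log_std, cont_code))


# Usage: an uninformative Q (uniform logits, unit Gaussians centred at 0).
loss = info_loss([0.0] * 10, 3, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```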

3. Experimental Results

3.1. Mutual Information Maximization

Lower bound LI over training iterations
  • In the above figure, the lower bound LI(G, Q) is quickly maximized to H(c) ≈ 2.30, which means the derived bound is tight and maximal mutual information is achieved.
  • On the other hand, the generator of regular GAN is not explicitly encouraged to maximize the mutual information with the latent codes. Hence there is little mutual information between latent codes and generated images in regular GAN.
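
The value H(c) ≈ 2.30 is what one would expect if the latent code is a single uniform categorical variable with 10 classes, as in the MNIST setup of the paper: its entropy in nats is ln 10. A quick check, assuming that setup:

```python
import math


def categorical_entropy(probs):
    """Entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


h_c = categorical_entropy([0.1] * 10)  # uniform over 10 categories
print(f"{h_c:.2f}")  # prints 2.30
```

Since the mutual information I(c; G(z, c)) can never exceed H(c), a lower bound that reaches H(c) cannot be improved further.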

3.2. Disentangled Representation

3.2.1. MNIST

Manipulating Latent Codes on MNIST
  • (a): The categorical code c1 ~ Cat(K=10, p=0.1) mostly captures digit type. Although InfoGAN is trained without any labels, c1 can be used as a classifier that achieves a 5% error rate on MNIST by matching each category in c1 to a digit type. In the second row of (a), a digit 7 is classified as a 9.
  • (b): For a regular GAN, changing the categorical code c1 has no clear meaning.
  • (c)-(d): Two continuous codes c2 and c3 are added to capture variations that are continuous in nature: c2, c3 ~ Unif(-1, 1). In the paper, c2 captures the rotation of digits and c3 captures their width.
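
The idea of using c1 as an unsupervised classifier can be sketched as follows. The review does not spell out the matching procedure, so this sketch assumes a simple majority vote: each category is mapped to the digit it most often co-occurs with, and the error rate is measured under that mapping. The function names and the tiny data set are hypothetical.

```python
from collections import Counter, defaultdict


def match_categories_to_digits(predicted_categories, true_digits):
    """Map each category of c1 to the digit it most often co-occurs with."""
    counts = defaultdict(Counter)
    for category, digit in zip(predicted_categories, true_digits):
        counts[category][digit] += 1
    return {category: counter.most_common(1)[0][0]
            for category, counter in counts.items()}


def error_rate(predicted_categories, true_digits, mapping):
    """Fraction of samples whose matched digit differs from the true digit."""
    wrong = sum(1 for category, digit in zip(predicted_categories, true_digits)
                if mapping[category] != digit)
    return wrong / len(true_digits)


# Hypothetical toy data: the argmax category from Q for 8 images and their labels.
categories = [0, 0, 0, 1, 1, 1, 2, 2]
digits = [7, 7, 9, 3, 3, 3, 9, 9]

mapping = match_categories_to_digits(categories, digits)  # {0: 7, 1: 3, 2: 9}
err = error_rate(categories, digits, mapping)             # one 9 falls in the "7" category
```

Note that the labels are only used after training, to name the categories and to measure the error; they play no role in learning c1.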

3.2.2. 3D Faces & 3D Chairs

Manipulating Latent Codes on 3D Faces
Manipulating Latent Codes on 3D Chairs

3.2.3. Street View House Number (SVHN)

Manipulating Latent Codes on SVHN
  • Four 10-dimensional categorical variables and two uniform continuous variables are used as latent codes.
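
A latent code of this shape can be sampled as in the sketch below. It assumes that each categorical variable is uniform and one-hot encoded, and that the continuous variables are drawn from Unif(-1, 1) as in the MNIST description; the function name and parameters are hypothetical. The resulting vector would be concatenated with the noise vector z before being fed to the generator.

```python
import random


def sample_latent_code(num_categorical=4, categorical_dim=10,
                       num_continuous=2, rng=random):
    """Sample c as concatenated one-hot categorical codes plus uniform continuous codes."""
    code = []
    for _ in range(num_categorical):
        k = rng.randrange(categorical_dim)  # uniform category index
        code.extend(1.0 if i == k else 0.0 for i in range(categorical_dim))
    # Assumed range for the continuous codes: Unif(-1, 1).
    code.extend(rng.uniform(-1.0, 1.0) for _ in range(num_continuous))
    return code


# Usage with a seeded generator for reproducibility.
code = sample_latent_code(rng=random.Random(0))
# len(code) is 4 * 10 + 2 = 42
```

Under the same assumptions, calling sample_latent_code(num_categorical=10, num_continuous=0) would give a code made only of ten 10-dimensional categorical variables.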

3.2.4. CelebA

Manipulating Latent Codes on CelebA
  • The latent code consists of 10 uniform categorical variables, each of dimension 10.
  • Surprisingly, even on this complicated dataset, InfoGAN can recover azimuth, as it does on the 3D images, even though no single face in this dataset appears in multiple poses.

Reference

[2016 NIPS] [InfoGAN]
Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Generative Adversarial Network (GAN)

Image Synthesis: 2014 [GAN] [CGAN] 2015 [LAPGAN] 2016 [AAE] [DCGAN] [CoGAN] [VAE-GAN] [InfoGAN] 2017 [SimGAN] [BiGAN] [ALI] [LSGAN] [EBGAN]
Image-to-image Translation: 2017 [Pix2Pix] [UNIT] [CycleGAN] 2018 [MUNIT]


PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG