Review — ALI: Adversarially Learned Inference (GAN)

Not Only Mapping from Latent Space to Data Space, But Also Mapping from Data Space to Latent Space, Outperforms DCGAN

Sik-Ho Tsang
5 min read · May 8, 2021

In this story, Adversarially Learned Inference (ALI), by Université de Montréal, Stanford University, New York University, and CIFAR, is briefly reviewed. In ALI:

  • The generation network maps samples from stochastic latent variables to the data space.
  • The inference network maps training examples in data space to the space of latent variables.
  • The discriminative network is trained to distinguish between joint latent/data-space samples from the generative network and joint samples from the inference network.

This is a paper in 2017 ICLR with over 1000 citations. (Sik-Ho Tsang @ Medium)

The idea is the same as BiGAN, but the two were proposed independently and published at the same conference (2017 ICLR). Some papers cite ALI and BiGAN together when discussing this idea.

Outline

  1. ALI: Overall Structure
  2. Experimental Results

1. ALI: Overall Structure

The adversarially learned inference (ALI) game
  • Similar to BiGAN, an adversarial game is played to match the two joint distributions, as shown above.
  • Gz is the inference network in ALI; we can treat it as the encoder.
  • Gx is the generation network in ALI; we can treat it as the decoder.

Joint pairs (x, z) are drawn either from q(x, z) or p(x, z), and a discriminator network learns to discriminate between the two, while the encoder and decoder networks are trained to fool the discriminator.
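In equation form, and writing the (possibly stochastic) mappings as if they were deterministic for brevity, the minimax objective that matches the two joint distributions is:

```latex
\min_{G_x, G_z} \max_{D} \; V(D, G_x, G_z) =
  \mathbb{E}_{q(x)}\big[\log D\big(x, G_z(x)\big)\big]
  + \mathbb{E}_{p(z)}\big[\log\big(1 - D\big(G_x(z), z\big)\big)\big]
```

The discriminator D is pushed to output 1 on encoder pairs (x, Gz(x)) and 0 on decoder pairs (Gx(z), z), while Gx and Gz are trained to make the two joints indistinguishable.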

The bidirectional GAN
  • If we treat Gz and Gx in ALI as the encoder E and the decoder (generator) G respectively, it is a bidirectional GAN (BiGAN).

Unlike the GAN, where the discriminator sees only x as input, in BiGAN/ALI, D sees both x and z, i.e., the observation and its latent representation together.

For a true sample, x is given (it is taken from the training set) and the corresponding z is generated by the encoder E.

For a fake sample, z is given (it is sampled from p(z)) and its corresponding x is generated by the generator G.

  • The encoder E is also implemented as a deep neural network, and (as in an autoencoder) its architecture is usually taken as the inverse of G.
  • It is trained just like the generator, namely by back-propagating from the loss function defined at the output of the discriminator.
  • ⊕ is the vector concatenation operation, used to concatenate the flattened x and the latent vector z before they are fed into the discriminator.

Once training is complete, just like we can use the generator to predict x for new z, we can use the encoder to predict z for any x.
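Below is a minimal PyTorch-style sketch of one ALI/BiGAN training step, under simplifying assumptions: toy fully-connected networks, a deterministic encoder, and made-up dimensions (the paper uses convolutional architectures and a stochastic Gz).

```python
import torch
import torch.nn as nn

x_dim, z_dim = 784, 64  # hypothetical sizes, e.g. flattened 28x28 images

# Decoder / generator Gx: z -> x, and encoder / inference network Gz: x -> z.
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
E = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))

# Discriminator sees the concatenated pair (x, z), as described above.
D = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_ge = torch.optim.Adam(list(G.parameters()) + list(E.parameters()), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def ali_step(x_real):
    """One adversarial step on a batch of real images x_real."""
    batch = x_real.size(0)
    x_real = x_real.view(batch, -1)       # flatten x before concatenation
    z_prior = torch.randn(batch, z_dim)   # z ~ p(z)

    z_enc = E(x_real)                     # joint pair from q(x, z): (x, E(x))
    x_gen = G(z_prior)                    # joint pair from p(x, z): (G(z), z)

    # Discriminator update: tell the two joint distributions apart.
    d_real = D(torch.cat([x_real, z_enc.detach()], dim=1))
    d_fake = D(torch.cat([x_gen.detach(), z_prior], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Encoder/decoder update: fool the discriminator (labels flipped).
    d_real = D(torch.cat([x_real, E(x_real)], dim=1))
    d_fake = D(torch.cat([G(z_prior), z_prior], dim=1))
    loss_ge = bce(d_real, torch.zeros_like(d_real)) + bce(d_fake, torch.ones_like(d_fake))
    opt_ge.zero_grad(); loss_ge.backward(); opt_ge.step()

# After training: infer z for any x, or reconstruct x_hat = G(E(x)).
x = torch.randn(8, x_dim)   # stand-in for a batch of real images
z_hat = E(x)
x_hat = G(z_hat)
```

Note that there is no explicit reconstruction loss anywhere; the reconstructions shown in the next section come purely from composing the two learned mappings.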

2. Experimental Results

2.1. Samples and Reconstruction

  • Below are (a) samples generated by ALI and (b) reconstructions by ALI, for different datasets.
  • For the reconstructions in (b), odd columns are original samples from the validation set and even columns are corresponding reconstructions.
Samples and reconstructions on the SVHN dataset.
Samples and reconstructions on the CelebA dataset.
Samples and reconstructions on the CIFAR10 dataset.
Samples and reconstructions on the Tiny ImageNet dataset.
  • We observe that reconstructions are not always faithful reproductions of the inputs. They retain the crispness and quality characteristic of adversarially trained models, but oftentimes make mistakes in capturing exact object placement, color, style and (in extreme cases) object identity.
  • Note that the ALI training objective does not involve an explicit reconstruction loss.

2.2. Latent Space Interpolations

Latent space interpolations on the CelebA validation set.
  • By linearly interpolating between z1 and z2 and passing the intermediary points through the decoder, the input-space interpolations shown above are generated (a short sketch follows this list).
  • Smooth transitions are observed between pairs of examples, and intermediary images remain believable.
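A short sketch of that interpolation, reusing the hypothetical E, G and x_dim from the earlier training sketch:

```python
import torch

# Stand-ins for two validation images; in practice these come from the dataset.
x1, x2 = torch.randn(1, x_dim), torch.randn(1, x_dim)
z1, z2 = E(x1), E(x2)                                   # infer their latent codes

alphas = torch.linspace(0.0, 1.0, steps=10)
frames = [G((1 - a) * z1 + a * z2) for a in alphas]     # decode each intermediate code
```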

2.3. Semi-Supervised Learning

SVHN test set misclassification rate.
  • Using ALI’s inference network to extract features and then training an SVM to predict the class from those features, a misclassification rate roughly 3 percentage points lower than DCGAN’s is achieved.
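A rough sketch of this evaluation pipeline, assuming the trained inference network E from the earlier sketch and using scikit-learn’s LinearSVC as the linear SVM (the paper’s exact feature extraction and SVM setup differ in detail):

```python
import torch
from sklearn.svm import LinearSVC

def extract_features(images):
    """Hypothetical helper: run the trained inference network E on a batch of
    flattened images and return the latent codes as a NumPy array."""
    with torch.no_grad():
        return E(torch.as_tensor(images, dtype=torch.float32)).numpy()

# X_train / y_train: the small labeled subset (e.g. 1000 labeled SVHN digits);
# X_test / y_test: the held-out test set. All four are stand-in names here.
clf = LinearSVC(C=1.0)                                   # L2-regularized linear SVM
clf.fit(extract_features(X_train), y_train)
misclassification_rate = 1.0 - clf.score(extract_features(X_test), y_test)
```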
CIFAR10 test set misclassification rate for semi-supervised learning using different numbers of labeled training examples.
  • ALI’s performance is investigated as well when label information is taken into account during training.
  • The discriminator takes x and z as input and outputs a distribution over K+1 classes, where K is the number of categories (a rough sketch follows this list).
  • The above table shows that ALI offers a modest improvement over Salimans et al. (2016), more specifically for 1000 and 2000 labeled examples.
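A rough sketch of this K+1-way discriminator head, reusing E, G, x_dim and z_dim from the earlier sketch; the exact losses follow Salimans et al. (2016) and may differ in detail from what is written here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 10  # number of real classes (e.g. CIFAR10); index K is the "generated" class

# Discriminator over the joint (x, z) pair, now with K+1 outputs instead of 1.
D_semi = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, K + 1))

def semi_supervised_losses(x_labeled, y, x_unlabeled, z_prior):
    """Labeled pairs get a K-way cross-entropy; unlabeled real pairs should land
    in classes 0..K-1 and generated pairs in the extra class K."""
    logits_lab = D_semi(torch.cat([x_labeled, E(x_labeled)], dim=1))
    loss_lab = F.cross_entropy(logits_lab, y)

    logits_unl = D_semi(torch.cat([x_unlabeled, E(x_unlabeled)], dim=1))
    logits_gen = D_semi(torch.cat([G(z_prior), z_prior], dim=1))
    p_fake_unl = F.softmax(logits_unl, dim=1)[:, K]      # probability of the fake class
    p_fake_gen = F.softmax(logits_gen, dim=1)[:, K]
    loss_unl = -torch.log(1 - p_fake_unl + 1e-8).mean() - torch.log(p_fake_gen + 1e-8).mean()
    return loss_lab + loss_unl
```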

It is conjectured that the latent representation learned by ALI is better untangled with respect to the classification task and that it generalizes better.

2.4. Conditional Generation

Conditional generation sequence.
  • ALI is extended to match a conditional distribution, where y represents a fully observed conditioning variable, e.g., attributes in the CelebA dataset.
  • We can treat this as ALI+CGAN (a rough conditioning sketch follows this list).
  • (I) to (IV): A single fixed latent code z is sampled.
  • (a) to (l): Attributes are then varied uniformly over rows across all columns in the following sequence: (b) black hair; (c) brown hair; (d) blond hair; (e) black hair, wavy hair; (f) blond hair, bangs; (g) blond hair, receding hairline; (h) blond hair, balding; (i) black hair, smiling; (j) black hair, smiling, mouth slightly open; (k) black hair, smiling, mouth slightly open, eyeglasses; (l) black hair, smiling, mouth slightly open, eyeglasses, wearing hat.
  • (Since ALI is similar to BiGAN, I don’t describe much. If interested, please feel free to read the paper for more details.)
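A rough sketch of the conditional setup, under the simplifying assumption that the attribute vector y is simply concatenated to the inputs of the encoder, decoder, and discriminator (the paper embeds y into each network; x_dim and z_dim are reused from the earlier sketch):

```python
import torch
import torch.nn as nn

y_dim = 40  # hypothetical number of binary attributes (e.g. CelebA annotations)

# Every network now also receives the conditioning variable y.
G_cond = nn.Sequential(nn.Linear(z_dim + y_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
E_cond = nn.Sequential(nn.Linear(x_dim + y_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
D_cond = nn.Sequential(nn.Linear(x_dim + z_dim + y_dim, 256), nn.ReLU(), nn.Linear(256, 1))

# Conditional generation as in the figure: fix one latent code z, vary the attributes y.
z = torch.randn(1, z_dim)
y_attr = torch.zeros(1, y_dim)
y_attr[0, 8] = 1.0                       # hypothetical index of the "black hair" attribute
x_generated = G_cond(torch.cat([z, y_attr], dim=1))
```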

Reference

[2017 ICLR] [ALI]
Adversarially Learned Inference

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [AAE] [DCGAN] [CoGAN] [SimGAN] [BiGAN] [ALI]
Image-to-image Translation [Pix2Pix] [UNIT]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]
