Review — VAE-GAN: Autoencoding beyond pixels using a learned similarity metric

VAE-GAN: Combining VAE with GAN

Aug 8, 2021

**VAE-GAN** (Figure from UU-Nets Connecting Discriminator and Generator for Image to Image Translation)

In this story, Autoencoding beyond pixels using a learned similarity metric, (VAE-GAN), by Technical University of Denmark, University of Copenhagen, and Twitter, is briefly reviewed. In this paper:

A variational autoencoder (VAE) is combined with a generative adversarial network (GAN).
Thus, element-wise errors are replaced with feature-wise errors to better capture the data distribution.

This is a 2016 ICML paper with over 1300 citations. (Sik-Ho Tsang @ Medium)

Outline

VAE-GAN
Experimental Results

1. VAE-GAN

A VAE is combined with a GAN by collapsing the decoder and the generator into one.
A VAE consists of two networks: an encoder Enc that maps a data sample x to a latent representation z, and a decoder Dec that maps the latent representation back to data space.
First, a mini-batch X is randomly sampled from the dataset.
X is then fed into the encoder Enc of the VAE to obtain Z.
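
In symbols, the encoding and decoding steps can be summarized as below. This is a sketch in standard VAE notation: q(z | x) and p(x | z) are names assumed here for the encoder and decoder distributions, and the tilde marks a reconstruction (written ~X in this story).

```latex
\begin{align}
z &\sim \mathrm{Enc}(x) = q(z \mid x), \\
\tilde{x} &\sim \mathrm{Dec}(z) = p(x \mid z).
\end{align}
```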

Then, the prior loss L_prior is calculated as the Kullback-Leibler (KL) divergence D_KL between the encoder's distribution over Z and the prior p(z).
Dec is used to reconstruct ~X from Z.
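
To make the prior term concrete, here is a minimal sketch of the KL divergence in closed form. It assumes the common VAE setup in which Enc outputs a mean and a log-variance for a diagonal Gaussian and the prior is N(0, I). The function names, and the choice to average over the mini-batch, are illustrative assumptions rather than details taken from the paper.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )

def prior_loss(mus, log_vars):
    """Average the per-sample KL over a mini-batch (assumed reduction)."""
    kls = [kl_to_standard_normal(m, lv) for m, lv in zip(mus, log_vars)]
    return sum(kls) / len(kls)

# A code that already matches the prior costs nothing ...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# ... while shifting the mean away from zero is penalized.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```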

The VAE reconstruction (expected log-likelihood) error term is replaced with a reconstruction error expressed in the GAN discriminator, denoted L_llike^{Dis_l}. To achieve this, let Dis_l(x) denote the hidden representation of the l-th layer of the discriminator. A Gaussian observation model is then introduced for Dis_l(x), with mean Dis_l(~x) and identity covariance.
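
With a Gaussian observation model and identity covariance, the negative log-likelihood reduces to half the squared distance between the two feature vectors, plus a constant. The sketch below shows that reduction on plain Python lists; the feature values are made up, and in practice they would be the l-th layer activations of the discriminator.

```python
def feature_recon_loss(feat_x, feat_x_tilde):
    """Negative log-likelihood of Dis_l(x) under N(Dis_l(~x), I),
    dropping the additive constant that does not depend on the networks."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(feat_x, feat_x_tilde))

# Hypothetical layer-l activations for an image and its reconstruction.
feat_real = [0.5, -1.0, 2.0]
feat_recon = [0.5, 0.0, 1.0]
print(feature_recon_loss(feat_real, feat_recon))  # 1.0
```

The error is measured between discriminator features rather than between pixels, which is what makes it feature-wise instead of element-wise.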

In addition, Zp is sampled from the prior N(0,I) and decoded by Dec to generate Xp.

Gen = Dec

Since both Dec and Gen map from z to x, their parameters are shared.

The GAN adversarial loss L_GAN consists of three terms, because the discriminator must tell real samples X apart from two kinds of fake samples: reconstructions ~X produced by the VAE and samples Xp decoded from random latent vectors.
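
The three terms can be sketched as the usual GAN log-likelihood with one extra fake source. In this illustration the discriminator outputs are assumed to be probabilities in (0, 1) that an input is real, and each term is averaged over its mini-batch; the function name and the averaging are assumptions made for the example.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def gan_objective(d_real, d_recon, d_prior):
    """log Dis(X) + log(1 - Dis(~X)) + log(1 - Dis(Xp)), averaged per batch.

    d_real:  discriminator outputs on real samples X
    d_recon: discriminator outputs on VAE reconstructions ~X
    d_prior: discriminator outputs on samples Xp decoded from random Zp
    """
    real_term = _mean([math.log(d) for d in d_real])
    recon_term = _mean([math.log(1.0 - d) for d in d_recon])
    prior_term = _mean([math.log(1.0 - d) for d in d_prior])
    return real_term + recon_term + prior_term

# A discriminator that separates real from fake scores higher
# than one that outputs 0.5 for everything.
confident = gan_objective([0.9, 0.8], [0.1, 0.2], [0.1, 0.2])
unsure = gan_objective([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
print(confident > unsure)  # True
```

The discriminator is trained to make this quantity large, while the decoder is trained to make it small. Implementations often minimize its negative, the binary cross-entropy, instead.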

Finally, gradient updates are performed, with each of the three networks (Enc, Dec, Dis) updated using its own combination of the losses.
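
As a rough sketch of how the losses are split across the networks, the function below follows one reading of the paper's training algorithm. It treats `l_gan` as the discriminator's cross-entropy loss (a quantity Dis minimizes) and uses a weight `gamma` to trade off reconstruction against fooling the discriminator. The function name and the default `gamma=1.0` are hypothetical choices for illustration.

```python
def network_losses(l_prior, l_llike, l_gan, gamma=1.0):
    """Return the scalar that each network minimizes, keyed by network name.

    l_prior: KL term on the latent codes
    l_llike: feature-wise reconstruction error from the discriminator
    l_gan:   discriminator's adversarial loss (assumed sign: Dis minimizes it)
    gamma:   weight on reconstruction for the decoder (hypothetical default)
    """
    return {
        "Enc": l_prior + l_llike,        # encoder: match prior and reconstruct
        "Dec": gamma * l_llike - l_gan,  # decoder: reconstruct and fool Dis
        "Dis": l_gan,                    # discriminator: separate real and fake
    }

# Made-up loss values, only to show the call.
losses = network_losses(l_prior=0.8, l_llike=1.5, l_gan=2.0, gamma=0.5)
for name, value in losses.items():
    print(name, value)
```

In an autodiff framework, each of these scalars would be backpropagated only into the parameters of its own network, so the adversarial signal does not reach Enc.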

**Architectures for the three networks that comprise VAE-GAN**

2. Experimental Results

2.1. CelebA Face Images

**Samples from different generative models**

After training, samples are drawn from p(z) and are then propagated through Dec to generate new images as above.
The plain VAE is only able to draw the frontal part of the face sharply, but off-center the images get blurry.
In comparison, VAE/GAN and pure GAN produce sharper images with more natural textures and face parts.

2.2. Visual Attribute Vectors

**Using the VAE-GAN model to reconstruct dataset samples with visual attribute vectors added to their latent representations**

For each attribute, two mean vectors are computed in latent space: one over images with the attribute and one over images without it.
Then the visual attribute vector is computed as the difference between the two mean vectors.
The idea is to find directions in the latent space corresponding to specific visual features in image space.
Though not perfect, it can be seen that the attribute vectors capture semantic concepts like eyeglasses, bangs, etc.
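
The attribute-vector procedure above can be sketched in a few lines. The example uses toy 2-D latent codes in plain Python lists; the function names and the `strength` parameter are hypothetical, and in practice the latent codes would come from Enc and the edited code would be passed through Dec.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attribute_vector(latents, has_attribute):
    """Mean latent code with the attribute minus mean latent code without it."""
    with_attr = [z for z, flag in zip(latents, has_attribute) if flag]
    without_attr = [z for z, flag in zip(latents, has_attribute) if not flag]
    mean_with = mean_vector(with_attr)
    mean_without = mean_vector(without_attr)
    return [a - b for a, b in zip(mean_with, mean_without)]

def add_attribute(z, attr_vec, strength=1.0):
    """Move a latent code along the attribute direction."""
    return [zi + strength * ai for zi, ai in zip(z, attr_vec)]

# Toy latent codes; flags mark images that have the attribute (e.g. eyeglasses).
latents = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
flags = [True, True, False, False]
v = attribute_vector(latents, flags)
print(v)                             # [2.0, -2.0]
print(add_attribute([0.0, 2.0], v))  # [2.0, 0.0]
```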

2.3. Unsupervised Pretraining for Supervised Tasks

VAE-GAN is used in a semi-supervised setup by unsupervised pretraining followed by finetuning using a small number of labeled examples.
However, the paper notes that this approach does not reach results competitive with the state of the art.
(There are still other results shown in the paper. Please feel free to read the paper if interested.)

Reference

[2016 ICML] [VAE-GAN]
Autoencoding beyond pixels using a learned similarity metric
