Review — VAE-GAN: Autoencoding beyond pixels using a learned similarity metric

VAE-GAN: Combining VAE with GAN

Sik-Ho Tsang
4 min read · Aug 8, 2021
VAE-GAN (Figure from UU-Nets Connecting Discriminator and Generator for Image to Image Translation)

In this story, Autoencoding beyond pixels using a learned similarity metric (VAE-GAN), by Technical University of Denmark, University of Copenhagen, and Twitter, is briefly reviewed. In this paper:

  • A variational autoencoder (VAE) is combined with a generative adversarial network (GAN).
  • The element-wise reconstruction error of the VAE is replaced with a feature-wise error measured in the hidden representations of the GAN discriminator, to better capture the data distribution.

This is a paper in 2016 ICML with over 1300 citations. (Sik-Ho Tsang @ Medium)


Outline

  1. VAE-GAN
  2. Experimental Results

1. VAE-GAN

VAE-GAN Framework Overview
  • A VAE is combined with a GAN by collapsing the decoder and the generator into one.
  • A VAE consists of two networks that encode a data sample x to a latent representation z and decode the latent representation back to data space:
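In the paper's notation (the symbols q and p are not spelled out in this review), the two networks can be written roughly as follows, where q(z|x) is the encoder distribution and p(x|z) is the decoder distribution:

```latex
z \sim \mathrm{Enc}(x) = q(z \mid x), \qquad \tilde{x} \sim \mathrm{Dec}(z) = p(x \mid z)
```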
VAE-GAN Training Procedures
  • First, a random mini-batch X is sampled from the dataset.
  • X is fed into the encoder Enc of the VAE to get the latent codes Z.
  • Then, the prior regularization term is calculated as L_prior = D_KL(q(Z|X) || p(Z)), where q(Z|X) is the encoder distribution, p(Z) is the prior, and D_KL is the Kullback-Leibler (KL) divergence.
  • Dec is then used to reconstruct ~X from Z.
  • The VAE reconstruction (expected log-likelihood) error term is replaced with a reconstruction error expressed in the GAN discriminator, L_llike^{Dis_l}. To achieve this, let Dis_l(x) denote the hidden representation of the l-th layer of the discriminator. A Gaussian observation model for Dis_l(X), with mean Dis_l(~X) and identity covariance, is introduced: p(Dis_l(X) | Z) = N(Dis_l(X) | Dis_l(~X), I).
  • Zp, sampled from the prior N(0, I), is also decoded by Dec to generate Xp.
Gen = Dec
  • Since both Dec and Gen map from z to x, we share the parameters between the two.
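The two VAE-side terms from the steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the encoder is assumed to output the mean and log-variance of a diagonal Gaussian, the prior is N(0, I), and the function names and toy numbers are hypothetical, not taken from the paper.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one sample.

    Assumes a diagonal-Gaussian encoder output (an assumption of this sketch).
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def feature_recon_error(feat_real, feat_recon):
    """Feature-wise reconstruction error between Dis_l(x) and Dis_l(~x).

    This is the negative log-likelihood of a Gaussian with identity covariance,
    with the additive constant dropped: 0.5 * squared Euclidean distance.
    """
    return 0.5 * sum((a - b) ** 2 for a, b in zip(feat_real, feat_recon))

# Toy values, purely illustrative.
l_prior = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
l_llike = feature_recon_error([1.0, 2.0], [1.0, 0.0])
print(l_prior, l_llike)
```

In a real model the feature vectors would come from an intermediate discriminator layer rather than being typed in by hand; the point is only that the error is measured between features, not between pixels.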

The GAN adversarial loss L_GAN consists of 3 terms: one for the real samples X, one for the reconstructions ~X produced by the VAE, and one for the samples Xp decoded from random latent vectors. Thus, the discriminator needs to identify X as real, and both ~X and Xp as fake.
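As a rough sketch, the three terms can be written as a binary cross-entropy that the discriminator tries to minimize. The sign convention here is an assumption of this sketch (the log terms are negated so that lower is better for the discriminator), and the function name, the `eps` guard, and the toy probabilities are hypothetical.

```python
import math

def gan_loss(d_real, d_recon, d_prior, eps=1e-12):
    """Batch mean of -[log D(x) + log(1 - D(~x)) + log(1 - D(x_p))].

    Each argument is a list of discriminator outputs in (0, 1):
    d_real for real samples X, d_recon for reconstructions ~X,
    d_prior for samples Xp decoded from random latent vectors.
    """
    total = 0.0
    for r, f1, f2 in zip(d_real, d_recon, d_prior):
        total -= math.log(r + eps) + math.log(1.0 - f1 + eps) + math.log(1.0 - f2 + eps)
    return total / len(d_real)

# A discriminator that cannot tell anything apart outputs 0.5 everywhere.
undecided = gan_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
# A discriminator that separates real from fake well gets a much lower loss.
confident = gan_loss([0.99, 0.98], [0.02, 0.01], [0.03, 0.01])
print(undecided, confident)
```

With all outputs at 0.5 the loss is 3·log 2, one log 2 per term, which makes it easy to see that the two kinds of fake samples each contribute their own term.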

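  • Finally, gradient updates are performed. Each network is updated with its own combination of the losses: Enc with the prior and reconstruction terms, Dec with the reconstruction and GAN terms, and Dis with the GAN term alone.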
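A minimal sketch of how the loss terms could be combined per network, assuming the GAN term is expressed as a cross-entropy that the discriminator minimizes. The weighting factor `gamma`, which the paper uses to balance reconstruction against fooling the discriminator, is not discussed in this review; its default value here, like the function name, is a hypothetical placeholder.

```python
def per_network_losses(l_prior, l_llike, l_gan, gamma=1.0):
    """Return the scalar objective each network would minimize.

    l_prior: KL term, l_llike: feature-wise reconstruction error,
    l_gan:   discriminator cross-entropy (assumed: lower is better for Dis).
    """
    return {
        # Encoder: stay close to the prior and reconstruct well.
        "enc": l_prior + l_llike,
        # Decoder/generator: reconstruct well AND fool the discriminator,
        # hence the GAN term enters with a minus sign.
        "dec": gamma * l_llike - l_gan,
        # Discriminator: only the adversarial term.
        "dis": l_gan,
    }

losses = per_network_losses(l_prior=0.5, l_llike=2.0, l_gan=1.0, gamma=0.1)
print(losses)
```

In an actual training loop each entry would be differentiated only with respect to the parameters of its own network, which is the sense in which each part has its own loss combination.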
Architectures for the three networks that comprise VAE-GAN

2. Experimental Results

2.1. CelebA Face Images

Samples from different generative models
  • After training, samples are drawn from p(z) and are then propagated through Dec to generate new images as above.
  • The plain VAE is only able to draw the frontal part of the face sharply, but off-center the images get blurry.
  • In comparison, VAE/GAN and pure GAN produce sharper images with more natural textures and face parts.

2.2. Visual Attribute Vectors

Using the VAE-GAN model to reconstruct dataset samples with visual attribute vectors added to their latent representations
  • For each attribute, the mean latent vector is computed over the images with the attribute, and another mean latent vector over the images without it.
  • Then the visual attribute vector is computed as the difference between the two mean vectors.
  • The idea is to find directions in the latent space corresponding to specific visual features in image space.
  • Though not perfect, it can be seen that the attribute vectors capture semantic concepts like eyeglasses, bangs, etc.
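The attribute-vector computation described above amounts to a difference of two means. Here is a small sketch in plain Python; the latent codes, the `has_glasses` labels, and the function names are made up for illustration, and real latent codes would come from Enc.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attribute_vector(latents, has_attribute):
    """Mean latent of images with the attribute minus mean latent without it."""
    with_attr = [z for z, flag in zip(latents, has_attribute) if flag]
    without_attr = [z for z, flag in zip(latents, has_attribute) if not flag]
    mean_with = mean_vector(with_attr)
    mean_without = mean_vector(without_attr)
    return [a - b for a, b in zip(mean_with, mean_without)]

# Toy 2-D latent codes (hypothetical).
latents = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
has_glasses = [True, True, False, False]

glasses_vec = attribute_vector(latents, has_glasses)
# Add the attribute direction to the latent code of an image without glasses.
z_edited = [z + v for z, v in zip(latents[2], glasses_vec)]
print(glasses_vec, z_edited)
```

Decoding `z_edited` with Dec would then give the reconstruction with the attribute added, which is what the figure above illustrates.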

2.3. Unsupervised Pretraining for Supervised Tasks

  • VAE-GAN is used in a semi-supervised setup by unsupervised pretraining followed by finetuning using a small number of labeled examples.
  • However, the authors mention that the results are not competitive with the state of the art.
  • (There are still other results shown in the paper. Please feel free to read the paper if interested.)


[2016 ICML] [VAE-GAN]
Autoencoding beyond pixels using a learned similarity metric

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [AAE] [DCGAN] [CoGAN] [VAE-GAN] [SimGAN] [BiGAN] [ALI] [LSGAN] [EBGAN]
Image-to-image Translation [Pix2Pix] [UNIT] [CycleGAN] [MUNIT]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]
