Review — VAE-GAN: Autoencoding beyond pixels using a learned similarity metric
VAE-GAN: Combining VAE with GAN
In this story, Autoencoding beyond pixels using a learned similarity metric (VAE-GAN), by Technical University of Denmark, University of Copenhagen, and Twitter, is briefly reviewed. In this paper:
- A variational autoencoder (VAE) is combined with a generative adversarial network (GAN).
- Element-wise reconstruction errors are replaced with feature-wise errors measured in the GAN discriminator, to better capture the data distribution.
This is a 2016 ICML paper with over 1300 citations. (Sik-Ho Tsang @ Medium)
Outline
- VAE-GAN
- Experimental Results
1. VAE-GAN
- A VAE is combined with a GAN by collapsing the decoder and the generator into one.
- A VAE consists of two networks that encode a data sample $x$ to a latent representation $z$ and decode the latent representation back to data space:
$$z \sim \mathrm{Enc}(x) = q(z|x), \qquad \tilde{x} \sim \mathrm{Dec}(z) = p(x|z)$$
- First, sample a random mini-batch $X$ from the dataset.
- Feed $X$ into the VAE encoder Enc to get the latent codes $Z$.
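- Then, the prior regularization term $L_{prior}$ can be calculated:
$$L_{prior} = D_{KL}\big(q(Z|X) \,\|\, p(Z)\big)$$
- where $D_{KL}$ is the Kullback-Leibler (KL) divergence, $q(Z|X)$ is the encoder's distribution over latent codes, and $p(Z) = N(0, I)$ is the prior.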
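The KL term has a closed form when the encoder outputs a diagonal Gaussian, which is the usual VAE choice and is assumed here rather than taken from the review. The sketch below is a minimal pure-Python illustration; the function names and the `log_var` parameterization are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector.

    Assumes a diagonal Gaussian posterior (an assumption of this sketch).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

def prior_loss(mus, log_vars):
    """Average the per-sample KL term over a mini-batch."""
    kls = [kl_to_standard_normal(m, lv) for m, lv in zip(mus, log_vars)]
    return sum(kls) / len(kls)

# A posterior equal to the prior costs nothing; shifting the mean is penalized.
print(prior_loss([[0.0, 0.0]], [[0.0, 0.0]]))  # 0.0
print(prior_loss([[1.0, 0.0]], [[0.0, 0.0]]))  # 0.5
```

The term is zero only when the encoder's distribution matches the prior exactly, so it pulls the latent codes toward N(0, I).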
- Dec then decodes $Z$ into the reconstruction $\tilde{X}$.
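- The VAE reconstruction (expected log-likelihood) error term is replaced by a reconstruction error expressed in the GAN discriminator, $L^{Dis_l}_{llike}$. To achieve this, let $Dis_l(x)$ denote the hidden representation of the $l$-th layer of the discriminator. A Gaussian observation model for $Dis_l(x)$ with mean $Dis_l(\tilde{x})$ and identity covariance is introduced:
$$p(Dis_l(x)|z) = N\big(Dis_l(x) \,|\, Dis_l(\tilde{x}), I\big), \qquad L^{Dis_l}_{llike} = -E_{q(z|x)}\big[\log p(Dis_l(x)|z)\big]$$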
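With identity covariance, the negative Gaussian log-likelihood reduces (up to an additive constant) to half the squared distance between the two feature vectors. The sketch below illustrates why a feature-wise error can differ from a pixel-wise one. `toy_dis_l` is a hypothetical stand-in for a discriminator layer, not anything from the paper.

```python
def feature_llike_loss(feat_x, feat_x_tilde):
    """0.5 * ||Dis_l(x) - Dis_l(x_tilde)||^2, i.e. the negative Gaussian
    log-likelihood with identity covariance, dropping the constant term."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(feat_x, feat_x_tilde))

def toy_dis_l(x):
    """Hypothetical l-th layer: averages each pair of neighbouring pixels."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]

x = [1.0, 0.0, 0.0, 1.0]
x_tilde = [0.0, 1.0, 1.0, 0.0]  # every pixel differs from x

# Element-wise error: large, because each pixel is off by 1.
pixel_loss = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_tilde))
# Feature-wise error: zero, because the toy features are identical.
feat_loss = feature_llike_loss(toy_dis_l(x), toy_dis_l(x_tilde))
print(pixel_loss, feat_loss)  # 2.0 0.0
```

In the real model the features come from a trained discriminator, so which differences are ignored and which are penalized is learned rather than hand-picked as in this toy.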
- $Z_p$, sampled from the prior $N(0, I)$, is also decoded by Dec to generate $X_p$.
- Since both Dec and Gen map from z to x, we share the parameters between the two.
- The GAN adversarial loss $L_{GAN}$ consists of 3 terms, since the discriminator needs to identify the real samples $X$, the fake samples $\tilde{X}$ reconstructed by the VAE, and the fake samples $X_p$ generated from random latent vectors:
$$L_{GAN} = \log(Dis(X)) + \log(1 - Dis(\tilde{X})) + \log(1 - Dis(X_p))$$
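The three-term loss can be sketched numerically as follows. The sketch assumes that Dis outputs the probability that its input is real and that each term is averaged over the mini-batch. Both choices, and the function name, are assumptions of this illustration.

```python
import math

def gan_loss(d_real, d_recon, d_prior):
    """log Dis(X) + log(1 - Dis(X~)) + log(1 - Dis(X_p)), batch-averaged.

    Each argument is a list of discriminator outputs in (0, 1) for
    real samples, VAE reconstructions, and samples decoded from the prior.
    """
    def mean(values):
        return sum(values) / len(values)

    return (
        mean([math.log(d) for d in d_real])
        + mean([math.log(1.0 - d) for d in d_recon])
        + mean([math.log(1.0 - d) for d in d_prior])
    )

# A discriminator that cannot tell anything apart outputs 0.5 everywhere.
undecided = gan_loss([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
# A discriminator that separates real from both kinds of fake samples.
confident = gan_loss([0.9, 0.9], [0.1, 0.1], [0.1, 0.1])
print(undecided < confident)  # True
```

Under these assumptions a better discriminator makes the quantity larger, while a decoder that produces more convincing reconstructions and samples makes it smaller.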
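- Finally, gradient updates are performed. Each network has its own loss combination: the encoder parameters are updated with the gradient of $L_{prior} + L^{Dis_l}_{llike}$, the decoder parameters with that of $\gamma L^{Dis_l}_{llike} - L_{GAN}$, and the discriminator parameters with that of $L_{GAN}$ alone, where $\gamma$ weights reconstruction quality against fooling the discriminator.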
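The per-network loss combinations can be written as a small helper. The sketch assumes the combinations given in the paper's training algorithm: prior plus feature-wise reconstruction for the encoder, a γ-weighted reconstruction term minus the GAN loss for the decoder, and the GAN loss alone for the discriminator. The function name and the example numbers are hypothetical, and it only combines scalar loss values without performing any gradient step.

```python
def vaegan_objectives(l_prior, l_llike, l_gan, gamma):
    """Combine the three loss terms into one objective per network.

    gamma (an illustrative value is used below) trades reconstruction
    quality against the adversarial term for the decoder.
    """
    return {
        "enc": l_prior + l_llike,          # encoder: prior + reconstruction
        "dec": gamma * l_llike - l_gan,    # decoder: reconstruction vs. GAN
        "dis": l_gan,                      # discriminator: GAN term only
    }

objectives = vaegan_objectives(l_prior=2.0, l_llike=4.0, l_gan=1.5, gamma=0.5)
print(objectives)  # {'enc': 6.0, 'dec': 0.5, 'dis': 1.5}
```

The GAN term enters the decoder and discriminator objectives with opposite signs, which is what makes the two networks adversaries. The encoder never sees the GAN term directly.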
2. Experimental Results
2.1. CelebA Face Images
- After training, samples are drawn from p(z) and then propagated through Dec to generate new images.
- The plain VAE is only able to draw the frontal part of the face sharply; off-center, the images get blurry.
- In comparison, VAE-GAN and pure GAN produce sharper images with more natural textures and face parts.
2.2. Visual Attribute Vectors
- For each attribute, one mean latent vector is computed over the images with the attribute, and another over the images without it.
- Then the visual attribute vector is computed as the difference between the two mean vectors.
- The idea is to find directions in the latent space corresponding to specific visual features in image space.
- Though not perfect, it can be seen that the attribute vectors capture semantic concepts like eyeglasses, bangs, etc.
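The attribute-vector computation described above can be sketched as follows, assuming the latent codes are already available as plain lists of floats. The tiny 2-D latents and the `has_glasses` labels are made up for illustration.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def attribute_vector(latents, has_attribute):
    """Mean latent of images with the attribute minus mean latent without."""
    with_attr = [z for z, flag in zip(latents, has_attribute) if flag]
    without_attr = [z for z, flag in zip(latents, has_attribute) if not flag]
    mu_with = mean_vector(with_attr)
    mu_without = mean_vector(without_attr)
    return [a - b for a, b in zip(mu_with, mu_without)]

# Hypothetical 2-D latent codes and binary labels for one attribute.
latents = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
has_glasses = [True, True, False, False]

v = attribute_vector(latents, has_glasses)
print(v)  # [2.0, -3.0]

# Moving a code along the direction before decoding it with Dec.
z_edited = [z_i + v_i for z_i, v_i in zip(latents[2], v)]
print(z_edited)  # [2.0, -1.0]
```

One natural use of such a direction is to add it to the latent code of an image and decode the result, which should add the attribute if the direction captures it.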
2.3. Unsupervised Pretraining for Supervised Tasks
- VAE-GAN is used in a semi-supervised setup by unsupervised pretraining followed by finetuning using a small number of labeled examples.
- However, the authors report that the results are not competitive with the state of the art.
- (There are still other results shown in the paper. Please feel free to read the paper if interested.)
Reference
[2016 ICML] [VAE-GAN]
Autoencoding beyond pixels using a learned similarity metric