Review: LAPGAN — Laplacian Generative Adversarial Network (GAN)

Generate More Realistic Images Compared With the Original GAN

Sik-Ho Tsang
6 min read · Apr 19, 2020

In this story, Laplacian Pyramid of Adversarial Networks (LAPGAN), by New York University and Facebook AI Research (FAIR), is reviewed. By combining a Laplacian pyramid representation with conditional GANs (CGANs), LAPGAN can generate more realistic images than the original GAN. This is a 2015 NIPS paper with over 1400 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. GAN & Conditional GAN (CGAN)
  2. Laplacian Pyramid
  3. LAPGAN
  4. Experimental Results

1. GAN & Conditional GAN (CGAN)

1.1. GAN

  • In GAN, a generative model G and a discriminative model D are trained.
  • The generative model G captures the data distribution.
  • The discriminative model D distinguishes between samples drawn from G and images drawn from the training data.
  • A minimax objective is used to train both models together:

min_G max_D E_h~p_Data(h) [log D(h)] + E_z~p_Noise(z) [log(1 − D(G(z)))]
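
In practice, the two sides of this objective are optimized alternately. Below is a minimal PyTorch sketch (the paper's original implementation was in Torch/Lua; the module names D and G, and the direct use of the saturating log(1 − D(G(z))) generator loss, follow the formula above rather than any official code):

```python
import torch

def gan_losses(D, G, real_images, z):
    """Losses for one alternating step of the GAN minimax game."""
    fake_images = G(z)
    d_real = D(real_images)            # D(h): probability that h is real
    d_fake = D(fake_images.detach())   # D(G(z)), with G frozen for the D step
    # D maximizes log D(h) + log(1 - D(G(z))): minimize the negation
    d_loss = -(torch.log(d_real) + torch.log(1.0 - d_fake)).mean()
    # G minimizes log(1 - D(G(z)))
    g_loss = torch.log(1.0 - D(fake_images)).mean()
    return d_loss, g_loss
```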

1.2. Conditional GAN (CGAN)

  • In CGAN, G and D receive an additional vector of information l as input.
  • This l might contain, say, information about the class of the training example h. The loss function thus becomes:

min_G max_D E_h,l~p_Data(h,l) [log D(h, l)] + E_z~p_Noise(z), l~p_l(l) [log(1 − D(G(z, l), l))]

  • where p_l(l) can be the prior distribution over classes. Thus, CGAN allows the output of the generative model to be controlled by the conditioning variable l.
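
A common way to realize this conditioning is to concatenate l (e.g., a one-hot class vector) with the input of G. The toy generator below is a hypothetical sketch to illustrate the idea, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy CGAN generator: the condition l is concatenated with the noise z."""
    def __init__(self, z_dim=100, l_dim=10, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + l_dim, 1200), nn.ReLU(),
            nn.Linear(1200, img_dim), nn.Tanh(),
        )

    def forward(self, z, l):
        # G(z, l): the output is controlled by the conditioning variable l
        return self.net(torch.cat([z, l], dim=1))
```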

2. Laplacian Pyramid

Laplacian Pyramid Generated From Gaussian Pyramid
  • The Laplacian pyramid is a linear invertible image representation consisting of a set of band-pass images, spaced an octave apart, plus a low-frequency residual.
  • d(.): downsampling operation; d(I) maps an image I of size j×j to a new image of size j/2×j/2.
  • u(.): upsampling operation; u(I) maps an image I of size j×j to a new image of size 2j×2j.
  • A Gaussian pyramid G(I) = [I_0, I_1, …, I_K] is built, where I_0 = I and I_k is the result of k repeated applications of d(.) to I. K is the number of levels in the pyramid, selected so that the final level has very small spatial extent (≤8×8 pixels).
  • The coefficients h_k at each level k of the Laplacian pyramid L(I) are constructed by taking the difference between adjacent levels in the Gaussian pyramid, upsampling the smaller one with u(.):

h_k = L_k(I) = G_k(I) − u(G_k+1(I)) = I_k − u(I_k+1)

  • The final coefficient h_K is not a difference image but the low-frequency residual, equal to the final Gaussian pyramid level: h_K = I_K.
  • Thus, reconstruction from the Laplacian pyramid coefficients [h_1, …, h_K] is performed using the backward recurrence:

I_k = u(I_k+1) + h_k (starting with I_K = h_K)

  • By repeating this recurrence down to k = 0, we get back the full-resolution image.
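
The construction and its inverse are short enough to sketch in numpy. Here d(.) is a simple 2×2 box filter and u(.) is nearest-neighbor replication; these are simplifications of the smoothed operators in a classic Laplacian pyramid, but reconstruction is exact by construction for any choice of d and u:

```python
import numpy as np

def d(img):
    """Downsample by 2 via 2x2 averaging (simplified d(.))."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def u(img):
    """Upsample by 2 via nearest-neighbor replication (simplified u(.))."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(I, K):
    """h_k = I_k - u(I_{k+1}) for k < K, plus the residual h_K = I_K."""
    g = [I]
    for _ in range(K):
        g.append(d(g[-1]))
    return [g[k] - u(g[k + 1]) for k in range(K)] + [g[K]]

def reconstruct(h):
    """Backward recurrence I_k = u(I_{k+1}) + h_k, starting with I_K = h_K."""
    I = h[-1]
    for h_k in reversed(h[:-1]):
        I = u(I) + h_k
    return I

I = np.random.rand(64, 64)
h = laplacian_pyramid(I, K=3)          # 64 -> 32 -> 16 -> 8
assert np.allclose(reconstruct(h), I)  # the representation is invertible
```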

3. LAPGAN

  • LAPGAN combines the CGAN model with a Laplacian pyramid representation.

3.1. Generative Network

Generative Network: upsampled (green arrow), conditioning variable (orange arrow)
  • First, the recurrence starts by setting ~I_K+1 = 0, so the image at level K is generated using only the noise vector z_K:

~I_K = ~h_K = G_K(z_K)

  • After that, models at all levels take an upsampled version of the current image, u(~I_k+1), as a conditioning variable, in addition to the noise vector z_k, as shown above, to output ~h_k:

~I_k = u(~I_k+1) + ~h_k = u(~I_k+1) + G_k(z_k, u(~I_k+1))
  • Specifically, the above figure shows a pyramid with K = 3 using 4 generative models to sample a 64×64 image.
  • The generative models {G_0, …, G_K} are trained using the CGAN approach at each level of the pyramid, where each G_k is a CNN.
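
Putting the pieces together, sampling is a single backward pass through the pyramid. The sketch below assumes already-trained generator modules that return image tensors in NCHW layout; F.interpolate stands in for u(.), and the [-1, 1] uniform noise follows the paper's setup:

```python
import torch
import torch.nn.functional as F

def lapgan_sample(generators, z_dims):
    """Coarse-to-fine LAPGAN sampling, generators = [G_0, ..., G_K]."""
    K = len(generators) - 1
    z_K = torch.rand(1, z_dims[K]) * 2 - 1       # uniform [-1, 1] noise
    I = generators[K](z_K)                       # ~I_K = G_K(z_K)
    for k in range(K - 1, -1, -1):
        up = F.interpolate(I, scale_factor=2)    # u(~I_{k+1})
        z_k = torch.rand(1, z_dims[k]) * 2 - 1
        h_k = generators[k](z_k, up)             # ~h_k = G_k(z_k, u(~I_{k+1}))
        I = up + h_k                             # ~I_k = u(~I_{k+1}) + ~h_k
    return I
```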

3.2. Discriminative Network

Discriminative Network: downsample (red arrow), upsample (green arrow), real case (blue arrow), generated case (magenta arrows)
  • D_k takes as input h_k or ~h_k, along with the upsampled low-pass image as the conditioning variable, and predicts whether the coefficient image is real or generated.
  • The key idea of LAPGAN is that generation is broken into a sequence of successive refinements, so each model only needs to make its own step plausible rather than synthesize the whole image at once.
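
At training time each level is a self-contained CGAN: real coefficients h_k come from the Laplacian pyramid of a training image, and fake ones from G_k. Below is a hedged sketch of one discriminator update; the 100-d noise and the exact loss wiring are illustrative assumptions, not the paper's precise recipe:

```python
import torch
import torch.nn.functional as F

def train_d_step(D_k, G_k, I_k, I_k1, optimizer):
    """One discriminator update at pyramid level k.
    I_k: batch of real images at level k; I_k1: the same images at level k+1."""
    up = F.interpolate(I_k1, scale_factor=2)   # conditioning l_k = u(I_{k+1})
    h_real = I_k - up                          # real high-pass coefficients h_k
    z = torch.rand(I_k.size(0), 100) * 2 - 1   # uniform [-1, 1] noise (assumed 100-d)
    h_fake = G_k(z, up).detach()               # generated ~h_k, with G_k frozen
    pred_real, pred_fake = D_k(h_real, up), D_k(h_fake, up)
    loss = F.binary_cross_entropy(pred_real, torch.ones_like(pred_real)) \
         + F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```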

4. Experimental Results

4.1. Datasets & Networks

  • Three datasets are tested.
  • (i) CIFAR10: 32×32 pixel color images of 10 different classes, 100k training samples with tight crops of objects.
  • (ii) STL: 96×96 pixel color images of 10 different classes, 100k training samples (the unlabeled portion of the data is used).
  • (iii) LSUN: 10M images of 10 different natural scene types, downsampled to 64×64 pixels.
  • All models are implemented in Torch, and the noise vectors z_k are drawn from a uniform [-1,1] distribution.

4.1.1 CIFAR10 and STL

  • Initial scale: This operates at 8×8 resolution, using densely connected nets for both G_K & D_K with 2 hidden layers and ReLU non-linearities. D_K uses Dropout and has 600 units/layer vs 1200 for G_K. z_K is a 100-d vector (see the sketch after this list).
  • Subsequent scale: For both datasets, G_k & D_k are CNN with 3 and 2 layers, respectively.
  • For CIFAR10, the two subsequent levels of the pyramid are 8→14→28.
  • For STL, 4 levels are used from 8→16→32→64→96.
  • For CIFAR10, a class conditional version of the model is also explored, where an additional vector c encodes the label.
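
From the stated sizes, the initial 8×8 scale can be sketched as the pair of MLPs below (PyTorch for illustration; the output nonlinearities and Dropout rate are assumptions, since the text above does not fix them):

```python
import torch.nn as nn

# Initial-scale nets for 8x8 color images: 2 hidden layers each,
# 1200 units/layer for G_K, 600 units/layer + Dropout for D_K, 100-d z_K.
G_K = nn.Sequential(
    nn.Linear(100, 1200), nn.ReLU(),
    nn.Linear(1200, 1200), nn.ReLU(),
    nn.Linear(1200, 8 * 8 * 3), nn.Tanh(),   # Tanh output is an assumption
)
D_K = nn.Sequential(
    nn.Linear(8 * 8 * 3, 600), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(600, 600), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(600, 1), nn.Sigmoid(),         # real/generated probability
)
```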

4.1.2 LSUN

  • The four subsequent scales 4→8→16→32→64 use a common architecture for G_k & D_k at each level. G_k is a 5-layer CNN with {64, 368, 128, 224} feature maps and a linear output layer.
  • 7×7 filters, ReLUs, batch normalization and Dropout are used at each hidden layer.
  • D_k has 3 hidden layers with {48, 448, 416} maps plus a sigmoid output.

4.2. Experimental Results

Parzen window based log-likelihood estimates
  • LAPGAN achieves a significantly higher log-likelihood than GAN on both datasets.
CIFAR10 Samples (Rightmost: The nearest training sample)
  • The above figure shows samples from the models trained on CIFAR10. Samples from the class conditional LAPGAN are organized by class.
  • The reimplementation of the standard GAN obtains slightly sharper images than those shown in the original GAN paper, and the LAPGAN samples improve upon the standard GAN samples.
STL samples: (a) Random 96x96 samples from our LAPGAN model. (b) Coarse-to-fine generation chain.
  • (a): shows random 96×96 samples from the LAPGAN model trained on STL. Here, clear object shapes are lost, but the samples remain sharp.
  • (b): shows the coarse-to-fine generation chain for random STL samples.
64×64 samples from three different LSUN LAPGAN models (top: tower, middle: bedroom, bottom: church front).
  • The above samples are from LAPGAN models trained on three LSUN categories (tower, bedroom, church front).
  • The 4×4 validation image used to start the generation process is shown in the first column, along with 10 different 64×64 samples, which illustrate the inherent variation captured by the model.
  • At the time, 64×64 was a large output size for the GAN family; no other generative model had been able to produce samples of this complexity.
Left: Human evaluation of real CIFAR10 images (red) and samples from GAN (magenta), LAPGAN (blue) and a class conditional LAPGAN (green). Right: The user-interface presented to the subjects
  • 15 volunteers participated in an experiment to see if they could distinguish the generated samples from real images.
  • Four different types of image are shown in random order: samples drawn from three different models trained on CIFAR10 ((i) LAPGAN, (ii) class conditional LAPGAN and (iii) standard GAN [10]), and also real CIFAR10 images.
  • Around 40% of the samples generated by the class conditional LAPGAN model are realistic enough to fool a human into thinking they are real images.
  • In contrast, ≤10% of the samples from the standard GAN model fool the subjects.
  • Real images are judged real at a >90% rate.
