Review — Progressive GAN: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Progressive GAN: Adding Layers in a Progressive Manner for Higher-Resolution Image Generation

Sik-Ho Tsang
5 min read · Aug 14


1024 × 1024 images generated using the CELEBA-HQ dataset.

Progressive Growing of GANs for Improved Quality, Stability, and Variation
Progressive GAN, by NVIDIA and Aalto University
2018 ICLR, Over 6600 Citations (Sik-Ho Tsang @ Medium)

Generative Adversarial Network (GAN)
Image Synthesis: 2014–2019 [SAGAN]
==== My Other Paper Readings Are Also Over Here ====

  • A new training methodology is proposed, which grows both the generator and discriminator progressively: starting from a low resolution, new layers are added that model increasingly fine details as training progresses. This can speed the training up and greatly stabilize the training.
  • A simple way is also proposed to increase the variation in generated images, and achieve a record Inception Score of 8.80 in unsupervised CIFAR-10.
  • A new metric, Sliced Wasserstein Distance (SWD), is also proposed for evaluating GAN results.


  1. Progressive GAN: Progressive Training
  2. Progressive GAN: Increasing Variation
  3. Progressive GAN: Normalization
  4. Progressive GAN: Sliced Wasserstein Distance (SWD)
  5. Results

1. Progressive GAN: Progressive Training

1.1. Conceptual Idea

Progressive Training

The GAN is trained starting with low-resolution images, and the resolution is then progressively increased by adding layers to the networks. This lets the GAN shift its attention to increasingly fine-scale detail instead of having to learn all scales simultaneously. By increasing the resolution little by little, each stage answers a much simpler question than the end goal.

  • Another benefit is reduced training time: up to a 2–6× speedup, depending on the final output resolution.
  • The idea is similar to the layer-by-layer training of autoencoders.

1.2. Implementation

Implementation of Progressive Training
  • The generator and discriminator networks are mirror images of each other and always grow in synchrony. All existing layers in both networks remain trainable throughout training.

When new layers are added to the networks, they are faded in smoothly in a residual-like fashion: the output of the new block is blended with the upsampled output of the previous stage. This avoids sudden shocks to the already well-trained, smaller-resolution layers.
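The fade-in can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name and toy inputs are my own.

```python
import numpy as np

def fade_in(alpha, upsampled_old, new_block_out):
    """Residual-style blend used while growing: alpha ramps linearly
    from 0 to 1, gradually handing control from the upsampled output
    of the previous stage to the newly added high-resolution block."""
    return (1.0 - alpha) * upsampled_old + alpha * new_block_out

# Toy 4x4 "images": all-zeros from the old path, all-ones from the new
old = np.zeros((4, 4))
new = np.ones((4, 4))
print(fade_in(0.25, old, new)[0, 0])   # 0.25: still mostly the old path
```

At alpha = 0 the network behaves exactly like the smaller, already-trained model; at alpha = 1 the new layer has fully taken over.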

2. Progressive GAN: Increasing Variation

  • Progressive GAN first computes the standard deviation for each feature in each spatial location over the minibatch, then averages these estimates over all features and spatial locations to arrive at a single value.
  • This value is replicated and concatenated to all spatial locations and over the minibatch, yielding one additional (constant) feature map.
  • This layer could be inserted anywhere in the discriminator, but it is found to work best towards the end.
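The steps above can be sketched as a minibatch standard-deviation layer; a minimal NumPy version, with names of my own choosing:

```python
import numpy as np

def minibatch_stddev(x, eps=1e-8):
    """x: (N, C, H, W). Compute the stddev of each feature at each
    spatial location over the minibatch, average those estimates into
    a single scalar, and append it as one constant feature map."""
    std = np.sqrt(x.var(axis=0) + eps)         # (C, H, W) stddev over batch
    mean_std = std.mean()                      # one scalar for the batch
    n, _, h, w = x.shape
    extra = np.full((n, 1, h, w), mean_std)    # replicated constant map
    return np.concatenate([x, extra], axis=1)

x = np.random.randn(8, 16, 4, 4)
y = minibatch_stddev(x)
print(y.shape)   # (8, 17, 4, 4): one extra feature map
```

Because the extra map summarizes batch-wide variation, the discriminator can penalize minibatches whose samples look too similar, encouraging variation in the generator.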

3. Progressive GAN: Normalization

3.1. Equalized Learning Rate

  • GANs are prone to the escalation of signal magnitudes as a result of unhealthy competition between the two networks.
  • Progressive GAN deviates from the trend of careful weight initialization, and instead uses a trivial N(0, 1) initialization and then explicitly scales the weights at runtime. To be precise, ŵ_i = w_i / c is used, where w_i are the weights and c is the per-layer normalization constant from He's initializer.

These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter.

3.2. Pixelwise Feature Vector Normalization in Generator

  • The feature vector in each pixel is normalized to unit length in the generator after each convolutional layer.
  • A variant of Local Response Normalization (LRN) from AlexNet is used: b_{x,y} = a_{x,y} / √((1/N) Σ_{j=0}^{N−1} (a^j_{x,y})² + ε), with ε = 10⁻⁸,
  • where N is the number of feature maps, and a_{x,y} and b_{x,y} are the original and normalized feature vectors at pixel (x, y), respectively.

This prevents the escalation of signal magnitudes very effectively when needed.
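The pixelwise normalization above is a one-liner in practice; a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def pixel_norm(a, eps=1e-8):
    """b_{x,y} = a_{x,y} / sqrt(mean_j (a^j_{x,y})^2 + eps):
    normalize the C-dimensional feature vector at every pixel."""
    return a / np.sqrt((a ** 2).mean(axis=1, keepdims=True) + eps)

a = np.random.randn(2, 16, 8, 8)   # (batch, channels, H, W)
b = pixel_norm(a)
# After normalization, the per-pixel mean square over channels is ~1.
print(np.allclose((b ** 2).mean(axis=1), 1.0, atol=1e-3))   # True
```

Note that unlike batch normalization, this has no learnable parameters and no dependence on batch statistics.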

4. Progressive GAN: Sliced Wasserstein Distance (SWD)

  • Similar to MS-SSIM, SWD considers the multi-scale statistical similarity between distributions of local image patches drawn from a Laplacian pyramid.
  • A single Laplacian pyramid level corresponds to a specific spatial frequency band. 16384 images are randomly sampled and 128 descriptors are extracted from each level in the Laplacian pyramid, giving us 2²¹ (2.1M) descriptors per level.
  • Each descriptor x is a 7×7 pixel neighborhood with 3 color channels. The patches from level l of the training set and the generated set are denoted {x_i^l} and {y_i^l}, respectively, for i from 1 to 2²¹.
  • {x_i^l} and {y_i^l} are first normalized w.r.t. the mean and standard deviation of each color channel, and then their statistical similarity is estimated by computing the sliced Wasserstein distance SWD({x_i^l}, {y_i^l}), an efficiently computable randomized approximation to the earth mover's distance (EMD), using 512 projections.
  • Please feel free to check out the SWD codes by koshian2:

Intuitively a small Wasserstein distance indicates that the distribution of the patches is similar.
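The core of SWD can be sketched directly from its definition: project both descriptor sets onto random directions, where the 1-D Wasserstein distance reduces to comparing sorted values. This is a toy sketch of the distance itself, not the paper's full multi-scale pipeline; names and sizes are my own.

```python
import numpy as np

def sliced_wasserstein(A, B, n_projections=512, rng=None):
    """Randomized approximation to the earth mover's distance between
    two descriptor sets A, B of shape (n, d): project both onto random
    unit directions, sort the 1-D projections, and average the
    resulting transport costs over all directions."""
    rng = np.random.default_rng(rng)
    d = A.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)       # random unit direction
        pa = np.sort(A @ theta)              # 1-D projections, sorted
        pb = np.sort(B @ theta)
        total += np.abs(pa - pb).mean()      # 1-D optimal transport cost
    return total / n_projections

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 147))         # e.g. 7x7x3 patch descriptors
B = rng.standard_normal((1024, 147)) + 0.5   # same set, shifted
print(sliced_wasserstein(A, A, n_projections=32, rng=1))  # 0.0: identical sets
print(sliced_wasserstein(A, B, n_projections=32, rng=1))  # > 0: shifted sets
```

Sorting works here because in one dimension the optimal transport plan simply matches the k-th smallest value of one set to the k-th smallest value of the other.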

5. Results

5.1. SWD


MS-SSIM remains approximately unchanged because it measures only the variation between outputs, not similarity to the training set. SWD, on the other hand, does indicate a clear improvement.

  • (a): WGAN-GP is used as the baseline.
  • As the different components are added to the training one by one, SWD decreases steadily.
Visual Quality

The corresponding visual quality for each configuration is shown above.

5.2. Analysis

Progressive vs Normal

(a) Normal vs (b) Progressive: The largest-scale statistical similarity curve (16) reaches its optimal value very quickly and remains consistent throughout the rest of the training.

(c) Reduced Training Time: The speedup from progressive growing increases as the output resolution grows.

5.3. LSUN Visual Quality

LSUN Bedroom Visual Quality
LSUN Visual Quality

LSUN image quality is shown above.


