Brief Review - WGAN-GP: Improved Training of Wasserstein GANs

WGAN With Gradient Penalty, Instead of WGAN With Weight Clipping

Sik-Ho Tsang
Aug 1, 2023

Improved Training of Wasserstein GANs
WGAN-GP, by Montreal Institute for Learning Algorithms, Courant Institute of Mathematical Sciences, and CIFAR Fellow
2017 NIPS, Over 9400 Citations (Sik-Ho Tsang @ Medium)

Generative Adversarial Network (GAN)
Image Synthesis: 2014 ... 2019 [SAGAN]

  • Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but it can still sometimes generate only poor samples or fail to converge. These problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic.
  • In this paper, WGAN-GP proposes an alternative: penalize the norm of the gradient of the critic with respect to its input.


  1. Preliminaries
  2. WGAN With Gradient Penalty (WGAN-GP)
  3. Results

1. Preliminaries

1.1. Standard GAN

  • The game between the generator G and the discriminator D is the minimax objective:
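
For reference, the minimax objective is usually written as below. This is the standard GAN value function; the notation is an assumption borrowed from the WGAN-GP paper, where P_r is the data distribution and P_g is the model distribution defined by x̃ = G(z) with z drawn from a simple noise distribution p(z).

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim \mathbb{P}_r}\big[\log D(x)\big]
  + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[\log\big(1 - D(\tilde{x})\big)\big]
```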

Training with this objective is often unstable.

1.2. Wasserstein GAN (WGAN)

WGAN Training
  • WGAN proposes instead using the Earth-Mover (also called Wasserstein-1) distance W(q, p), informally the minimum cost of transporting mass to transform the distribution q into the distribution p.
  • In the resulting objective, the critic is restricted to the set of 1-Lipschitz functions.
  • A function f is 1-Lipschitz if |f(x) − f(y)| ≤ ‖x − y‖ for all x and y.
  • Intuitively, when x and y are close, f(x) and f(y) are also close.
  • Thus, when the critic (or discriminator) f is constrained to be Lipschitz, training becomes more stable.
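
As a sketch of how these pieces fit together, the WGAN value function obtained through the Kantorovich-Rubinstein duality can be written as follows. The symbols P_r (data distribution), P_g (generator distribution) and the script letter for the function set follow the WGAN-GP paper's notation and are assumptions here, since the post does not spell them out.

```latex
\min_G \max_{D \in \mathcal{D}} \;
  \mathbb{E}_{x \sim \mathbb{P}_r}\big[D(x)\big]
  - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[D(\tilde{x})\big],
\qquad
\mathcal{D} = \{\, f : |f(x) - f(y)| \le \lVert x - y \rVert \ \ \forall\, x, y \,\}
```

A differentiable function is 1-Lipschitz exactly when its gradient has norm at most 1 everywhere, which is why constraining the critic's gradient norm is a natural way to enforce the condition.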

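WGAN uses weight clipping (the clip step in line 7 of the WGAN training algorithm) to enforce the Lipschitz constraint on the critic: after each critic update, every weight is clamped to a fixed range [−c, c].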
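
The clipping step can be illustrated with a minimal sketch. The function name `clip_weights` and the flat list of weights are hypothetical simplifications; c = 0.01 is assumed as the clipping threshold, in line with the default reported for WGAN.

```python
def clip_weights(weights, c=0.01):
    """Clamp every critic weight into [-c, c] (illustrative helper, not a library API)."""
    return [max(-c, min(c, w)) for w in weights]


# Toy critic weights after a gradient step (made-up values).
critic_weights = [0.5, -0.003, 0.02, -1.2]
print(clip_weights(critic_weights))  # [0.01, -0.003, 0.01, -0.01]
```

Weights already inside the range are left alone, while large weights are pushed to the boundary. This is the hard constraint that WGAN-GP replaces.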



2. WGAN With Gradient Penalty (WGAN-GP)

Instead of weight clipping, WGAN-GP applies a gradient penalty (GP) as a soft constraint on the critic.
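
For reference, the penalized critic loss can be written as below. This is a reconstruction in the WGAN-GP paper's notation, which is an assumption here: x̂ is sampled uniformly along straight lines between a real sample x and a generated sample x̃, and λ is the penalty coefficient.

```latex
L = \underbrace{\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[D(\tilde{x})\big]
      - \mathbb{E}_{x \sim \mathbb{P}_r}\big[D(x)\big]}_{\text{original critic loss}}
  + \underbrace{\lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]}_{\text{gradient penalty}},
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```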

  • λ=10 is found to work well.
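  • No critic batch normalization: batch normalization makes the critic's output for one sample depend on the whole batch, whereas the penalty is applied to the gradient with respect to each input independently. Layer normalization is used as a drop-in replacement.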
  • Two-sided penalty: WGAN-GP encourages the norm of the gradient to go towards 1 (two-sided penalty) instead of just staying below 1 (one-sided penalty).
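
The penalty term described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the linear `critic` is hypothetical, and the input gradient is estimated by central finite differences, whereas a real training loop would obtain it from an automatic differentiation framework. The `two_sided` flag switches between the two-sided and one-sided variants.

```python
import math
import random


def critic(x, w=(3.0, 4.0)):
    """Toy linear critic D(x) = w . x (hypothetical, for illustration only)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def input_gradient(f, x, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def gradient_penalty(f, real, fake, lam=10.0, two_sided=True, rng=random):
    """Average penalty over interpolates x_hat = t * x + (1 - t) * x_tilde."""
    total = 0.0
    for x, x_tilde in zip(real, fake):
        t = rng.random()  # t ~ U[0, 1], one draw per real/fake pair
        x_hat = [t * a + (1 - t) * b for a, b in zip(x, x_tilde)]
        norm = math.sqrt(sum(g * g for g in input_gradient(f, x_hat)))
        excess = norm - 1.0
        if not two_sided:
            excess = max(0.0, excess)  # one-sided: only norms above 1 are penalized
        total += excess ** 2
    return lam * total / len(real)


real = [[1.0, 2.0], [0.0, -1.0]]   # made-up "real" samples
fake = [[0.5, 0.5], [2.0, 1.0]]    # made-up "generated" samples
print(gradient_penalty(critic, real, fake))  # close to 160, since the gradient norm is 5 everywhere
```

For a critic whose gradient norm is below 1, the two-sided form still produces a positive penalty while the one-sided form produces none, which is the difference the last bullet describes.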

3. Results

3.1. Random Architectures

200 Random Architectures are Sampled.

Starting from the DCGAN architecture, a set of architecture variants is defined by changing model settings.

Training Succeeds or Not

WGAN-GP successfully trains many architectures from this set for which the standard GAN objective nearly always failed.

Only WGAN-GP succeeded in training every architecture with a shared set of hyperparameters.

3.2. CIFAR-10

CIFAR-10 Inception Score

WGAN-GP converges more slowly (in wall-clock time) than DCGAN, but its score is more stable at convergence.

CIFAR-10 Inception Score

Left: An architecture is found that establishes a new state-of-the-art Inception Score on unsupervised CIFAR-10.

Right: The proposed conditional model outperforms all others except SGAN.

A deep ResNet is successfully trained on 128 × 128 LSUN bedrooms as above.

  • To the best of the authors’ knowledge, this is the first time very deep ResNets were successfully trained in a GAN setting.


