Brief Review — StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN++: Introducing StackGAN-v1 & StackGAN-v2

5 min read · Aug 10, 2023

StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
StackGAN++, StackGAN-v2, by Rutgers University, Lehigh University, The Chinese University of Hong Kong, and University of North Carolina at Charlotte
2018 TPAMI, Over 1200 Citations (Sik-Ho Tsang @ Medium)

(The paper also presents StackGAN-v1, which is essentially the same as the model published at ICCV 2017. In this story, I focus on StackGAN-v2, which is proposed for both conditional and unconditional generative tasks.)
StackGAN-v2 consists of multiple generators and multiple discriminators arranged in a tree-like structure; images at multiple scales corresponding to the same scene are generated from different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximating multiple distributions.

Outline

1. StackGAN-v2
2. Results

1. StackGAN-v2

1.1. Model Framework

The StackGAN-v2 framework has a tree-like structure: it takes a noise vector z ~ p_noise as input and uses multiple generators to produce images at different scales.

The hidden features h_i for each generator G_i are computed by a non-linear transformation:

$$h_0 = F_0(z); \quad h_i = F_i(h_{i-1}, z), \quad i = 1, 2, \dots, m-1$$

where h_i represents the hidden features for the i-th branch, m is the total number of branches, and the F_i are modeled as neural networks.
The noise vector z is concatenated to the hidden features h_{i-1} as the input of F_i for calculating h_i.
Finally, the generators produce samples from small to large scales (s_0, s_1, …, s_{m-1}):

$$s_i = G_i(h_i), \quad i = 0, 1, \dots, m-1$$
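To make the tree structure concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: `F0`, `make_F`, `make_G`, the `tanh` non-linearity, and the toy image sizes (4, 8, 16) are all hypothetical stand-ins chosen only to show how h_0 = F_0(z), h_i = F_i(h_{i-1}, z), and s_i = G_i(h_i) chain together.

```python
import math
import random

def F0(z):
    # First branch: hidden features come from the noise alone, h_0 = F_0(z).
    return [math.tanh(v) for v in z]

def make_F(weight):
    # Later branches: h_i = F_i(h_{i-1}, z). The noise vector z is
    # concatenated to h_{i-1} before a (toy) non-linear transformation.
    def F(h_prev, z):
        joined = h_prev + z  # list concatenation
        return [math.tanh(weight * v) for v in joined]
    return F

def make_G(side):
    # Toy generator head G_i: renders h_i as a side x side grayscale "image"
    # by cycling through the hidden features.
    def G(h):
        return [[h[(r * side + c) % len(h)] for c in range(side)]
                for r in range(side)]
    return G

def forward(z, Fs, Gs):
    # Walk the tree: every branch emits one sample s_i = G_i(h_i).
    h = F0(z)
    samples = [Gs[0](h)]
    for F, G in zip(Fs, Gs[1:]):
        h = F(h, z)
        samples.append(G(h))
    return samples

random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(8)]
Fs = [make_F(0.5), make_F(0.25)]         # F_1, F_2 (weights are arbitrary)
Gs = [make_G(4), make_G(8), make_G(16)]  # G_0, G_1, G_2 (toy resolutions)
samples = forward(z, Fs, Gs)
print([len(s) for s in samples])  # image side length per branch: [4, 8, 16]
```

Note that the same z feeds every branch, so all scales are generated from one noise draw.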

Following each generator G_i, a discriminator D_i takes a real image x_i or a fake sample s_i as input and is trained to classify it as real or fake by minimizing the following cross-entropy loss:

$$\mathcal{L}_{D_i} = -\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i)] - \mathbb{E}_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i))]$$

The multiple discriminators are trained in parallel, and each of them focuses on a single image scale.
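Guided by the trained discriminators, the generators are optimized to jointly approximate the multi-scale image distributions (p_{data_0}, p_{data_1}, …, p_{data_{m-1}}) by minimizing the following loss function:

$$\mathcal{L}_G = \sum_{i=0}^{m-1} \mathcal{L}_{G_i}, \quad \mathcal{L}_{G_i} = -\mathbb{E}_{s_i \sim p_{G_i}}[\log D_i(s_i)]$$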

During training, the discriminators D_i and the generators G_i are alternately optimized until convergence.
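The two losses can be sketched numerically as follows. This is only an illustration under stated assumptions: each discriminator output is treated as a probability in (0, 1), expectations are replaced by batch means, and the function names and sample values are hypothetical, not taken from the paper's code.

```python
import math

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake):
    # Cross-entropy loss for one scale:
    # L_{D_i} = -E[log D_i(x_i)] - E[log(1 - D_i(s_i))]
    real_term = -mean([math.log(p) for p in d_real])
    fake_term = -mean([math.log(1.0 - p) for p in d_fake])
    return real_term + fake_term

def generator_loss(d_fake_per_scale):
    # Joint generator loss: sum over scales of L_{G_i} = -E[log D_i(s_i)].
    return sum(-mean([math.log(p) for p in d_fake])
               for d_fake in d_fake_per_scale)

# Hypothetical discriminator outputs for one scale.
d_real = [0.9, 0.8]   # D_i(x_i) on real images
d_fake = [0.1, 0.2]   # D_i(s_i) on generated samples
loss_d = discriminator_loss(d_real, d_fake)

# Hypothetical outputs on fake samples for two scales (i = 0, 1).
loss_g = generator_loss([[0.1, 0.2], [0.3, 0.4]])
print(f"L_D = {loss_d:.4f}, L_G = {loss_g:.4f}")
```

In an alternating scheme, one step would lower `discriminator_loss` for each D_i with the generators fixed, and the next would lower `generator_loss` with the discriminators fixed.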

The motivation of the proposed StackGAN-v2 is that, by modeling data distributions at multiple scales, if any one of those model distributions shares support with the real data distribution at that scale, the overlap could provide good gradient signal to expedite or stabilize training of the whole network at multiple scales.
For instance, approximating the low-resolution image distribution at the first branch results in images with basic color and structures. Then the generators at the subsequent branches can focus on completing details for generating higher resolution images.

1.2. Joint Conditional and Unconditional Distribution Approximation

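For the generator of the conditional StackGAN-v2, F_0 and F_i are modified to take the conditioning vector c as input, such that h_0 = F_0(c, z) and h_i = F_i(h_{i-1}, c). For F_i, the conditioning vector c replaces the noise vector z.
The objective function for training the discriminator D_i of the conditional StackGAN-v2 now consists of two terms, the unconditional loss and the conditional loss:

$$\mathcal{L}_{D_i} = \underbrace{-\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i)] - \mathbb{E}_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i))]}_{\text{unconditional loss}} + \underbrace{-\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i, c)] - \mathbb{E}_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i, c))]}_{\text{conditional loss}}$$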

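The unconditional loss determines whether the image is real or fake, while the conditional loss determines whether the image and the condition match. Accordingly, the loss function for each generator G_i becomes:

$$\mathcal{L}_{G_i} = \underbrace{-\mathbb{E}_{s_i \sim p_{G_i}}[\log D_i(s_i)]}_{\text{unconditional loss}} + \underbrace{-\mathbb{E}_{s_i \sim p_{G_i}}[\log D_i(s_i, c)]}_{\text{conditional loss}}$$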
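A small sketch of how the unconditional and conditional terms combine, assuming each discriminator exposes two probability outputs: one for the image alone, D_i(·), and one for the image paired with the condition, D_i(·, c). The helper names and the sample values are hypothetical, and expectations are approximated by batch means.

```python
import math

def mean(values):
    return sum(values) / len(values)

def nll_real(probs):
    # -E[log p]: penalizes low scores on inputs that should be judged "real".
    return -mean([math.log(p) for p in probs])

def nll_fake(probs):
    # -E[log(1 - p)]: penalizes high scores on inputs that should be "fake".
    return -mean([math.log(1.0 - p) for p in probs])

def conditional_discriminator_loss(d_real, d_fake, d_real_c, d_fake_c):
    # Unconditional term: is the image real or fake?
    unconditional = nll_real(d_real) + nll_fake(d_fake)
    # Conditional term: do the image and the condition c match?
    conditional = nll_real(d_real_c) + nll_fake(d_fake_c)
    return unconditional + conditional

def conditional_generator_loss(d_fake, d_fake_c):
    # The generator wants both heads to score its samples as real/matching.
    return nll_real(d_fake) + nll_real(d_fake_c)

# Hypothetical outputs for one scale i.
loss_d = conditional_discriminator_loss(
    d_real=[0.9, 0.7], d_fake=[0.2, 0.3],
    d_real_c=[0.8, 0.6], d_fake_c=[0.1, 0.4],
)
loss_g = conditional_generator_loss(d_fake=[0.2, 0.3], d_fake_c=[0.1, 0.4])
print(f"L_D = {loss_d:.4f}, L_G = {loss_g:.4f}")
```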

1.3. Color-Consistency Regularization

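Let x_k = (R, G, B)^T represent a pixel in a generated image. The mean and covariance of the pixels of the given image are then defined by

$$\mu = \frac{1}{N}\sum_k x_k, \quad \Sigma = \frac{1}{N}\sum_k (x_k - \mu)(x_k - \mu)^T$$

where N is the number of pixels in the image.
The color-consistency regularization term minimizes the differences in the mean μ and covariance Σ between adjacent scales to encourage consistency:

$$\mathcal{L}_{C_i} = \frac{1}{n}\sum_{j=1}^{n}\left(\lambda_1 \lVert \mu_{s_i^j} - \mu_{s_{i-1}^j} \rVert_2^2 + \lambda_2 \lVert \Sigma_{s_i^j} - \Sigma_{s_{i-1}^j} \rVert_F^2\right)$$

where n is the batch size, s_i^j is the j-th sample from generator G_i, and λ_1 and λ_2 weight the two terms.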

Thus, the final loss for training the i-th generator is defined as:

$$\mathcal{L}'_{G_i} = \mathcal{L}_{G_i} + \alpha \cdot \mathcal{L}_{C_i}$$

α=50 for the unconditional task, while it is not needed (α=0) for the text-to-image synthesis task.
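The regularizer can be sketched in a few lines of plain Python. This is an illustration only: images are represented as flat lists of (R, G, B) tuples, the default weights `lam1` and `lam2` are assumed values for the example, and all function names are hypothetical. The covariance difference is measured with the squared Frobenius norm, and the mean difference with the squared Euclidean norm.

```python
def pixel_stats(pixels):
    # Mean (3-vector) and covariance (3x3) over all pixels of one image.
    n = len(pixels)
    mu = [sum(p[ch] for p in pixels) / n for ch in range(3)]
    cov = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in pixels) / n
            for b in range(3)]
           for a in range(3)]
    return mu, cov

def color_consistency(batch_hi, batch_lo, lam1=1.0, lam2=5.0):
    # Average, over the batch, of the weighted squared differences in pixel
    # mean and covariance between scale i and scale i-1.
    # lam1 and lam2 are assumed defaults for illustration.
    total = 0.0
    for img_hi, img_lo in zip(batch_hi, batch_lo):
        mu_h, cov_h = pixel_stats(img_hi)
        mu_l, cov_l = pixel_stats(img_lo)
        mean_term = sum((a - b) ** 2 for a, b in zip(mu_h, mu_l))
        cov_term = sum((cov_h[r][c] - cov_l[r][c]) ** 2
                       for r in range(3) for c in range(3))
        total += lam1 * mean_term + lam2 * cov_term
    return total / len(batch_hi)

def regularized_generator_loss(adv_loss, cc_loss, alpha=50.0):
    # Final loss: adversarial loss plus alpha times the color-consistency
    # term. alpha=50 for the unconditional task, alpha=0 for text-to-image.
    return adv_loss + alpha * cc_loss

# Toy example: a 2-pixel "high-res" image and a 1-pixel "low-res" image
# with the same mean color but different color spread.
img_hi = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
img_lo = [(0.5, 0.0, 0.0)]
cc = color_consistency([img_hi], [img_lo])
print(cc, regularized_generator_loss(1.0, cc))
```

Because the two toy images share the same mean color, only the covariance term contributes to `cc` here.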

2. Results

2.1. Inception Score & FID

For unconditional generation, the samples generated by StackGAN-v2 are consistently better than those of StackGAN-v1 (last four columns in TABLE 3).