Brief Review — StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN, Two Stacked Conditional GANs for Generating High-Resolution Images

Sik-Ho Tsang
4 min read · Aug 8


Comparison Between StackGAN and Vanilla GAN

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
StackGAN, StackGAN-v1, by Rutgers University, Lehigh University, The Chinese University of Hong Kong, and Baidu Research
2017 ICCV, Over 2900 Citations (Sik-Ho Tsang @ Medium)

Generative Adversarial Network (GAN)
Image Synthesis: 2014–2019 [SAGAN]
==== My Other Paper Readings Are Also Over Here ====

  • Stacked Generative Adversarial Networks (StackGAN) is proposed to generate 256×256 photo-realistic images conditioned on text descriptions. The hard problem is decomposed into more manageable sub-problems through a sketch-refinement process.
  • Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images.
  • Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details.
  • Later, StackGAN-v2, also known as StackGAN++, was proposed in 2018 TPAMI.


  1. StackGAN or StackGAN-v1
  2. Results

1. StackGAN or StackGAN-v1


1.1. Stage-I GAN (Top)

Stage-I GAN simplifies the task to first generate a low-resolution image, which focuses on drawing only rough shape and correct colors for the object.

  • Let φt be the text embedding of the given description, which is generated by a pre-trained encoder in this paper.
  • Conditioning Augmentation (CA): The Gaussian conditioning variables ĉ0 for the text embedding are sampled from N(μ0(φt), Σ0(φt)) to capture the meaning of φt with variations. This provides training stability and sample diversity, e.g. various poses and appearances.
  • Conditioned on ĉ0 and a random noise variable z, Stage-I GAN trains the discriminator D0 and the generator G0 by alternately maximizing LD0 in Eq. (3) and minimizing LG0 in Eq. (4):
  • where the real image I0 and the text description t are from the true data distribution pdata, z is a noise vector sampled from a given distribution pz, and λ = 1.
  • (The model architecture is only shown abstractly in the figure, so it is not described in detail here.)
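For reference, the Stage-I objectives referenced above as Eqs. (3) and (4) can be written out as follows (reconstructed here from the paper's notation):

```latex
\mathcal{L}_{D_0} = \mathbb{E}_{(I_0, t) \sim p_{data}}\big[\log D_0(I_0, \varphi_t)\big]
  + \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big]

\mathcal{L}_{G_0} = \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}\big[\log\big(1 - D_0(G_0(z, \hat{c}_0), \varphi_t)\big)\big]
  + \lambda\, D_{KL}\big(\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t)) \,\|\, \mathcal{N}(0, I)\big)
```

The KL term in LG0 is the CA regularizer, pulling the conditioning distribution toward the standard Gaussian.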

At the end of the discriminator, the image feature map (blue) is concatenated along the channel dimension with the text tensor (green), which is spatially replicated from the compressed text embedding.
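The two tensor operations above can be sketched in numpy. This is a minimal shape-level sketch, not the paper's implementation: the linear heads `W_mu`/`W_logsigma` for CA and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditioning_augmentation(phi_t, W_mu, W_logsigma):
    """CA: sample c0_hat ~ N(mu0(phi_t), diag(sigma0(phi_t)^2)) via the
    reparameterization trick. Linear heads here are assumptions."""
    mu = W_mu @ phi_t
    log_sigma = W_logsigma @ phi_t
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def concat_text_features(feat_map, txt_vec):
    """Spatially replicate a text vector (C_txt,) to (C_txt, H, W), then
    concatenate with an image feature map (C_img, H, W) along channels."""
    _, H, W = feat_map.shape
    txt_tensor = np.broadcast_to(txt_vec[:, None, None],
                                 (txt_vec.shape[0], H, W))
    return np.concatenate([feat_map, txt_tensor], axis=0)

# toy shapes: 1024-d text embedding -> 128-d c0_hat; 4x4 feature map in D0
phi_t = rng.standard_normal(1024)
W_mu = rng.standard_normal((128, 1024)) * 0.01
W_ls = rng.standard_normal((128, 1024)) * 0.01
c0_hat = conditioning_augmentation(phi_t, W_mu, W_ls)
fused = concat_text_features(rng.standard_normal((512, 4, 4)), c0_hat)
print(c0_hat.shape, fused.shape)  # (128,) (640, 4, 4)
```

The replication step is what lets a single text vector interact with every spatial location of the image feature map before the final classification layers.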

1.2. Stage-II GAN

  • Stage-II GAN is conditioned on the Stage-I low-resolution images and, again, on the text embedding, to correct defects in the Stage-I results.

Stage-II GAN learns to capture useful information in the text embedding that is omitted by Stage-I GAN.

  • Stage-II generator is an encoder-decoder network with residual blocks.
  • Conditioning Augmentation (CA) is also used here.
  • The image features and the text features are concatenated along the channel dimension to learn multi-modal representations across image and text features.
  • Finally, a series of up-sampling layers (decoder) are used to generate a W×H high-resolution image.
  • Rather than using the vanilla discriminator, the matching-aware discriminator is used for both stages.

2. Results

2.1. Visual Quality

CUB Test Set
Oxford-102 Test Set

2.2. SOTA Comparisons

Inception Score

StackGAN achieves the best Inception Score and average human rank on all three datasets.

  • Compared with GAN-INT-CLS, StackGAN achieves a 28.47% improvement in Inception Score on the CUB dataset (from 2.88 to 3.70) and a 20.30% improvement on Oxford-102 (from 2.66 to 3.20).
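The quoted relative improvements can be checked directly from the reported scores:

```python
def pct_improvement(old, new):
    """Relative improvement in percent."""
    return (new - old) / old * 100

print(round(pct_improvement(2.88, 3.70), 2))  # 28.47  (CUB)
print(round(pct_improvement(2.66, 3.20), 2))  # 20.3   (Oxford-102)
```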

2.3. Ablation Studies

StackGAN Variants

Among all variants, the 256×256 StackGAN adds more detail to the larger images and obtains the best Inception Score.


