# Brief Review — StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

## StackGAN++, Introduced StackGAN-v1 & StackGAN-v2

---

**StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks**, StackGAN++ / StackGAN-v2, by Rutgers University, Lehigh University, The Chinese University of Hong Kong, and University of North Carolina at Charlotte, 2018 TPAMI, Over 1200 Citations (Sik-Ho Tsang @ Medium)

Generative Adversarial Network (GAN), Image Synthesis: 2014 … 2019 [SAGAN]


- StackGAN-v1 is essentially the same model as the one published at 2017 ICCV. In this story, the focus is on StackGAN-v2, which is proposed for both conditional and unconditional generative tasks.
- **StackGAN-v2** consists of **multiple generators and multiple discriminators** arranged in a **tree-like structure**; images at multiple scales corresponding to the same scene are generated from different branches of the tree. StackGAN-v2 shows **more stable training behavior than StackGAN-v1** by **jointly approximating multiple distributions**.

# Outline

1. **StackGAN-v2**
2. **Results**

# 1. StackGAN-v2

## 1.1. Model Framework

The StackGAN-v2 framework has a **tree-like structure**, which takes a noise vector *z* ~ *p*_noise as the input and has **multiple generators** to produce **images of different scales**.

- The hidden features *h_i* are computed for each generator *G_i* by a non-linear transformation:

  *h*_0 = *F*_0(*z*), *h_i* = *F_i*(*h*_{i-1}, *z*), *i* = 1, 2, …, *m*-1,

- where *h_i* represents the hidden features for the *i*-th branch, *m* is the total number of branches, and the *F_i* are modeled as neural networks.
- The **noise vector** *z* is **concatenated** to the **hidden features** *h*_{i-1} as the input of *F_i* for **calculating** *h_i*.
- At the end, the **generators produce samples of small-to-large scales** (*s*_0, *s*_1, …, *s*_{m-1}):

  *s_i* = *G_i*(*h_i*), *i* = 0, 1, …, *m*-1.

- Following each generator *G_i*, a **discriminator** *D_i*, which takes a real image *x_i* or a fake sample *s_i* as input, is trained to **classify inputs into two classes (real or fake)** by **minimizing the following cross-entropy loss:**

  *L*_{D_i} = -E_{x_i ~ p_{data_i}}[log *D_i*(*x_i*)] - E_{s_i ~ p_{G_i}}[log(1 - *D_i*(*s_i*))].

- The multiple discriminators are trained in parallel, and each of them focuses on a single image scale.
- Guided by the trained discriminators, the **generators** are optimized to **jointly approximate the multi-scale image distributions** (*p*_{data_0}, *p*_{data_1}, …, *p*_{data_{m-1}}) by **minimizing the following loss function:**

  *L_G* = Σ_i *L*_{G_i}, where *L*_{G_i} = -E_{s_i ~ p_{G_i}}[log *D_i*(*s_i*)].

- During the training process, the discriminators *D_i* and the generators *G_i* are **alternately optimized till convergence**.
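As a toy illustration, the tree-structured forward pass above can be sketched as follows; the one-layer `make_mlp` networks and all dimensions here are hypothetical stand-ins for the real *F_i* and *G_i*, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, out_dim):
    """A toy one-layer 'network' standing in for F_i / G_i (hypothetical)."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: np.tanh(x @ W)

z_dim, h_dim, m = 16, 32, 3               # noise size, hidden size, number of branches
scales = [64 * 64, 128 * 128, 256 * 256]  # flattened image sizes s_0 < s_1 < s_2

F = [make_mlp(z_dim, h_dim)] + [make_mlp(h_dim + z_dim, h_dim) for _ in range(m - 1)]
G = [make_mlp(h_dim, size) for size in scales]

z = rng.normal(size=(1, z_dim))

# h_0 = F_0(z); h_i = F_i([h_{i-1}, z]) -- z is concatenated to the previous hidden features
h = [F[0](z)]
for i in range(1, m):
    h.append(F[i](np.concatenate([h[i - 1], z], axis=1)))

# s_i = G_i(h_i): samples of small-to-large scales from different branches of the tree
s = [G[i](h[i]) for i in range(m)]
print([x.shape for x in s])  # [(1, 4096), (1, 16384), (1, 65536)]
```

Each branch reuses the same noise vector *z*, so all scales depict the same underlying scene, which is what lets the discriminators at different resolutions provide complementary training signals.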

The **motivation** of the proposed StackGAN-v2 is that, by **modeling data distributions at multiple scales**, if any one of those model distributions shares support with the real data distribution at that scale, the overlap could **provide a good gradient signal to expedite or stabilize the training** of the whole network at multiple scales.

For instance, approximating the **low-resolution** image **distribution at the first branch** results in images with **basic color and structures**. Then the generators at the **subsequent branches** can focus on **completing details** for generating **higher-resolution images**.

## 1.2. Joint Conditional and Unconditional Distribution Approximation

- For the generator of the **conditional StackGAN-v2**, *F*_0 and *F_i* are converted to take the conditioning vector *c* as input, such that *h*_0 = *F*_0(*c*, *z*) and *h_i* = *F_i*(*h*_{i-1}, *c*). For *F_i*, the conditioning vector *c* replaces the noise vector *z*.
- The objective function for training the **discriminator** *D_i* consists of **two terms**, the **unconditional loss** and the **conditional loss**:

  *L*_{D_i} = [-E_{x_i ~ p_{data_i}}[log *D_i*(*x_i*)] - E_{s_i ~ p_{G_i}}[log(1 - *D_i*(*s_i*))]] + [-E_{x_i ~ p_{data_i}}[log *D_i*(*x_i*, *c*)] - E_{s_i ~ p_{G_i}}[log(1 - *D_i*(*s_i*, *c*))]].

- The **unconditional loss** determines **whether the image is real or fake**, while the **conditional one** determines whether the **image and the condition match or not**. Accordingly, the loss function for each generator *G_i* is converted to:

  *L*_{G_i} = -E_{s_i ~ p_{G_i}}[log *D_i*(*s_i*)] - E_{s_i ~ p_{G_i}}[log *D_i*(*s_i*, *c*)].
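A minimal numeric sketch of how the unconditional and conditional terms combine for one branch *i*; the discriminator scores below are made-up scalars, not outputs of a trained model:

```python
import numpy as np

def bce_real(p):
    """-log D(x): the discriminator output p should be near 1 for real inputs."""
    return -np.log(p)

def bce_fake(p):
    """-log(1 - D(s)): the output p should be near 0 for fake samples."""
    return -np.log1p(-p)

# Hypothetical discriminator scores in (0, 1) for branch i.
d_real_uncond, d_fake_uncond = 0.9, 0.2  # D_i(x_i),    D_i(s_i)
d_real_cond,   d_fake_cond   = 0.8, 0.3  # D_i(x_i, c), D_i(s_i, c)

# Discriminator objective: unconditional term + conditional term.
loss_Di = (bce_real(d_real_uncond) + bce_fake(d_fake_uncond)
           + bce_real(d_real_cond) + bce_fake(d_fake_cond))

# Generator objective: fool both the unconditional and the conditional parts.
loss_Gi = -np.log(d_fake_uncond) - np.log(d_fake_cond)
```

Note that the generator's conditional term is small only when the fake image both looks real and matches the condition *c*, which is how the match/mismatch signal reaches the generator.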

## 1.3. Color-Consistency Regularization

- Let *x_k* = (*R*, *G*, *B*)^T represent a **pixel** in a generated image; then the mean and covariance of the pixels of the given image can be defined by

  *μ* = (1/*N*) Σ_k *x_k*, *Σ* = (1/*N*) Σ_k (*x_k* - *μ*)(*x_k* - *μ*)^T,

  where *N* is the number of pixels in the image.

The color-consistency regularization term aims at minimizing the differences of *μ* and *Σ* between different scales to encourage the consistency:

  *L*_{C_i} = (1/*n*) Σ_{j=1..n} (*λ*_1 ‖*μ*_{s_i^j} - *μ*_{s_{i-1}^j}‖² + *λ*_2 ‖*Σ*_{s_i^j} - *Σ*_{s_{i-1}^j}‖²_F),

where *n* is the batch size, and *λ*_1 and *λ*_2 balance the two terms.

- Thus, the **final loss** for training the *i*-th generator is:

  *L'*_{G_i} = *L*_{G_i} + *α* *L*_{C_i}.

- *α* = 50 for the **unconditional task**, while the color-consistency term is not needed (*α* = 0) for the text-to-image synthesis task.
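The per-image statistics and the regularization term for one pair of scales can be sketched in a few lines; the `lam1`/`lam2` weights and the random images are assumptions for illustration only:

```python
import numpy as np

def pixel_stats(img):
    """Mean and covariance of the RGB pixels x_k = (R, G, B)^T of one image (N x 3 array)."""
    mu = img.mean(axis=0)
    d = img - mu
    cov = d.T @ d / img.shape[0]
    return mu, cov

def color_consistency(small, large, lam1=1.0, lam2=5.0):
    """Squared distance of means plus weighted squared Frobenius distance of
    covariances between two scales of the same sample (lam1/lam2 are assumed weights)."""
    mu_s, cov_s = pixel_stats(small)
    mu_l, cov_l = pixel_stats(large)
    return lam1 * np.sum((mu_l - mu_s) ** 2) + lam2 * np.sum((cov_l - cov_s) ** 2)

rng = np.random.default_rng(0)
small = rng.random((64 * 64, 3))          # low-resolution sample s_{i-1}, flattened pixels
large_same = np.repeat(small, 4, axis=0)  # upsampled copy: identical color statistics
large_diff = rng.random((128 * 128, 3)) * 0.5  # darker image: mismatched statistics

print(color_consistency(small, large_same))  # ~0: colors are consistent across scales
print(color_consistency(small, large_diff))  # larger: the regularizer penalizes the mismatch
```

A pixel-for-pixel upsampled copy leaves *μ* and *Σ* unchanged, so the penalty vanishes exactly when adjacent scales agree in color statistics, which is the behavior the regularizer rewards.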

# 2. Results

## 2.1. Inception Score & FID

For unconditional generation, the samples generated by StackGAN-v2 are consistently better than those by StackGAN-v1 (last four columns in TABLE 3).

## 2.2. Visual Quality

## 2.3. Ablation Studies

The **best** StackGAN-v2 is obtained by **enabling all the proposed techniques**.