Brief Review — StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks

StyleGAN, GAN for Style Transfer

Sik-Ho Tsang
4 min read · Nov 19, 2023

A Style-Based Generator Architecture for Generative Adversarial Networks
StyleGAN, by NVIDIA
2019 CVPR, Over 8700 Citations (Sik-Ho Tsang @ Medium)

Generative Adversarial Network (GAN)
Style Transfer: 2016 [GAN-CLS, GAN-INT, GAN-CLS-INT]
==== My Other Paper Readings Are Also Over Here ====

  • StyleGAN is proposed, which leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis.
  • Two new automated methods, applicable to any generator architecture, are also proposed to quantify interpolation quality and disentanglement.
  • Later, there are also StyleGAN2, StyleGAN3, and StyleGAN-XL.

Outline

  1. StyleGAN Architecture
  2. Disentanglement Studies
  3. Qualitative Results

1. StyleGAN Architecture

StyleGAN Incremental Improvement From Progressive GAN

(A) Progressive GAN is improved by incrementally adding components B to F.

1.1. (A) Progressive GAN

Starting from Progressive GAN (A), StyleGAN is obtained by incrementally adding the components above up to (F); results are reported for the various generator architectures on CELEBA-HQ [30] and the proposed new FFHQ dataset.

StyleGAN Generator

1.2. (B) An Improved Baseline

An improved baseline is achieved by using bilinear up/downsampling operations [64], longer training, and tuned hyperparameters.
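As a small illustration of the resampling change, the sketch below uses PyTorch's bilinear interpolation for the up/downsampling steps; this is only an assumed stand-in for the exact resampling used in the official implementation.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 512, 8, 8)  # a batch of 8x8 feature maps

# Bilinear upsampling (8x8 -> 16x16) and downsampling (8x8 -> 4x4),
# replacing the nearest-neighbor / average-pooling resizes of Progressive GAN.
up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
down = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)

print(up.shape, down.shape)  # [1, 512, 16, 16] and [1, 512, 4, 4]
```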

1.3. (C) Adaptive Instance Normalization (AdaIN)

A non-linear mapping network f first maps the latent code z to an intermediate latent code w. Learned affine transformations then specialize w to styles y = (ys, yb) that control adaptive instance normalization (AdaIN) [27, 17, 21, 16] operations after each convolution layer of the synthesis network g.

  • The AdaIN operation is defined as:

AdaIN(xi, y) = ys,i · (xi − μ(xi)) / σ(xi) + yb,i

  • where each feature map xi is normalized separately and then scaled and biased using the corresponding scalar components of style y (a minimal sketch follows this list).
  • A similar operation is also used in Style Transfer.
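Below is a minimal PyTorch sketch of a style-modulated AdaIN block, with a learned affine layer producing (ys, yb) from w; the module and parameter names are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: normalize each feature map of x,
    then scale and bias it with a style derived from w."""
    def __init__(self, num_channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels)       # per-channel mean/std normalization
        self.affine = nn.Linear(w_dim, num_channels * 2)  # learned affine: w -> (y_s, y_b)

    def forward(self, x, w):
        y_s, y_b = self.affine(w).chunk(2, dim=1)         # scale and bias styles, [N, C] each
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(x) + y_b

# usage: feature maps after a convolution, styled by an intermediate latent w
x = torch.randn(4, 512, 8, 8)
w = torch.randn(4, 512)
print(AdaIN(512)(x, w).shape)  # torch.Size([4, 512, 8, 8])
```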

1.4. (D) Remove Traditional Input

  • It is observed that the network no longer benefits from feeding the latent code into the first convolution layer.

Therefore the architecture is simplified by removing the traditional input layer and starting the image synthesis from a learned 4×4×512 constant tensor.
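A minimal sketch of component (D): the synthesis network starts from a single learned 4×4×512 constant that is simply repeated for each sample in the batch (names are illustrative).

```python
import torch
import torch.nn as nn

class ConstantInput(nn.Module):
    """Learned constant input replacing the traditional latent-fed first layer."""
    def __init__(self, channels=512, size=4):
        super().__init__()
        # one learned tensor, shared by every generated image
        self.const = nn.Parameter(torch.randn(1, channels, size, size))

    def forward(self, batch_size):
        # per-image variation comes only from the AdaIN styles and noise,
        # not from this starting tensor
        return self.const.expand(batch_size, -1, -1, -1)

print(ConstantInput()(batch_size=8).shape)  # torch.Size([8, 512, 4, 4])
```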

1.5. (E) Add Noise Input

Effect of Noise Inputs at Different Layers
  • The noise inputs improve the results further.

The noise image is broadcasted to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution.

  • The philosophy behind this is that, when the only source of randomness is the input layer, the network has to invent a way to generate spatially-varying pseudorandom numbers from earlier activations whenever they are needed. This consumes network capacity, and hiding the periodicity of the generated signal is difficult; dedicated per-layer noise inputs sidestep both issues.
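A minimal PyTorch sketch of such a noise input, following the description above; the learned per-feature scaling factor is the only trainable parameter (names are illustrative).

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Add a single-channel Gaussian noise image, scaled by a learned
    per-feature-map factor, to the output of a convolution."""
    def __init__(self, num_channels):
        super().__init__()
        # one learned scaling factor per feature map
        self.weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:
            # fresh per-pixel noise, broadcast across all feature maps
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.weight * noise

x = torch.randn(4, 512, 8, 8)        # convolution output
print(NoiseInjection(512)(x).shape)  # torch.Size([4, 512, 8, 8])
```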
Stochastic Variation

The noise affects only the stochastic aspects, leaving the overall composition and high-level aspects such as identity intact.

1.6. (F) Mixing Regularization

A novel mixing regularization is proposed to decorrelate neighboring styles and enable more fine-grained control over the generated imagery. During training, a given percentage of images is generated using two random latent codes instead of one: the codes are switched at a randomly selected point in the synthesis network, so that earlier layers use one style and later layers the other.
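A minimal sketch of this style-mixing step, assuming a mapping network `mapping` that turns z into w and a synthesis network with `num_layers` style inputs (these names and the 90% mixing probability are illustrative).

```python
import torch
import torch.nn as nn

def mixed_styles(mapping, z1, z2, num_layers, mixing_prob=0.9):
    """Style mixing regularization: with probability `mixing_prob`, switch
    from the styles of z1 to those of z2 at a random crossover layer."""
    w1, w2 = mapping(z1), mapping(z2)              # [N, w_dim] each
    ws = w1.unsqueeze(1).repeat(1, num_layers, 1)  # one style vector per synthesis layer
    if torch.rand(()) < mixing_prob:
        crossover = int(torch.randint(1, num_layers, ()))
        ws[:, crossover:] = w2.unsqueeze(1)        # layers after the crossover use w2's styles
    return ws                                      # [N, num_layers, w_dim]

# usage with a toy 2-layer mapping network and 18 synthesis layers
mapping = nn.Sequential(nn.Linear(512, 512), nn.LeakyReLU(0.2),
                        nn.Linear(512, 512), nn.LeakyReLU(0.2))
z1, z2 = torch.randn(4, 512), torch.randn(4, 512)
print(mixed_styles(mapping, z1, z2, num_layers=18).shape)  # torch.Size([4, 18, 512])
```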

2. Disentanglement Studies

  • Interpolation of latent-space vectors may yield surprisingly non-linear changes in the image. For example, features that are absent in either endpoint may appear in the middle of a linear interpolation path.

Intuitively, a less curved latent space should result in a perceptually smoother transition than a highly curved latent space.

  • If we subdivide a latent space interpolation path into linear segments, we can define the total perceptual length of this segmented path as the sum of perceptual differences over each segment, as reported by the image distance metric. In the limit of infinitely fine subdivision, the average perceptual path length in Z becomes:

lZ = E[ (1/ε²) · d( G(slerp(z1, z2; t)), G(slerp(z1, z2; t + ε)) ) ]

  • where z1, z2 ~ P(z), t ~ U(0, 1), G is the generator, d is the perceptual distance, slerp is spherical interpolation, and ε = 10⁻⁴ is a small subdivision step.
  • Computing the average perceptual path length in W is carried out in a similar fashion:

lW = E[ (1/ε²) · d( g(lerp(f(z1), f(z2); t)), g(lerp(f(z1), f(z2); t + ε)) ) ]

  • The only difference is that interpolation happens in W space, so linear interpolation (lerp) is used; here f is the mapping network and g is the synthesis network.
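The following sketch estimates the perceptual path length in Z by Monte-Carlo sampling, assuming the caller provides a generator `G` and a perceptual distance `d` (e.g., an LPIPS-style metric); both names are placeholders rather than the paper's exact evaluation code.

```python
import torch

def slerp(a, b, t, eps=1e-7):
    """Spherical interpolation between latent vectors a and b at fraction t."""
    a_n = a / (a.norm(dim=-1, keepdim=True) + eps)
    b_n = b / (b.norm(dim=-1, keepdim=True) + eps)
    omega = torch.acos((a_n * b_n).sum(-1, keepdim=True).clamp(-1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

def perceptual_path_length_z(G, d, z_dim=512, n_samples=10_000, epsilon=1e-4):
    """Monte-Carlo estimate of l_Z: average perceptual distance between images
    generated at t and t + epsilon along slerp paths, scaled by 1/epsilon^2."""
    total = 0.0
    for _ in range(n_samples):
        z1, z2 = torch.randn(1, z_dim), torch.randn(1, z_dim)
        t = torch.rand(1, 1)
        img_a = G(slerp(z1, z2, t))
        img_b = G(slerp(z1, z2, t + epsilon))
        total += d(img_a, img_b).item() / (epsilon ** 2)
    return total / n_samples
```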
Perceptual path lengths and separability scores for various generator architectures in FFHQ (lower is better).

Full-path length is substantially shorter for the proposed style-based generator with noise inputs, indicating that W is perceptually more linear than Z.

The effect of a mapping network

W is consistently better separable than Z, suggesting a less entangled representation. Furthermore, increasing the depth of the mapping network improves both image quality and separability in W.
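The mapping network itself is a plain MLP; below is a minimal sketch, assuming 512-dimensional Z and W and leaky-ReLU activations (the input normalization and layer widths here are commonly reported values, not copied from the official code).

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Mapping network f: Z -> W, an MLP whose depth is the hyperparameter
    varied in the table above (the paper's default is 8 layers)."""
    def __init__(self, z_dim=512, w_dim=512, depth=8):
        super().__init__()
        layers, in_dim = [], z_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # normalize the input latent before mapping
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
        return self.net(z)

print(MappingNetwork(depth=8)(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```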

3. Qualitative Results

3.1. FFHQ Dataset

FFHQ Dataset

  • A new dataset of human faces, Flickr-Faces-HQ (FFHQ), is also introduced, consisting of 70,000 high-quality images at 1024×1024 resolution with considerable variation in age, ethnicity, and image background.

3.2. Truncation Trick

Truncation Trick
  • If we consider the distribution of training data, it is clear that areas of low density are poorly represented and thus likely to be difficult for the generator to learn.

Drawing latent vectors from a truncated [42, 5] or otherwise shrunk [34] sampling space tends to improve average image quality, at the cost of some variation. In StyleGAN, this truncation is applied in W space: w' = w̄ + ψ(w − w̄), where w̄ is the average of the mapping network outputs and ψ < 1 scales the deviation from it.
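A short sketch of that truncation, assuming w̄ is maintained as a running mean of mapping-network outputs (variable names are illustrative):

```python
import torch

def truncate_w(w, w_avg, psi=0.7):
    """Truncation trick in W: pull sampled styles toward the mean style w_avg,
    trading variation for average image quality (psi=1 disables truncation)."""
    return w_avg + psi * (w - w_avg)

# usage
w_avg = torch.zeros(512)   # running mean of f(z) collected during training
w = torch.randn(4, 512)    # freshly sampled styles
w_truncated = truncate_w(w, w_avg, psi=0.7)
```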

3.3. More Examples

More Examples
  • More examples are shown above.

