Review — BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis

BigGAN & BigGAN-deep Generate High-Resolution Images

Sik-Ho Tsang
5 min read · Aug 22, 2023
Class-conditional samples generated by BigGAN

Large Scale GAN Training for High Fidelity Natural Image Synthesis,
BigGAN, BigGAN-deep, by Heriot-Watt University, and DeepMind,
2019 ICLR, Over 4500 Citations (Sik-Ho Tsang @ Medium)


  • BigGAN is proposed by scaling up SAGAN, together with an enhanced architecture: shared class embeddings and noise vector skip connections.
  • Orthogonal regularization is applied to improve the generator performance. A simple truncation trick is proposed to allow fine control over the trade-off between sample fidelity and variety.
  • ImageNet images can be generated at 128×128, 256×256, and 512×512, which was high resolution at the time.


  1. BigGAN: Scaling Up GAN & Enhanced Architecture
  2. BigGAN: Truncation Trick & Orthogonal Regularization
  3. Results

1. BigGAN: Scaling Up GAN & Enhanced Architecture

Scaling Up SAGAN With Additional Techniques

1.1. Baseline

  • SAGAN with hinge loss is used as baseline. Class information is provided to G with class-conditional Batch Norm and provided to D with projection.
  • The optimization settings follow SAGAN (notably employing Spectral Norm in G), except that the learning rates are halved and two D steps are taken per G step. Moving averages of G’s weights are used during evaluation.
  • Progressive growing in Progressive GAN is found to be NOT necessary.
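
As a rough illustration of the baseline ingredients above, here is a minimal NumPy sketch of the hinge losses and of one moving-average step on G's weights. The function names, the dict-of-weights layout, and the decay value are assumptions made for this example; they are not taken from the review.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss on raw (unbounded) D outputs."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    loss_real = np.maximum(0.0, 1.0 - d_real).mean()  # push D(x) above +1
    loss_fake = np.maximum(0.0, 1.0 + d_fake).mean()  # push D(G(z)) below -1
    return loss_real + loss_fake

def g_hinge_loss(d_fake):
    """Generator loss: raise D's score on generated samples."""
    return -np.asarray(d_fake, dtype=float).mean()

def ema_update(avg_weights, weights, decay=0.9999):
    """One moving-average step over a dict of G weights.

    The decay value here is an assumption for illustration.
    """
    return {k: decay * avg_weights[k] + (1.0 - decay) * weights[k]
            for k in weights}
```

In a training loop following the settings above, `d_hinge_loss` would be minimized twice for every `g_hinge_loss` step, and the averaged weights would only be used at evaluation time.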

1.2. Scaling Up

Rows 1–4 of Table 1: Simply increasing the batch size by a factor of 8 improves the state-of-the-art Inception Score (IS) by 46%. This is a result of each batch covering more modes, providing better gradients for both networks.

  • However, training with large batches can become unstable and undergo complete collapse. The reported scores are obtained from checkpoints saved just before collapse.

Row 5: Then the width (number of channels) in each layer is increased by 50%. This leads to a further IS improvement of 21% due to the increased capacity of the model relative to the complexity of the dataset.

1.3. Shared Class Embeddings: Rows 6–9 (Shared)

Row 6 (Shared): The class embeddings c used for the conditional Batch Norm layers in G contain a large number of weights.

Instead of having a separate layer for each embedding, a shared embedding is used, which is linearly projected to each layer’s gains and biases.

  • This reduces computation and memory costs, and improves training speed (in number of iterations required to reach a given performance) by 37%.
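
The shared-embedding idea can be sketched as follows: one embedding table is looked up once per class, and each layer only owns a small linear projection to its own gains and biases. All sizes, the variable names, and the choice to centre the gain at 1 are assumptions for this example, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES, EMBED_DIM = 10, 8   # illustrative sizes (assumed)
LAYER_CHANNELS = [16, 32]        # channels of two hypothetical G layers

# One embedding table shared by every conditional Batch Norm layer.
shared_embed = rng.normal(size=(NUM_CLASSES, EMBED_DIM))

# Each layer only owns linear projections from the shared embedding.
gain_proj = [rng.normal(scale=0.1, size=(EMBED_DIM, ch)) for ch in LAYER_CHANNELS]
bias_proj = [rng.normal(scale=0.1, size=(EMBED_DIM, ch)) for ch in LAYER_CHANNELS]

def conditional_batch_norm(x, labels, layer, eps=1e-5):
    """x: (N, C, H, W) activations; labels: (N,) integer class ids."""
    c = shared_embed[labels]               # (N, EMBED_DIM)
    gain = 1.0 + c @ gain_proj[layer]      # (N, C); gain centred at 1 (assumed)
    bias = c @ bias_proj[layer]            # (N, C)
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gain[:, :, None, None] * x_hat + bias[:, :, None, None]
```

With a separate embedding per layer, every layer would store two `NUM_CLASSES × C` tables; here the class-dependent table is stored once and reused.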

1.4. Noise Vector Skip Connection: Rows 7–9 (Skip-z)

Row 7 (Skip-z): Add direct skip connections from the noise vector z to multiple layers of G rather than just the initial layer. The intuition behind this design is to allow G to use the latent space to directly influence features at different resolutions and levels of hierarchy.

  • In BigGAN, this is accomplished by splitting z into one chunk per resolution, and concatenating each chunk to the conditional vector c which gets projected to the Batch Norm gains and biases.
  • In BigGAN-deep, an even simpler design is used, concatenating the entire z with the conditional vector without splitting it into chunks.

Skip-z provides a modest performance improvement of around 4%, and improves training speed by a further 18%.
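
The two skip-z variants described above can be sketched as below. This is a simplified illustration: the helper names and the dimensions in the usage note are assumptions, and the sketch follows the description in this review (one chunk of z per resolution) rather than any particular implementation.

```python
import numpy as np

def biggan_conditioning(z, c, num_resolutions):
    """BigGAN-style: split z into one chunk per resolution and
    concatenate each chunk with the class vector c.

    z: (N, z_dim), c: (N, c_dim); z_dim must divide evenly.
    """
    chunks = np.split(z, num_resolutions, axis=1)
    return [np.concatenate([chunk, c], axis=1) for chunk in chunks]

def biggan_deep_conditioning(z, c, num_resolutions):
    """BigGAN-deep-style: every resolution sees the entire z with c."""
    zc = np.concatenate([z, c], axis=1)
    return [zc for _ in range(num_resolutions)]
```

For example, with a 120-dimensional z, a 128-dimensional c, and 6 resolutions (illustrative numbers), the first function yields six vectors of size 20 + 128 = 148 per sample, and the second yields six copies of a 248-dimensional vector. Each of these vectors is what would then be projected to the Batch Norm gains and biases.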

1.5. BigGAN Model Architecture

BigGAN Model Architecture (Paper Appendix)

1.6. BigGAN-deep Model Architecture

BigGAN-deep Model Architecture (Paper Appendix)

2. BigGAN: Truncation Trick & Orthogonal Regularization

2.1. Truncation Trick

Truncation Trick
  • GANs can employ an arbitrary prior p(z), yet the vast majority of previous works have chosen to draw z from either N(0, I) or U[-1, 1].

Taking a model trained with z ~ N(0, I) and sampling z from a truncated normal (where values which fall outside a range are resampled to fall inside that range) immediately provides a boost to IS and FID.

This is called the Truncation Trick: truncating a z vector by resampling the values with magnitude above a chosen threshold leads to improvement in individual sample quality at the cost of reduction in overall sample variety.
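
A minimal sketch of the resampling step, assuming z is drawn elementwise from a standard normal. The function name and signature are made up for this example.

```python
import numpy as np

def truncated_z(batch_size, dim, threshold, rng):
    """Sample z ~ N(0, I), resampling every entry whose magnitude
    exceeds `threshold` until all entries fall inside the range."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    z = rng.standard_normal((batch_size, dim))
    mask = np.abs(z) > threshold
    while mask.any():
        # Redraw only the out-of-range entries.
        z[mask] = rng.standard_normal(int(mask.sum()))
        mask = np.abs(z) > threshold
    return z
```

A smaller threshold keeps z closer to the mode of the prior, which is what trades variety for per-sample fidelity; a very large threshold leaves the samples essentially unchanged.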

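  • Figure 2(a): As in the figure above, reducing the truncation threshold leads to a direct increase in IS (analogous to precision). FID penalizes lack of variety (analogous to recall) but also rewards precision, so it first improves moderately and then worsens sharply as the threshold approaches zero and variety diminishes.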
  • Figure 2(b): Some of the larger models are not amenable to truncation, producing saturation artifacts.

2.2. Orthogonal Regularization (Rows 8 & 9)

  • Orthogonal Regularization (Brock et al., 2017) directly enforces the orthogonality condition: R_β(W) = β‖WᵀW − I‖²_F, where W is a weight matrix and β is a hyperparameter.
  • This regularization is known to often be too limiting (Miyato et al., 2018), so it is modified to relax the constraint while still imparting the desired smoothness.

The diagonal terms are removed from the regularization, so that it minimizes the pairwise cosine similarity between filters but does not constrain their norm: R_β(W) = β‖WᵀW ⊙ (1 − I)‖²_F, where 1 denotes a matrix with all elements set to 1.
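
The original penalty and the relaxed variant described above can be compared in a few lines of NumPy. This is a sketch under assumptions: filters are taken to be the columns of W, and the default β is an illustrative value rather than one stated in this review.

```python
import numpy as np

def ortho_reg_original(W, beta=1e-4):
    """Penalize the full deviation of the Gram matrix W^T W from I."""
    gram = W.T @ W
    return beta * np.sum((gram - np.eye(gram.shape[0])) ** 2)

def ortho_reg_relaxed(W, beta=1e-4):
    """Penalize only the off-diagonal Gram entries, so the filter
    norms (the diagonal) are left unconstrained."""
    gram = W.T @ W
    off_diag = gram * (1.0 - np.eye(gram.shape[0]))
    return beta * np.sum(off_diag ** 2)
```

For a matrix such as `3 * np.eye(4)`, whose columns are mutually orthogonal but not unit-norm, the relaxed penalty is zero while the original penalty is not, which is exactly the relaxation being described.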

2.3. Collapse Analysis

  • However, training collapse still occurs even with the above techniques.

3. Results

3.1. ImageNet


BigGAN outperforms the previous state-of-the-art IS and FID scores.

3.2. JFT-300M


BigGAN works even on a much larger dataset at the same model capacity (64 base channels).

3.3. Interpolation


The proposed model convincingly interpolates between disparate samples.


