Review — BigGAN: Large Scale GAN Training for High Fidelity Natural Image Synthesis
Large Scale GAN Training for High Fidelity Natural Image Synthesis,
BigGAN, BigGAN-deep, by Heriot-Watt University, and DeepMind,
2019 ICLR, Over 4500 Citations (Sik-Ho Tsang @ Medium)
Generative Adversarial Network (GAN)
- BigGAN is proposed by scaling up SAGAN, with an enhanced architecture: shared class embeddings and noise-vector skip connections.
- Orthogonal regularization is applied to the generator, which makes it amenable to a simple truncation trick that allows fine control over the trade-off between sample fidelity and variety.
- ImageNet images can be generated at 128×128, 256×256, and 512×512, which were high resolutions at the time.
- BigGAN: Scaling Up GAN & Enhanced Architecture
- BigGAN: Truncation Trick & Orthogonal Regularization
1. BigGAN: Scaling Up GAN & Enhanced Architecture
- SAGAN with hinge loss is used as the baseline. Class information is provided to G with class-conditional Batch Norm and to D with projection.
- The optimization settings follow SAGAN (notably employing Spectral Norm in G), except that the learning rates are halved and two D steps are taken per G step. Moving averages of G’s weights are used during evaluation.
- Progressive growing, as used in Progressive GAN, is found to be unnecessary.
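The baseline ingredients above (hinge loss, and a moving average of G's weights for evaluation) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, scores are plain lists of discriminator outputs, and the decay value is only an illustrative default.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: mean(max(0, 1 - D(x))) + mean(max(0, 1 + D(G(z))))."""
    loss_real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    loss_fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return loss_real + loss_fake


def g_hinge_loss(fake_scores):
    """Generator hinge loss: -mean(D(G(z)))."""
    return -sum(fake_scores) / len(fake_scores)


def ema_update(avg_weights, new_weights, decay=0.9999):
    """Exponential moving average of G's weights (decay is an assumed default)."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(avg_weights, new_weights)]
```

In a training loop following the settings above, `d_hinge_loss` would be minimized twice for every `g_hinge_loss` step, and `ema_update` would be called after each G step, with the averaged weights used only at evaluation time.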
1.2. Scaling Up
Rows 1–4 of Table 1: Simply increasing the batch size by a factor of 8 improves the state-of-the-art Inception Score (IS) by 46%. The authors conjecture that this is because each batch covers more modes, providing better gradients for both networks.
- However, training at this scale becomes unstable and eventually undergoes complete collapse, so the reported scores come from checkpoints saved just before collapse.
Row 5: Then the width (number of channels) in each layer is increased by 50%. This leads to a further IS improvement of 21% due to the increased capacity of the model relative to the complexity of the dataset.
1.3. Shared Class Embeddings: Rows 6–9 (Shared)
Row 6 (Shared): The class embeddings c used for the conditional Batch Norm layers in G contain a large number of weights.
Instead of having a separate layer for each embedding, a shared embedding is used, which is linearly projected to each layer’s gains and biases.
- This reduces computation and memory costs, and improves training speed (in number of iterations required to reach a given performance) by 37%.
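To see why sharing helps, the weight counts of the two designs can be compared with some simple arithmetic. This is a rough sketch under stated assumptions: the channel widths and the embedding dimension below are hypothetical values chosen for illustration, and the per-layer projections are assumed to be bias-free linear maps.

```python
def separate_embedding_params(num_classes, layer_channels):
    """Each conditional Batch Norm layer keeps its own per-class gain and bias tables."""
    return sum(2 * num_classes * ch for ch in layer_channels)


def shared_embedding_params(num_classes, embed_dim, layer_channels):
    """One shared class embedding, linearly projected to each layer's gains and biases."""
    shared_table = num_classes * embed_dim
    projections = sum(2 * embed_dim * ch for ch in layer_channels)
    return shared_table + projections


# Hypothetical configuration: 1000 classes, 5 conditional Batch Norm layers.
channels = [1024, 1024, 512, 256, 128]
print(separate_embedding_params(1000, channels))
print(shared_embedding_params(1000, 128, channels))
```

With these assumed sizes, the shared design needs far fewer weights because the per-class table is stored once and only the small projections are repeated per layer.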
1.4. Noise Vector Skip Connection: Rows 7–9 (Skip-z)
Row 7 (Skip-z): Add direct skip connections from the noise vector z to multiple layers of G rather than just the initial layer. The intuition behind this design is to allow G to use the latent space to directly influence features at different resolutions and levels of hierarchy.
- In BigGAN, this is accomplished by splitting z into one chunk per resolution, and concatenating each chunk to the conditional vector c which gets projected to the Batch Norm gains and biases.
- In BigGAN-deep, an even simpler design is used, concatenating the entire z with the conditional vector without splitting it into chunks.
Skip-z provides a modest performance improvement of around 4%, and improves training speed by a further 18%.
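The two skip-z variants described above differ only in what each block receives. A minimal sketch using plain lists follows; the function names, the concatenation order, and the equal-sized chunks are assumptions for illustration, and the projection to Batch Norm gains and biases is left out.

```python
def split_into_chunks(z, num_chunks):
    """Split z into equal-sized chunks (len(z) must be divisible by num_chunks)."""
    chunk_size = len(z) // num_chunks
    if chunk_size * num_chunks != len(z):
        raise ValueError("len(z) must be divisible by num_chunks")
    return [z[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]


def biggan_conditioning(z, c, num_resolutions):
    """BigGAN style: one chunk of z per resolution, concatenated with c."""
    return [c + chunk for chunk in split_into_chunks(z, num_resolutions)]


def biggan_deep_conditioning(z, c, num_resolutions):
    """BigGAN-deep style: the entire z is concatenated with c at every resolution."""
    return [c + z for _ in range(num_resolutions)]
```

Each returned vector is what would then be projected to the gains and biases of the conditional Batch Norm layers at that resolution.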
1.5. BigGAN Model Architecture
1.6. BigGAN-deep Model Architecture
2. BigGAN: Truncation Trick & Orthogonal Regularization
2.1. Truncation Trick
- GANs can employ an arbitrary prior p(z), yet the vast majority of previous works have chosen to draw z from either N(0, I) or U[-1, 1].
Taking a model trained with z ~ N(0, I) and sampling z from a truncated normal (where values which fall outside a range are resampled to fall inside that range) immediately provides a boost to IS and FID.
This is called the Truncation Trick: truncating a z vector by resampling the values with magnitude above a chosen threshold leads to improvement in individual sample quality at the cost of reduction in overall sample variety.
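The resampling step can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the function name and the `rng` parameter are hypothetical, and the threshold must be positive for the loop to terminate.

```python
import random


def truncated_normal(dim, threshold, rng=None):
    """Sample z ~ N(0, I), resampling any entry whose magnitude exceeds threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if rng is None:
        rng = random.Random()
    z = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        # Resample until the value falls inside [-threshold, threshold].
        while abs(value) > threshold:
            value = rng.gauss(0.0, 1.0)
        z.append(value)
    return z
```

A smaller `threshold` keeps z closer to the mode of the prior, which trades sample variety for individual sample quality as described above.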
- Figure 2(a): Reducing the truncation threshold leads to a direct increase in IS (analogous to precision). FID penalizes lack of variety (analogous to recall) but also rewards precision.
- Figure 2(b): Some of the larger models are not amenable to truncation, producing saturation artifacts.
2.2. Orthogonal Regularization (Rows 8 & 9)
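- To make G amenable to truncation, the starting point is Orthogonal Regularization (Brock et al., 2017), which directly enforces the orthogonality condition:
$$R_\beta(W) = \beta \lVert W^\top W - I \rVert_F^2$$
where W is a weight matrix and β is a hyperparameter.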
- This regularization is known to often be too limiting (Miyato et al., 2018). It is modified to relax the constraint while still imparting the desired smoothness.
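The diagonal terms are removed from the regularization, so that it minimizes the pairwise cosine similarity between filters but does not constrain their norm:
$$R_\beta(W) = \beta \lVert W^\top W \odot (\mathbf{1} - I) \rVert_F^2$$
where 1 denotes a matrix with all elements set to 1.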
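The difference between the strict penalty and the relaxed one can be sketched in plain Python. This is an illustration rather than the authors' implementation: the function names, the treatment of each column of W as one filter, and the default β are assumptions made here.

```python
def gram_matrix(W):
    """Return W^T W, where W is a list of rows and each column is one filter."""
    num_filters = len(W[0])
    return [
        [sum(row[i] * row[j] for row in W) for j in range(num_filters)]
        for i in range(num_filters)
    ]


def strict_orthogonal_penalty(W, beta=1e-4):
    """beta * ||W^T W - I||_F^2: penalizes filter overlap and non-unit norms."""
    G = gram_matrix(W)
    n = len(G)
    return beta * sum(
        (G[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )


def relaxed_orthogonal_penalty(W, beta=1e-4):
    """Same penalty with the diagonal masked out: only pairwise terms remain."""
    G = gram_matrix(W)
    n = len(G)
    return beta * sum(G[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
```

For a matrix whose columns are mutually orthogonal but not unit-norm, such as `[[2, 0], [0, 3]]`, the strict penalty is positive while the relaxed penalty is zero, which is the sense in which the norm is left unconstrained.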
2.3. Collapse Analysis
- However, training still collapses even with the above techniques.