Review — BigBiGAN: Large Scale Adversarial Representation Learning
Generative Adversarial Network (GAN)
- BigBiGAN is proposed, which is built upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator.
- It achieves state-of-the-art results in unsupervised (self-supervised) representation learning on ImageNet, as well as in unconditional image generation.
- BigBiGAN is used as a comparison baseline in many self-supervised learning (SSL) papers.
1.1. Addition of Encoder E
- Given a distribution Px of data x (e.g., images), and a distribution Pz of latents z (normally N(0, 1)), the generator G models a conditional distribution P(x|z) of data x given latent inputs z sampled from the latent prior Pz, as in the standard GAN generator.
- Compared to BigGAN, an encoder E is added.
The encoder E models the inverse conditional distribution P(z|x), predicting latents z given data x sampled from the data distribution Px.
1.2. Joint Discriminator D
- Besides the addition of E, the other modification to the GAN in the BiGAN framework is a joint discriminator D, which takes as input data-latent pairs (x, z) (rather than just data x as in a standard GAN), and learns to discriminate between pairs from the data distribution and encoder, versus the generator and latent distribution.
- Concretely, its inputs are pairs (x ~ Px, ẑ ~ E(x)) and (x̂ ~ G(z), z ~ Pz), and the goal of G and E is to “fool” the discriminator by making the two joint distributions P_xE and P_Gz, from which these pairs are sampled, indistinguishable.
- The adversarial minimax objective in ALI, analogous to that of the GAN framework, was defined as follows:
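For reference, the BiGAN/ALI objective is usually written in the form below. This is a reconstruction in the notation used above rather than a copy of the original figure, and it assumes that D(x, z) outputs a probability in (0, 1) that the pair came from the data/encoder side.

```latex
\min_{G,E}\,\max_{D}\;
\mathbb{E}_{x \sim P_x,\ \hat{z} \sim E(x)}\big[\log D(x, \hat{z})\big]
+ \mathbb{E}_{z \sim P_z,\ \hat{x} \sim G(z)}\big[\log\big(1 - D(\hat{x}, z)\big)\big]
```

The first term rewards D for recognizing encoder pairs (x, ẑ), and the second for rejecting generator pairs (x̂, z). G and E jointly minimize the same quantity.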
1.3. Addition of Unary Terms
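Additional unary terms are used in the learning objective, which are functions of only the data x or only the latents z. The original BiGAN objective already enforces that the joint distributions match at the global optimum, which implies that the marginal distributions of x and z match as well. The unary terms intuitively guide optimization in the “right direction” by explicitly enforcing this property on the marginals.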
- Concretely, the discriminator loss L_D and the encoder-generator loss L_EG are defined as follows, based on scalar discriminator “score” functions s_* and the corresponding per-sample losses ℓ_*:
- where h(t) = max(0, 1 − t) is a “hinge” used to regularize the discriminator.
- The discriminator D includes three submodules: F, H, and J.
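- F is a ConvNet that takes only the data x, and H is an MLP that takes only the latents z. J is a function of the outputs of F and H. The unary scores s_x and s_z are computed from F and H, and the joint score s_xz from J.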
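The per-sample losses can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. It assumes the three scalar scores s_x, s_z, and s_xz have already been computed by the discriminator, and it uses a label y = +1 for encoder pairs (x, E(x)) and y = −1 for generator pairs (G(z), z). The function names `hinge`, `per_sample_losses`, and `batch_losses` are hypothetical.

```python
def hinge(t):
    """Hinge h(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)


def per_sample_losses(s_x, s_z, s_xz, y):
    """Return (l_D, l_EG) for one (x, z) pair.

    y is +1 for an encoder pair (x, E(x)) and -1 for a generator
    pair (G(z), z); this labeling is an assumption of the sketch.
    """
    # Discriminator: each of the three scores gets its own hinge.
    l_d = hinge(y * s_x) + hinge(y * s_z) + hinge(y * s_xz)
    # Encoder/generator: the sum of scores, signed by y, with no hinge.
    l_eg = y * (s_x + s_z + s_xz)
    return l_d, l_eg


def batch_losses(encoder_scores, generator_scores):
    """Average the per-sample losses over both kinds of pairs.

    Each argument is a non-empty list of (s_x, s_z, s_xz) tuples.
    Returns (L_D, L_EG), both of which are minimized by their
    respective players.
    """
    def mean(values):
        return sum(values) / len(values)

    enc = [per_sample_losses(sx, sz, sxz, +1) for sx, sz, sxz in encoder_scores]
    gen = [per_sample_losses(sx, sz, sxz, -1) for sx, sz, sxz in generator_scores]
    loss_d = mean([d for d, _ in enc]) + mean([d for d, _ in gen])
    loss_eg = mean([eg for _, eg in enc]) + mean([eg for _, eg in gen])
    return loss_d, loss_eg
```

Dropping the s_x or s_z term from both losses would correspond to the unary-term ablations, while keeping only s_xz would correspond to a loss without unary terms.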
2.1. Ablation Studies
- Latent distribution Pz and stochastic E (Var): Instead of predicting z deterministically, E predicts a mean μ and a standard deviation σ, and the final z is obtained by reparameterized sampling, z = μ + εσ, where ε ~ N(0, I). This non-deterministic Base model achieves significantly better classification performance than the variant with a deterministic E.
- Unary loss terms (sx, sz): The x unary term has a large positive effect on generation performance, with the Base and x Unary Only rows having significantly better IS and FID than the z Unary Only and No Unaries rows.
- G capacity: A powerful image generator is indeed important for learning good representations via the encoder. Assuming this relationship holds in the future, the authors expect that better generative models are likely to lead to further improvements in representation learning.
- Bidirection: With an enhanced E taking higher input resolutions, generation with BigBiGAN in terms of FID is substantially improved over the standard GAN.
- High resolution E with varying resolution G: BigBiGAN achieves better representation learning results as the G resolution increases, up to the full E resolution of 256×256. However, the overall model is much slower to train with G at 256×256, so the remaining results use the 128×128 resolution for G.
- E architecture: As the encoder is widened, improvements are observed from RevNet-50, with the double-width RevNet outperforming a ResNet of the same capacity (rows RevNet ×2 and ResNet ×2). Further gains are seen with an even larger quadruple-width RevNet model (row RevNet ×4), which is used for the final results.
- Decoupled E/G optimization: The E optimizer is decoupled from that of G, and it is found that simply using a 10× higher learning rate for E dramatically accelerates training.
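The reparameterized sampling from the stochastic-E item above can be sketched as follows. This is a minimal standard-library illustration, not the authors' code. It assumes that E outputs a mean vector and an unconstrained "raw" scale vector, and that the raw scale is mapped to a non-negative σ with a softplus. The names `softplus`, `sample_latent`, and `sigma_raw` are hypothetical.

```python
import math
import random


def softplus(v):
    """Numerically stable log(1 + exp(v)); the result is non-negative."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def sample_latent(mu, sigma_raw, rng):
    """Draw z = mu + eps * sigma elementwise, with eps ~ N(0, 1).

    mu and sigma_raw are equal-length lists standing in for the two
    encoder outputs; rng is a random.Random instance so that sampling
    is reproducible. Using softplus to obtain sigma is an assumption
    of this sketch.
    """
    z = []
    for m, s_raw in zip(mu, sigma_raw):
        sigma = softplus(s_raw)
        eps = rng.gauss(0.0, 1.0)
        z.append(m + eps * sigma)
    return z


# Usage: two draws from the same encoder output differ only through eps.
rng = random.Random(0)
z1 = sample_latent([0.0, 1.0], [0.0, 0.0], rng)
z2 = sample_latent([0.0, 1.0], [0.0, 0.0], rng)
```

Because the randomness enters only through ε, z remains a differentiable function of μ and σ, which is what allows gradients to flow back into the encoder.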
2.2. Unsupervised ImageNet
The BigBiGAN approach, based purely on generative models, performs well for representation learning and is state-of-the-art among recent unsupervised learning results. It improves top-1 accuracy to 60.8%, up from the recently published 55.4% obtained with rotation prediction (RotNet) pre-training using the same representation learning architecture.
- It also matches the results of the concurrent work in CPCv2, which is based on contrastive predictive coding (CPC, CPCv1).
2.3. Unconditional Image Generation
2.4. Image Reconstruction
These reconstructions are far from pixel-perfect, likely due in part to the fact that no reconstruction cost is explicitly enforced by the objective — reconstructions are not even computed at training time. However, they may provide some intuition for what features the encoder E learns to model.
For example, when the input image contains a dog, person, or a food item, the reconstruction is often a different instance of the same “category” with similar pose, position, and texture.