# Review — EBGAN: Energy-Based Generative Adversarial Network (GAN)

## Using an Autoencoder at the Discriminator, a Repelling Regularizer at the Generator

In this story, **Energy-based Generative Adversarial Network** (EBGAN), by New York University and Facebook Artificial Intelligence Research, is briefly reviewed. In this paper:

- EBGAN views the discriminator as an energy function that **attributes low energies to the regions near the data manifold** and **higher energies to other regions.**
- Similar to the probabilistic GANs, the **generator** is seen as being trained to **produce contrastive samples with minimal energies**, while the **discriminator** is trained to **assign high energies to these generated samples.**

This is a paper in **2017 ICLR** with over **1000 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Energy-Based Model**
2. **Loss Functions**
3. **Autoencoder Used at Discriminator**
4. **Experimental Results**

# 1. Energy-Based Model

- **Supervised learning** falls into this framework: for each *X* in the training set, **the energy of the pair (*X*, *Y*)** takes **low** values when *Y* is the correct **label** and **higher** values for **incorrect** *Y*’s.
- Similarly, when **modeling *X* alone** within an **unsupervised learning** setting, **lower energy is attributed to the data manifold.**
- The term **contrastive sample** is often used to refer to a data point causing an energy pull-up, such as the incorrect *Y*’s in supervised learning and points from low data density regions in unsupervised learning.
- (Btw, contrastive learning is crucial in self-supervised learning.)

# 2. Loss Functions

- **The output of the discriminator** goes through an objective functional in order to **shape the energy function**, **attributing low energy to the real data samples** and **higher energy to the generated (“fake”) ones.**
- Two different losses are used, one to train *D* and the other to train *G*, in order to get better quality gradients when the generator is far from convergence.
- **Given a positive margin** *m*, a data sample *x* and a generated sample *G*(*z*), the discriminator loss *LD* and the generator loss *LG* are formally defined by:
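The two losses from the paper can be written as:

```latex
L_D(x, z) = D(x) + \left[\, m - D\big(G(z)\big) \,\right]^{+}
\qquad
L_G(z) = D\big(G(z)\big)
```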

- where [*a*]⁺ = max(0, *a*).
- Minimizing *LG* with respect to the parameters of *G* is similar to maximizing the second term of *LD*. It has the same minimum, but non-zero gradients when *D*(*G*(*z*)) ≥ *m*.
- When *D*(*G*(*z*)) ≥ *m*, the hinge term [*m* − *D*(*G*(*z*))]⁺ in *LD* vanishes; *m* is a hyperparameter.
- If the system reaches a Nash equilibrium, then the generator *G* produces samples that are indistinguishable from the distribution of the dataset.
- (There is a mathematical proof of the optimality of the solution. Please feel free to read the paper.)
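As a minimal NumPy sketch of these definitions (the function names are mine, not from the paper):

```python
import numpy as np

def discriminator_loss(d_real, d_fake, m):
    """L_D = D(x) + [m - D(G(z))]^+ , with [a]^+ = max(0, a)."""
    return d_real + np.maximum(0.0, m - d_fake)

def generator_loss(d_fake):
    """L_G = D(G(z)): the generator simply minimizes the energy of its samples."""
    return d_fake

# When D(G(z)) >= m, the hinge term vanishes and L_D reduces to D(x).
print(discriminator_loss(0.2, 1.5, m=1.0))  # -> 0.2
print(generator_loss(1.5))                  # -> 1.5
```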

# 3. Autoencoder Used At Discriminator

## 3.1. Reasons for Using an Autoencoder

- In EBGAN, the discriminator *D* is structured as an auto-encoder.
- Rather than using a single bit of target information to train the model, the reconstruction-based output offers diverse targets for the discriminator.
- With the binary logistic loss, only two targets are possible, so the gradients of different samples within a minibatch are likely far from orthogonal.
- On the other hand, **the reconstruction loss will likely produce very different gradient directions within the minibatch.**
- When trained with some regularization terms, **auto-encoders have the ability to learn an energy manifold without supervision or negative examples.**
- Even when an EBGAN auto-encoding model is trained to reconstruct a real sample, the discriminator contributes to discovering the data manifold by itself.
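As a toy sketch of the idea (the linear layers and sizes below are my own stand-ins, not the paper’s architecture), the discriminator’s energy is the per-sample reconstruction error of an auto-encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear auto-encoder standing in for the EBGAN discriminator:
# D(x) = ||Dec(Enc(x)) - x||, i.e. the reconstruction error is the energy.
W_enc = rng.normal(size=(8, 32)) * 0.1   # encoder: 32-d input -> 8-d code
W_dec = rng.normal(size=(32, 8)) * 0.1   # decoder: 8-d code -> 32-d output

def encode(x):
    return np.tanh(x @ W_enc.T)

def discriminator_energy(x):
    code = encode(x)
    recon = code @ W_dec.T
    # Per-sample L2 reconstruction error = energy assigned by D
    return np.linalg.norm(recon - x, axis=-1)

batch = rng.normal(size=(4, 32))
energies = discriminator_energy(batch)
print(energies.shape)  # one scalar energy per sample -> (4,)
```

Because each sample gets its own real-valued reconstruction target, the gradient directions within a minibatch are far more diverse than with a single real/fake bit.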

## 3.2. EBGAN-PT: Repelling Regularizer

- **One common issue** in training auto-encoders is that the model may learn little more than an **identity function**, meaning that it **attributes zero energy to the whole space.**
- The model must be pushed to give higher energy to points outside the data manifold.
- The proposed repelling regularizer **purposely keeps the model from producing samples that are clustered in one or only a few modes of** *pdata*. It involves a **Pulling-away Term (PT)** that operates at a representation level.
- Formally, let *S* denote a batch of sample representations taken from the encoder output layer. Cosine similarity is used:
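With *N* samples in the batch, the pulling-away term from the paper is:

```latex
f_{PT}(S) = \frac{1}{N(N-1)} \sum_{i} \sum_{j \neq i}
\left( \frac{S_i^{\top} S_j}{\lVert S_i \rVert\, \lVert S_j \rVert} \right)^{2}
```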

- PT operates on a mini-batch and **attempts to orthogonalize the pairwise sample representations.**
- This variant is denoted “**EBGAN-PT**”. Note that **PT is used in the generator loss** but not in the discriminator loss.
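A minimal NumPy sketch of the pulling-away term (variable names are mine):

```python
import numpy as np

def pulling_away_term(S):
    """f_PT(S): mean squared pairwise cosine similarity over a batch of
    encoder representations S (shape: N x d), taken over all pairs i != j."""
    N = S.shape[0]
    # Normalize each representation to unit length.
    S_norm = S / np.linalg.norm(S, axis=1, keepdims=True)
    cos = S_norm @ S_norm.T           # pairwise cosine similarities
    off_diag = cos**2 - np.eye(N)     # drop the i == j terms (cos = 1)
    return off_diag.sum() / (N * (N - 1))

# Mutually orthogonal representations give f_PT = 0, the minimum.
S = np.eye(3)
print(pulling_away_term(S))  # -> 0.0
```

Adding this term to the generator loss pushes the generated samples’ representations apart, discouraging collapse onto one or a few modes.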

# 4. Experimental Results

## 4.1. MNIST Generation

- Some hyperparameters are found by grid search to find the best GAN and EBGAN.

- The generated MNIST digits shown above are from the models with the best inception score. The GAN results contain some noisy or unrecognizable digits.
- EBGAN and EBGAN-PT produce better results.

## 4.2. Semi-Supervised Learning on PI-MNIST

- The potential of using the EBGAN framework for semi-supervised learning is shown on permutation-invariant MNIST (PI-MNIST), using 100, 200 and 1000 labels.
- A crucial ingredient in enabling the EBGAN framework for semi-supervised learning is to **gradually decay the margin value** *m* of the discriminator loss. The rationale is to **let the discriminator punish the generator less when** *pG* gets closer to the data manifold.
- This margin decaying schedule is found by hyperparameter search.
- **The contrastive samples** can be thought of as an **extension to the dataset** that **provides more information to the classifier.**
- Using the Ladder Network as a baseline, a large improvement is achieved with EBGAN.
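The paper finds its decay schedule by hyperparameter search; purely as an illustration (the linear schedule below is hypothetical, not the paper’s), a gradually decayed margin might look like:

```python
def decayed_margin(step, total_steps, m_init=1.0, m_final=0.0):
    """Hypothetical linear decay of the margin m: the discriminator's
    hinge term punishes the generator less as training progresses."""
    frac = min(step / total_steps, 1.0)
    return m_init + (m_final - m_init) * frac

print(decayed_margin(0, 1000))     # -> 1.0
print(decayed_margin(1000, 1000))  # -> 0.0
```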

## 4.3. LSUN & CelebA

- The EBGAN framework is used with a deep convolutional architecture to generate 64×64 RGB images.

## 4.4. ImageNet

- Compared with the datasets experimented with so far, **ImageNet** presents an **extensively larger and wilder space**, so modeling the data distribution with a generative model becomes **very challenging**.
- Despite the difficulty of generating images at high resolution, EBGANs are able to **learn that objects appear in the foreground**, together with various background components resembling grass texture, sea under the horizon, mirrored mountains in the water, buildings, etc.
- In addition, the **256×256 dog-breed generations**, although **far from realistic**, do reflect some knowledge about the appearance of dogs, such as their bodies, fur and eyes.

## Reference

[2017 ICLR] [EBGAN]

Energy-based Generative Adversarial Network

Some Figures: https://www.slideshare.net/MingukKang/ebgan

## Generative Adversarial Network (GAN)

- **Image Synthesis**: [GAN] [CGAN] [LAPGAN] [AAE] [DCGAN] [CoGAN] [SimGAN] [BiGAN] [ALI] [LSGAN] [EBGAN]
- **Image-to-image Translation**: [Pix2Pix] [UNIT] [CycleGAN] [MUNIT]
- **Super Resolution**: [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
- **Blur Detection**: [DMENet]
- **Camera Tampering Detection**: [Mantini’s VISAPP’19]
- **Video Coding**: [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]