Review — SimGAN: Learning from Simulated and Unsupervised Images through Adversarial Training (GAN)

Synthetic Images Become More Realistic

Apr 17, 2021

**Simulated+Unsupervised (S+U) Learning in SimGAN**

In this story, Learning from Simulated and Unsupervised Images through Adversarial Training (SimGAN), by Apple Inc., is reviewed. In this paper:

Simulated+Unsupervised (S+U) learning is proposed, where the task is to learn a model to improve the realism of a simulator’s output using unlabeled real data, while preserving the annotation information from the simulator.
Synthetic images are used as inputs instead of random vectors.
A ‘self-regularization’ term, a local adversarial loss, and update of the discriminator using a history of refined images, are also suggested.

This is a 2017 CVPR paper with over 1300 citations. (Sik-Ho Tsang @ Medium)

Outline

Overview of SimGAN
Adversarial Loss with Self-Regularization
Local Adversarial Loss
Updating Discriminator using a History of Refined Images
Experimental Results

1. Overview of SimGAN

The output of the simulator is refined with a refiner neural network, R, that minimizes the combination of a local adversarial loss and a ‘self-regularization’ term.
The adversarial loss ‘fools’ a discriminator network, D, that classifies an image as real or refined.
The self-regularization term minimizes the image difference between the synthetic and the refined images.
The refiner network and the discriminator network are updated alternately.

Simulated+Unsupervised (S+U) learning not only adds realism to the synthetic images but also preserves their annotations.
For example, the gaze direction should be preserved after refinement.

2. Adversarial Loss with Self-Regularization

2.1. Discriminator D

A set of unlabeled real images yi∈Y is used to learn a refiner Rθ(x) that refines a synthetic image x; ~x denotes the refined image.
The discriminator network, D, is trained to classify the images as real vs refined.
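Thus, the discriminator loss is the cross-entropy error for a two-class classification problem:

$$\mathcal{L}_D(\Phi) = -\sum_i \log\big(D_\Phi(\tilde{x}_i)\big) - \sum_j \log\big(1 - D_\Phi(y_j)\big)$$

where DΦ(.) is the probability of the input being a synthetic (refined) image, and 1 - DΦ(.) that of a real one.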
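
As a rough illustration of this two-class cross-entropy, the sketch below evaluates the discriminator loss in plain Python. The function name `discriminator_loss`, the `eps` guard, and the example probabilities are made up for illustration; `d_refined` and `d_real` stand for D's predicted probability of "refined" on refined and real images respectively.

```python
import math

def discriminator_loss(d_refined, d_real, eps=1e-12):
    """Cross-entropy for real-vs-refined classification.

    d_refined: D's outputs on refined images (target label: refined).
    d_real:    D's outputs on real images (target label: real).
    eps:       small constant to avoid log(0); an assumption of this sketch.
    """
    loss_refined = -sum(math.log(p + eps) for p in d_refined)
    loss_real = -sum(math.log(1.0 - p + eps) for p in d_real)
    return loss_refined + loss_real

# A discriminator that separates the two classes well gets a lower loss
# than one that outputs 0.5 everywhere.
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

A discriminator that always answers 0.5 pays log 2 per image, which is the value the refiner tries to push it toward.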

2.2. Refiner R (Generator)

A refiner Rθ(x) refines a synthetic image x, where θ denotes the parameters of the refiner network R:

$$\tilde{x} := R_\theta(x)$$

where ~x is the refined image. The refined image ~x should look like a real image in appearance while preserving the annotation information from the simulator.
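The parameters θ of the refiner network R are learnt by minimizing a combination of two losses:

$$\mathcal{L}_R(\theta) = \sum_i \Big[ \ell_{\text{real}}(\theta; x_i, Y) + \lambda\, \ell_{\text{reg}}(\theta; x_i) \Big]$$

where xi is the i-th synthetic training image and λ weights the second term.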
The first part of the cost, lreal, adds realism to the synthetic images, while the second part, lreg, preserves the annotation information.
To add realism to the synthetic image, we need to bridge the gap between the distributions of synthetic and real images. To do this, the realism loss function lreal is:

$$\ell_{\text{real}}(\theta; x_i, Y) = -\log\big(1 - D_\Phi(R_\theta(x_i))\big)$$

By minimizing this loss function, the refiner forces the discriminator to fail to classify the refined images as synthetic.

In addition to generating realistic images, the refiner network should preserve the annotation information of the simulator.
For example, for gaze estimation the learned transformation should not change the gaze direction, and for hand pose estimation the location of the joints should not change.

A self-regularization loss that minimizes the per-pixel difference between a feature transform of the synthetic and refined images is used:

$$\ell_{\text{reg}} = \lVert \Psi(\tilde{x}) - \Psi(x) \rVert_1$$

where Ψ is the mapping from image space to a feature space, and the l1 norm is used.
The feature transform Ψ can be identity mapping, image derivatives, mean of color channels, or a learned transformation.
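Thus, the overall refiner loss function becomes:

$$\mathcal{L}_R(\theta) = \sum_i \Big[ -\log\big(1 - D_\Phi(R_\theta(x_i))\big) + \lambda \lVert \Psi(R_\theta(x_i)) - \Psi(x_i) \rVert_1 \Big]$$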
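
To make the two terms concrete, here is a minimal per-image sketch of the refiner loss in plain Python. The names `refiner_loss` and `l1_distance`, the weight `lam=0.1`, and the toy pixel values are assumptions chosen for illustration, not values from the paper; images are represented as flat lists of pixel intensities and Ψ defaults to the identity mapping.

```python
import math

def l1_distance(a, b):
    """Sum of absolute per-element differences (l1 norm of a - b)."""
    return sum(abs(u - v) for u, v in zip(a, b))

def refiner_loss(d_on_refined, refined, synthetic, lam=0.1,
                 psi=lambda img: img, eps=1e-12):
    """Realism term plus lam times the self-regularization term.

    d_on_refined: D's probability that the refined image is synthetic.
    lam:          hypothetical weight for the regularization term.
    psi:          feature transform; identity by default.
    """
    l_real = -math.log(1.0 - d_on_refined + eps)
    l_reg = l1_distance(psi(refined), psi(synthetic))
    return l_real + lam * l_reg

synthetic = [0.0, 0.5, 1.0]
refined = [0.1, 0.5, 0.9]   # small per-pixel change
loss = refiner_loss(0.5, refined, synthetic)
```

The realism term falls as D becomes less sure the image is synthetic, while the l1 term grows with every pixel the refiner changes, which is what keeps the refined image tied to its annotation.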

Rθ is implemented as a fully convolutional neural net without striding or pooling, so it modifies the synthetic image at the pixel level rather than holistically modifying the image content (as, e.g., a fully connected encoder network would), thus preserving the global structure and annotations.

3. Local Adversarial Loss

**Illustration of local adversarial loss**

When training a single strong discriminator network, the refiner network tends to over-emphasize certain image features to fool the current discriminator network, leading to drifting and producing artifacts.
Yet, any local patch sampled from the refined image should have similar statistics to a real image patch.
Therefore, rather than defining a global discriminator network, a discriminator network should be defined that classifies all local image patches separately.
The discriminator D is designed as a fully convolutional network that outputs a w×h probability map of patches belonging to the fake class, where w×h is the number of local patches in the image.

The adversarial loss function is the sum of the cross-entropy losses over the local patches.
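
A small sketch of that sum, assuming the discriminator has already produced a w×h probability map for one image. The function name `local_adversarial_loss` and the 2×2 example map are hypothetical; every patch inherits the label of the image it was cut from.

```python
import math

def local_adversarial_loss(prob_map, is_refined, eps=1e-12):
    """Sum of per-patch cross-entropy losses over a w x h probability map.

    prob_map:   nested list; each entry is D's probability that the
                corresponding local patch is refined (fake).
    is_refined: True if the image is refined, False if it is real.
    """
    total = 0.0
    for row in prob_map:
        for p in row:
            if is_refined:
                total += -math.log(p + eps)
            else:
                total += -math.log(1.0 - p + eps)
    return total

prob_map = [[0.5, 0.5], [0.5, 0.5]]   # toy 2x2 map of patch probabilities
loss = local_adversarial_loss(prob_map, is_refined=True)
```

Because each patch contributes its own term, one image yields many training samples for the discriminator.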

4. Updating Discriminator using a History of Refined Images

**Illustration of using a history of refined images**

Another problem of adversarial training is that the discriminator network only focuses on the latest refined images.
This lack of memory may cause (i) divergence of the adversarial training, and (ii) the refiner network re-introducing the artifacts that the discriminator has forgotten about.

A method to improve the stability of adversarial training is to update the discriminator using a history of refined images.

Let B be the size of the buffer and b be the mini-batch size.
The discriminator loss function is computed by sampling b/2 images from the current refiner network, and sampling an additional b/2 images from the buffer to update parameters Φ.
After each training iteration, b/2 samples in the buffer are randomly replaced with the newly generated refined images.
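
The buffer logic can be sketched in a few lines of plain Python. The class and function names below are hypothetical, images are stood in for by strings, and two details are assumptions of this sketch rather than statements from the paper: the buffer is filled before any replacement starts, and while it still holds fewer than b/2 images the mini-batch is topped up from the current refiner output.

```python
import random

class RefinedImageHistory:
    """Fixed-size buffer of previously refined images (capacity = B)."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.buffer = []
        self.rng = rng or random.Random()

    def sample(self, n):
        """Draw up to n distinct images from the buffer."""
        return self.rng.sample(self.buffer, min(n, len(self.buffer)))

    def update(self, new_images):
        """Fill free slots first, then overwrite distinct random slots."""
        new_images = list(new_images)
        free = self.capacity - len(self.buffer)
        self.buffer.extend(new_images[:free])
        rest = new_images[free:]
        count = min(len(rest), len(self.buffer))
        slots = self.rng.sample(range(len(self.buffer)), count)
        for slot, image in zip(slots, rest):
            self.buffer[slot] = image

def discriminator_batch(history, current_refined, b):
    """Build a mini-batch: b/2 current images plus b/2 from the history."""
    half = b // 2
    from_history = history.sample(half)
    # Top up from the current output while the buffer is still short.
    from_current = current_refined[: b - len(from_history)]
    history.update(current_refined[:half])
    return from_current + from_history

history = RefinedImageHistory(capacity=8, rng=random.Random(0))
for step in range(5):
    current = [f"refined_{step}_{i}" for i in range(4)]
    batch = discriminator_batch(history, current, b=4)
```

Sampling happens before the buffer is updated, so the history half of each batch always comes from earlier refiner states.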

4. Experimental Results

4.1. Network Architecture

4.1.1. Eye Gaze Estimation

The refiner network, R, is a residual network (ResNet).
Each ResNet block consists of 2 convolutional layers containing 64 feature maps.
An input image of size 55×35 is convolved with 3×3 filters that output 64 feature maps. The output is passed through 4 ResNet blocks.
The output of the last ResNet block is passed to a 1×1 convolutional layer producing 1 feature map corresponding to the refined synthetic image.
The discriminator network, D, contains 5 convolution layers and 2 max-pooling layers.

4.1.2. Hand Pose Estimation

The architecture is the same as for eye gaze estimation, except the input image size is 224×224, filter size is 7×7, and 10 ResNet blocks are used.
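
A quick way to see why a refiner without striding or pooling keeps the image size is to trace the spatial dimensions layer by layer. The sketch below does this for the two configurations described above. It assumes "same" zero padding, a single-channel input, the same kernel size inside the ResNet blocks as in the first layer, and a height/width ordering of 35×55 for the eye images; none of these details are stated in the text, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=None):
    """Output length of a convolution along one spatial dimension."""
    if padding is None:
        padding = kernel // 2   # "same" padding for odd kernels (assumption)
    return (size + 2 * padding - kernel) // stride + 1

def refiner_shapes(height, width, kernel=3, n_blocks=4, channels=64):
    """Trace (name, channels, height, width) through the refiner."""
    shapes = [("input", 1, height, width)]
    h, w = conv_out(height, kernel), conv_out(width, kernel)
    shapes.append(("conv_in", channels, h, w))
    for i in range(n_blocks):
        for _ in range(2):   # 2 convolutional layers per ResNet block
            h, w = conv_out(h, kernel), conv_out(w, kernel)
        shapes.append((f"resblock_{i + 1}", channels, h, w))
    h, w = conv_out(h, 1), conv_out(w, 1)   # final 1x1 convolution
    shapes.append(("conv_out_1x1", 1, h, w))
    return shapes

eye = refiner_shapes(35, 55)                             # gaze refiner
hand = refiner_shapes(224, 224, kernel=7, n_blocks=10)   # hand pose refiner
```

Under these assumptions the output has exactly the input resolution in both cases, so each output pixel corresponds to the same location in the synthetic image.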

4.2. Appearance-based Gaze Estimation

4.2.1. Qualitative Results

Gaze estimation is a key ingredient for many human computer interaction (HCI) tasks.
Annotating eye images with a gaze direction vector is challenging even for humans, so large amounts of annotated synthetic data are used for training instead.
The gaze estimation dataset consists of 1.2M synthetic images from the UnityEyes simulator and 214K real images from the MPIIGaze dataset.

SimGAN can help to produce more realistic samples.

**Example output of SimGAN for the UnityEyes gaze estimation dataset**

The above figure shows examples of synthetic, real and refined images from the eye gaze dataset.

SimGAN successfully captures the skin texture, sensor noise and the appearance of the iris region in the real images. Also, SimGAN preserves the annotation information (gaze direction) while improving the realism.

4.2.2. Self-regularization in Feature Space

**Self-regularization in feature space for color images**

When there is a significant shift between the distributions of the synthetic and real images, a pixel-wise L1 difference may be too restrictive.
In that case, the mean of the RGB channels is used as the feature transform Ψ instead of the identity map.

As shown, the network trained using this feature transform is able to generate realistic color images.
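
The effect of swapping the identity map for a channel mean can be seen on a toy example. In the sketch below, an image is a list of (r, g, b) tuples, and all names and pixel values are hypothetical. The refined image changes the colors of the first pixel but keeps its channel mean.

```python
def channel_mean(image):
    """Psi: map each (r, g, b) pixel to the mean of its color channels."""
    return [sum(pixel) / len(pixel) for pixel in image]

def identity_flat(image):
    """Psi: identity mapping, flattened to a list of channel values."""
    return [c for pixel in image for c in pixel]

def self_regularization(refined, synthetic, psi):
    """l1 distance between the feature transforms of the two images."""
    return sum(abs(a - b) for a, b in zip(psi(refined), psi(synthetic)))

synthetic = [(0.2, 0.4, 0.6), (0.5, 0.5, 0.5)]
refined = [(0.6, 0.4, 0.2), (0.5, 0.5, 0.5)]   # recolored, same channel mean

loss_identity = self_regularization(refined, synthetic, identity_flat)
loss_mean = self_regularization(refined, synthetic, channel_mean)
```

With the identity map the recoloring is penalized, while with the channel mean it is essentially free, which gives the refiner room to change colors without drifting in overall intensity.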

4.2.3. Using a History of Refined Images for Updating the Discriminator

Training without the history of refined images results in an increased gaze estimation error of 12.2 degrees, in comparison to 7.8 degrees with the history.

4.2.4. Visual Turing Test

**Results of the ‘Visual Turing test’ user study for classifying real vs refined images**

Each subject was shown a random selection of 50 real images and 50 refined images in a random order.
10 subjects chose the correct label 517 times out of 1000 trials (p = 0.148), meaning they were not able to reliably distinguish real images from refined synthetic ones.

The average human classification accuracy was 51.7% (chance = 50%).
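
As a sanity check on these numbers, the chance-level hypothesis can be tested with an exact binomial tail. The text does not say which test was used, so the one-sided test below is an assumption, and the function name is hypothetical; it does give a value close to the reported p.

```python
from math import comb

def one_sided_p_value(correct, trials):
    """P(X >= correct) for X ~ Binomial(trials, 0.5), computed exactly."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

accuracy = 517 / 1000
p_value = one_sided_p_value(517, 1000)
print(f"accuracy = {accuracy:.3f}, one-sided p = {p_value:.3f}")
```

A p-value this far above the usual 0.05 threshold means 517 correct answers out of 1000 is consistent with guessing.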

4.2.5. Quantitative Results

**Comparison of SimGAN to the state-of-the-art on the MPIIGaze dataset of real eyes**

Training the CNN on the refined images outperforms the state-of-the-art on the MPIIGaze dataset, with a relative improvement of 21%.

4.3. Hand Pose Estimation from Depth Images

The NYU hand pose dataset is used. It contains 72,757 training frames and 8,251 testing frames captured by 3 Kinect cameras: one frontal and 2 side views.

4.3.1. Qualitative Results

**Example refined test images for the NYU hand pose dataset**

The main source of noise in real depth images is depth discontinuity at the edges, which SimGAN is able to learn without requiring any label information.

4.3.2. Quantitative Results

**Comparison of a hand pose estimator trained on synthetic data, real data, and the output of SimGAN. (The results are at distance d = 5 pixels from ground truth.)**

A fully convolutional hand pose estimator CNN similar to Stacked Hourglass Net (Newell ECCV’16) is used. 2 hourglass blocks are used, and the output heatmap size is 64×64.

Training on refined synthetic data (the output of SimGAN, which does not require any labeling for the real images) outperforms the model trained on real images with supervision by 8.8%.

4.3.3. Importance of Using a Local Adversarial Loss

A global adversarial loss uses a fully connected layer in the discriminator network, classifying the whole image as real vs refined.

The local adversarial loss removes the artifacts and makes the generated image significantly more realistic.

Reference

[2017 CVPR] [SimGAN]
Learning from Simulated and Unsupervised Images through Adversarial Training
