Review: DCGAN — Deep Convolutional Generative Adversarial Network (GAN)

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Apr 20, 2020

In this story, Deep Convolutional Generative Adversarial Network (DCGAN), by Indico Research and Facebook AI Research (FAIR), is reviewed. With DCGAN, a hierarchy of representations is learnt from object parts to scenes in both the generator and discriminator. This paper was published at ICLR 2016 and has about 6000 citations. (Sik-Ho Tsang @ Medium)

(During the days of coronavirus, I hope to write 30 stories in this month to give myself a small challenge. This is the 16th story in the month of April. Yet, today is 20th April 2020. The schedule is a little bit lagging behind. But I wish I can accomplish it, though some stories are short and are more related to my research work, i.e. video coding/compression, which is not the mainstream of deep learning development … lol)

Outline

A Set of Constraints for Stable Training
Network Architecture
Experimental Results

1. A Set of Constraints for Stable Training

1.1. All Convolutional Net

Deterministic spatial pooling functions (such as max pooling) are replaced with strided convolutions.

The discriminator learns its own spatial downsampling using strided convolutions.
Similarly, the generator learns its own spatial upsampling using fractional-strided convolutions.
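
To make the resizing concrete, here is a minimal sketch of the standard output-size formulas for a strided convolution and a fractional-strided (transposed) convolution. The kernel size 4, stride 2 and padding 1 are assumptions borrowed from common DCGAN implementations; they are not stated in this review.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def trace(size, step, n_layers):
    """Apply the same resizing step n_layers times and record every size."""
    sizes = [size]
    for _ in range(n_layers):
        size = step(size)
        sizes.append(size)
    return sizes


# Discriminator: each strided convolution halves the resolution.
discriminator_sizes = trace(64, conv_out, 4)
# Generator: each transposed convolution doubles the resolution.
generator_sizes = trace(4, conv_transpose_out, 4)

print(discriminator_sizes)
print(generator_sizes)
```

With these assumed hyperparameters, four layers are enough to move between a 4×4 feature map and a 64×64 image in either direction.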

1.2. Eliminating Fully Connected Layers

The first layer of the generator, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack.
For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output.
Apart from these, none of the hidden layers is fully connected.

1.3. Batch Normalization (BN)

BN stabilizes learning by normalizing the input to each unit to have zero mean and unit variance.
However, directly applying BN to all layers resulted in sample oscillation and model instability.
This was avoided by not applying BN to the generator output layer and the discriminator input layer.
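
As a rough illustration of the normalization step, the sketch below normalizes the activations of one unit across a mini-batch. It leaves out the learnable scale and shift that real BN layers also have, and the `eps` value is an assumed small constant for numerical stability.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize one unit's activations across a mini-batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Hypothetical activations of a single unit for a batch of four inputs.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print(normalized, new_mean, new_var)
```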

1.4. Activation Functions

The ReLU activation is used in the generator with the exception of the output layer which uses the Tanh function.
Within the discriminator, it is found that the leaky rectified activation (LeakyReLU) works well.
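
The three activations can be written in a few lines. This is only a scalar sketch; the LeakyReLU slope of 0.2 is taken as an assumption here, since this review does not state it.

```python
import math


def relu(x):
    """Generator hidden layers: negative inputs are clamped to zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.2):
    """Discriminator layers: negative inputs keep a small gradient (assumed slope)."""
    return x if x > 0 else slope * x


def tanh(x):
    """Generator output layer: squashes values into the range (-1, 1)."""
    return math.tanh(x)


for value in (-2.0, 0.5, 3.0):
    print(value, relu(value), leaky_relu(value), tanh(value))
```

Because Tanh bounds the generator output, training images are typically scaled to the same range before being shown to the discriminator.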

2. Network Architecture

The generator follows the constraints above: only convolutions are used, with no fully connected hidden layers.

**Discriminator** (Image from https://mc.ai/a-friendly-introduction-to-dcgan/)

The discriminator is just like an inverse of the generator.
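
The two networks can be summarized as layer specifications that apply the constraints from Section 1. This is a sketch, not the authors' code: the shapes are assumed to follow the generator figure in the DCGAN paper (a 100-dimensional z projected to 4×4×1024 and upsampled to 64×64×3), and the mirrored discriminator widths and the layer names are my own assumptions.

```python
def build_generator(widths=(1024, 512, 256, 128), image_channels=3):
    """Layer specs for a 64x64 generator; each transposed conv doubles the size."""
    size = 4
    layers = [{"name": "project_reshape", "shape": (widths[0], size, size),
               "batch_norm": True, "activation": "relu"}]
    for i, channels in enumerate(widths[1:], start=1):
        size *= 2
        layers.append({"name": f"deconv{i}", "shape": (channels, size, size),
                       "batch_norm": True, "activation": "relu"})
    size *= 2
    # Output layer: no batch norm, Tanh activation.
    layers.append({"name": "output", "shape": (image_channels, size, size),
                   "batch_norm": False, "activation": "tanh"})
    return layers


def build_discriminator(widths=(128, 256, 512, 1024), size=64):
    """Layer specs for the mirrored discriminator; each strided conv halves the size."""
    layers = []
    for i, channels in enumerate(widths):
        size //= 2
        # The first layer sees the input image, so it gets no batch norm.
        layers.append({"name": f"conv{i + 1}", "shape": (channels, size, size),
                       "batch_norm": i > 0, "activation": "leaky_relu"})
    # The last feature map is flattened and fed into a single sigmoid output.
    layers.append({"name": "flatten_sigmoid", "shape": (1,),
                   "batch_norm": False, "activation": "sigmoid"})
    return layers


generator = build_generator()
discriminator = build_discriminator()
for layer in generator + discriminator:
    print(layer)
```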

3. Experimental Results

DCGANs are trained on three datasets: LSUN, ImageNet-1k, and a newly assembled faces dataset.

3.1. LSUN

**Generated bedrooms after five epochs of training.**

A model is trained on the LSUN bedrooms dataset containing a little over 3 million training examples.
The results suggest that the model does not produce high-quality samples simply by overfitting to or memorizing training examples.

3.2. DCGAN Trained on ImageNet, Tested on CIFAR10 & SVHN

DCGAN is trained on ImageNet-1k, and the discriminator’s convolutional features from all layers are then used, max-pooling each layer’s representation to produce a 4×4 spatial grid. These features are flattened and concatenated to form a 28672-dimensional vector, and a regularized linear L2-SVM classifier is trained on top of them.
On CIFAR-10, this achieves 82.8% accuracy, outperforming all K-means based approaches.
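
The pooling and concatenation step can be sketched as follows. The helper names are hypothetical, and the channel counts (256, 512, 1024) are an assumption chosen only because they reproduce the 28672 figure; the review does not list the per-layer widths.

```python
def max_pool_to_grid(fmap, grid=4):
    """Max-pool one 2-D feature map (a list of rows) down to grid x grid."""
    height, width = len(fmap), len(fmap[0])
    block_h, block_w = height // grid, width // grid  # assumes exact multiples
    return [
        [max(fmap[r][c]
             for r in range(i * block_h, (i + 1) * block_h)
             for c in range(j * block_w, (j + 1) * block_w))
         for j in range(grid)]
        for i in range(grid)
    ]


def feature_length(channels_per_layer, grid=4):
    """Length of the vector after pooling every channel to grid x grid and flattening."""
    return sum(channels_per_layer) * grid * grid


# Toy 8x8 feature map whose entries increase from top-left to bottom-right.
fmap = [[row * 8 + col for col in range(8)] for row in range(8)]
pooled = max_pool_to_grid(fmap)
print(pooled)

# Assumed channel counts that are consistent with the 28672-dimensional vector.
print(feature_length((256, 512, 1024)))
```

Pooling every layer to the same 4×4 grid is what allows feature maps of different resolutions to be concatenated into one fixed-length vector for the linear classifier.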

On SVHN with 1000 labeled training examples, a purely supervised CNN with the same architecture only reaches a 28.87% error rate.
Using the DCGAN discriminator features with a linear L2-SVM instead, a 22.48% error rate is obtained.

3.3. Walking in the Latent Space

**Interpolation between a series of 9 random points in Z shows that the space learned has smooth transitions**

The vector Z is an n-dimensional vector in an n-dimensional latent space.
If interpolation is performed between two Z vectors, a gradual change can be seen as shown above.
If walking in this latent space results in semantic changes to the generated images (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations.
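
Walking between two latent points can be sketched with plain linear interpolation. The dimension of 100, the uniform range of [-1, 1] and the number of steps are assumptions for illustration, and each interpolated vector would then be passed to a trained generator, which is not shown here.

```python
import random


def lerp(z0, z1, t):
    """Point at fraction t on the straight line from z0 to z1."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]


def interpolate(z0, z1, steps):
    """Evenly spaced latent vectors from z0 to z1, endpoints included."""
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]


rng = random.Random(0)
dim = 100  # assumed latent dimension
z_start = [rng.uniform(-1, 1) for _ in range(dim)]
z_end = [rng.uniform(-1, 1) for _ in range(dim)]

path = interpolate(z_start, z_end, steps=10)
print(len(path), len(path[0]))
```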

3.4. Visualizing the Discriminator Features

**Left: Random Filter Baseline, Right: Trained Filters**

The above figure shows that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows.

3.5. Forgetting to Draw Certain Objects

**First Row: Models Without Dropping “Window” Filters, Second Row: Models With Dropping “Window” Filters**

By dropping the “Window” filters, some windows are removed, while others are transformed into objects with a similar visual appearance, such as doors and mirrors.
Although visual quality decreased, overall scene composition stayed similar, suggesting the generator has done a good job disentangling scene representation from object representation.

3.6. Vector Arithmetic on Face Samples

**Smiling Woman − Neutral Woman + Neutral Man = Smiling Man**

**Man With Glasses − Man Without Glasses + Woman Without Glasses = Woman With Glasses**

In word embeddings, simple arithmetic operations revealed rich linear structure in the representation space.
For example, vector("King") − vector("Man") + vector("Woman") results in a vector whose nearest neighbor is the vector for "Queen".
Experiments working on only single samples per concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable generations that semantically obeyed the arithmetic.
As shown in the figures above, for each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors creating a new vector Y. This Y is fed into the generator as input.
Uniform noise sampled with scale ±0.25 was then added to Y to produce the other 8 samples.
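
The averaging, arithmetic and noise steps can be sketched as below. The random vectors only stand in for the Z vectors of hand-picked exemplar samples, and the dimension of 100 and the helper names are assumptions; feeding the results to a trained generator is not shown.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of several Z vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def concept_arithmetic(a, b, c):
    """Return mean(a) - mean(b) + mean(c); each argument is a list of Z vectors."""
    mean_a, mean_b, mean_c = mean_vector(a), mean_vector(b), mean_vector(c)
    return [x - y + z for x, y, z in zip(mean_a, mean_b, mean_c)]


def jitter(y, rng, scale=0.25, count=8):
    """Add uniform noise in [-scale, scale] to y to get `count` nearby vectors."""
    return [[v + rng.uniform(-scale, scale) for v in y] for _ in range(count)]


rng = random.Random(0)
dim = 100  # assumed latent dimension


def sample_z():
    return [rng.uniform(-1, 1) for _ in range(dim)]


# Three exemplar Z vectors per concept (random placeholders here).
smiling_woman = [sample_z() for _ in range(3)]
neutral_woman = [sample_z() for _ in range(3)]
neutral_man = [sample_z() for _ in range(3)]

y = concept_arithmetic(smiling_woman, neutral_woman, neutral_man)
variants = jitter(y, rng)
print(len(y), len(variants))
```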