Review: DCGAN — Deep Convolutional Generative Adversarial Network (GAN)

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks


  1. A Set of Constraints for Stable Training
  2. Network Architecture
  3. Experimental Results

1. A Set of Constraints for Stable Training

1.1. All Convolutional Net

  • The all convolutional net replaces deterministic spatial pooling functions (such as max pooling) with strided convolutions.
  • The discriminator learns its own spatial downsampling using strided convolutions.
  • Similarly, the generator learns its own spatial upsampling using fractionally-strided (transposed) convolutions.
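The size changes produced by these two operations can be worked out from the standard convolution arithmetic. A minimal sketch, assuming the commonly used 64×64 configuration with kernel size 4, stride 2, and padding 1 (these hyperparameters are an assumption, not stated in the text above):

```python
def conv_out(i, k=4, s=2, p=1):
    # strided convolution (discriminator): spatial size shrinks
    return (i + 2 * p - k) // s + 1

def deconv_out(i, k=4, s=2, p=1):
    # fractionally-strided (transposed) convolution (generator): spatial size grows
    return (i - 1) * s - 2 * p + k

# discriminator path: 64 -> 32 -> 16 -> 8 -> 4
d_sizes = [64]
for _ in range(4):
    d_sizes.append(conv_out(d_sizes[-1]))
print(d_sizes)  # [64, 32, 16, 8, 4]

# generator path: 4 -> 8 -> 16 -> 32 -> 64
g_sizes = [4]
for _ in range(4):
    g_sizes.append(deconv_out(g_sizes[-1]))
print(g_sizes)  # [4, 8, 16, 32, 64]
```

With these settings each strided convolution exactly halves the spatial resolution, and each fractionally-strided convolution exactly doubles it, so the two networks mirror each other.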

1.2. Eliminating Fully Connected Layers

  • The first layer of the generator, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack.
  • For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output.
  • Apart from these two, no fully connected layers are used in any hidden layer.
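The generator's "fully connected in name only" first layer can be sketched in a few lines. A minimal PyTorch sketch, assuming a 100-dimensional uniform noise vector projected to a 1024×4×4 tensor (both sizes are assumptions based on the common DCGAN configuration):

```python
import torch
import torch.nn as nn

z_dim, base = 100, 1024  # assumed latent size and channel count

# a single matrix multiplication, whose result is reshaped
# into a 4-D tensor that starts the convolution stack
project = nn.Linear(z_dim, base * 4 * 4)

z = torch.rand(16, z_dim) * 2 - 1        # uniform noise in [-1, 1]
x = project(z).view(-1, base, 4, 4)      # reshape to (N, C, H, W)
print(x.shape)  # torch.Size([16, 1024, 4, 4])
```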

1.3. Batch Normalization (BN)

  • BN stabilizes learning by normalizing the input to each unit to have zero mean and unit variance.
  • However, directly applying BN to all layers resulted in sample oscillation and model instability.
  • This was avoided by not applying BN to the generator output layer and the discriminator input layer.

1.4. Activation Functions

  • The ReLU activation is used in the generator with the exception of the output layer which uses the Tanh function.
  • Within the discriminator, it is found that the leaky rectified activation (LeakyReLU) works well.

2. Network Architecture

  • The generator is as shown above: only convolutions, with no fully connected layers, in accordance with the constraints.
  • The discriminator is essentially an inverse of the generator.

3. Experimental Results

  • Models are trained on three datasets: LSUN, ImageNet-1k, and a newly assembled faces dataset.

3.1. LSUN

Generated bedrooms after one epoch
Generated bedrooms after five epochs of training.
  • A model is trained on the LSUN bedrooms dataset containing a little over 3 million training examples.
  • This demonstrates that the model is not producing high-quality samples by simply overfitting to / memorizing training examples.

3.2. DCGAN Trained on ImageNet, Tested on CIFAR10 & SVHN

Accuracy (%) on CIFAR10
  • DCGAN is trained on ImageNet-1k, and the discriminator's convolutional features from all layers are then used, max-pooling each layer's representation to produce a 4×4 spatial grid. These features are flattened and concatenated to form a 28672-dimensional vector, on top of which a regularized linear L2-SVM classifier is trained.
  • This achieves 82.8% accuracy, outperforming all K-means-based approaches.
Error Rate (%) on SVHN
  • A purely supervised CNN with the same architecture is trained, but only a 28.87% error rate is obtained.
  • Using the DCGAN features with the same architecture, a 22.48% error rate is obtained.
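The feature-extraction step described above can be sketched without any deep learning framework. A minimal NumPy sketch, where the per-layer channel counts and spatial sizes are hypothetical, chosen only so that the concatenated vector matches the 28672 dimensions mentioned above:

```python
import numpy as np

def pool_to_4x4(fmap):
    # max-pool a (C, H, W) feature map down to a (C, 4, 4) spatial grid
    c, h, w = fmap.shape
    hs, ws = h // 4, w // 4
    out = np.empty((c, 4, 4))
    for i in range(4):
        for j in range(4):
            out[:, i, j] = fmap[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws].max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
# stand-ins for the discriminator's conv-layer activations on one image;
# (channels, size) pairs are hypothetical: (128+256+512+896) * 16 = 28672
layers = [rng.random((c, s, s)) for c, s in [(128, 32), (256, 16), (512, 8), (896, 4)]]

# flatten each pooled grid and concatenate into one feature vector,
# which would then be fed to the linear L2-SVM
feat = np.concatenate([pool_to_4x4(f).ravel() for f in layers])
print(feat.shape)  # (28672,)
```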

3.3. Walking in the Latent Space

Interpolation between a series of 9 random points in Z show that the space learned has smooth transitions
  • The vector z is an n-dimensional point in an n-dimensional latent space.
  • If interpolation is performed between two z vectors, a gradual change can be seen, as shown above.
  • Walking in this latent space results in semantic changes to the generated images (such as objects being added and removed), from which we can reason that the model has learned relevant and interesting representations.
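The interpolation itself is just a convex combination of two latent points. A minimal NumPy sketch, assuming a 100-dimensional z drawn from a uniform distribution (the dimensionality is an assumption); each row of the result would be fed to the generator:

```python
import numpy as np

def interpolate(z0, z1, steps=9):
    # linear interpolation between two latent points;
    # row 0 is z0, the last row is z1
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - alphas) * z0 + alphas * z1

rng = np.random.default_rng(0)
z0 = rng.uniform(-1, 1, size=100)
z1 = rng.uniform(-1, 1, size=100)

zs = interpolate(z0, z1)   # feed each of the 9 rows to the generator
print(zs.shape)  # (9, 100)
```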

3.4. Visualizing the Discriminator Features

Left: Random Filter Baseline, Right: Trained Filters
  • The above figure shows that the features learned by the discriminator activate on typical parts of a bedroom, such as beds and windows.

3.5. Forgetting to Draw Certain Objects

First Row: Models Without Dropping “Window” Filters, Second Row: Models With Dropping “Window” Filters
  • By dropping the “window” filters, some windows are removed, while others are transformed into objects with a similar visual appearance, such as doors and mirrors.
  • Although visual quality decreases, the overall scene composition stays similar, suggesting the generator has done a good job of disentangling the scene representation from the object representation.

3.6. Vector Arithmetic on Face Samples

Smiling Woman − Neutral Woman + Neutral Man = Smiling Man
Man With Glasses − Man Without Glasses + Woman Without Glasses = Woman With Glasses
  • Simple arithmetic operations revealed rich linear structure in representation space.
  • e.g.: vector("King") − vector("Man") + vector("Woman") can result in a vector whose nearest neighbor is the vector for "Queen".
  • Experiments working on only single samples per concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable generations that semantically obeyed the arithmetic.
  • As shown in the figures above, for each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors creating a new vector Y. This Y is fed into the generator as input.
  • Uniform noise of scale ±0.25 was added to Y to produce the other eight samples.
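The averaging-plus-arithmetic procedure is easy to write down. A minimal NumPy sketch, assuming a 100-dimensional latent space (an assumption) and using random stand-ins for the exemplar z vectors; in practice each of these would be the latent vector of a real sample:

```python
import numpy as np

rng = np.random.default_rng(1)

# average the z vectors of three exemplars per concept
# (single samples were unstable; averaging stabilizes the result)
smiling_woman = rng.uniform(-1, 1, size=(3, 100)).mean(axis=0)
neutral_woman = rng.uniform(-1, 1, size=(3, 100)).mean(axis=0)
neutral_man   = rng.uniform(-1, 1, size=(3, 100)).mean(axis=0)

# arithmetic on the mean vectors gives the new vector Y,
# which is fed to the generator: "smiling man"
y = smiling_woman - neutral_woman + neutral_man

# uniform noise of scale +/-0.25 added to Y gives the other 8 samples
samples = y + rng.uniform(-0.25, 0.25, size=(8, 100))
print(samples.shape)  # (8, 100)
```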
A "turn" vector was created from four averaged samples of faces looking left vs. looking right.
  • Face pose can also be modeled linearly in Z space.
Vector Arithmetic on Input Space
  • In contrast, performing the same vector arithmetic directly in pixel (input) space yields poor results.
