Review: Self-Attention Generative Adversarial Networks (SAGAN)

GAN Using Transformer-Style Self-Attention in the Form of a Non-Local Neural Network

Sik-Ho Tsang
Dec 1, 2021
The proposed SAGAN generates images by leveraging complementary features in distant portions of the image rather than local regions of fixed shape to generate consistent objects/scenarios

In this story, Self-Attention Generative Adversarial Networks (SAGAN), by Rutgers University and Google Brain, is reviewed. Ian Goodfellow, the inventor of GAN, is one of the authors. In this paper:

  • SAGAN is proposed, which allows attention-driven, long-range dependency modeling for image generation tasks.
  • The discriminator can check that highly detailed features in distant portions of the image are consistent with each other.
  • Spectral normalization is applied to both the GAN generator and the discriminator.

This is a paper in 2019 ICML with over 2100 citations. (Sik-Ho Tsang @ Medium)


  1. Self-Attention Generative Adversarial Network (SAGAN)
  2. Techniques to Stabilize GAN Training
  3. Experimental Results

1. Self-Attention Generative Adversarial Network (SAGAN)

The proposed self-attention mechanism
  • SAGAN adopts self-attention, as used in the Transformer, in the form of the non-local block from Non-Local Neural Networks.
  • The image features from the previous hidden layer x are first transformed into two feature spaces f, g to calculate the attention:
  • βj,i indicates the extent to which the model attends to the ith location when synthesizing the jth region. Then the output of the attention layer is o = (o1, o2, … , oj, …, oN):
  • Wf, Wg and Wh are the learned weight matrices, which are implemented as 1×1 convolutions.
  • The output of the attention layer is further multiplied by a learnable scale parameter and added back to the input feature map. Therefore, the final output is given by:
  • (Please feel free to read Transformer and Non-Local Neural Networks.)
  • In SAGAN, the proposed attention module has been applied to both generator and discriminator, which are trained in an alternating fashion by minimizing the hinge version of the adversarial loss:
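
The equations that the bullets above lead into are not reproduced in this text. As a reconstruction from the definitions given there (the two feature spaces f and g, the attention weights βj,i, the weight matrices Wf, Wg and Wh, the scale parameter, and the hinge adversarial loss), they can be written roughly as below. The symbols s_{ij}, γ and y_i are notation assumed for this sketch, and class conditioning of D and G is omitted for brevity, so please check against the paper before relying on the exact form.

```latex
\begin{aligned}
f(x) &= W_f x, \qquad g(x) = W_g x, \qquad h(x) = W_h x, \\
s_{ij} &= f(x_i)^{\top} g(x_j), \qquad
\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \\
o_j &= \sum_{i=1}^{N} \beta_{j,i}\, h(x_i), \\
y_i &= \gamma\, o_i + x_i, \\
L_D &= -\mathbb{E}_{x \sim p_{\text{data}}}\big[\min(0,\, -1 + D(x))\big]
       -\mathbb{E}_{z \sim p_z}\big[\min(0,\, -1 - D(G(z)))\big], \\
L_G &= -\mathbb{E}_{z \sim p_z}\big[D(G(z))\big].
\end{aligned}
```

Here N is the number of feature locations from the previous hidden layer, and γ is the scale parameter that controls how much of the attention output is added back to the input.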

2. Techniques to Stabilize GAN Training

  • Two techniques are used.

2.1. Spectral Normalization (SN)

  • First, spectral normalization in SNGAN is used in the generator as well as in the discriminator.
  • Doing so constrains the Lipschitz constant by restricting the spectral norm of each layer.
  • Spectral normalization does not require extra hyper-parameter tuning (setting the spectral norm of all weight layers to 1 consistently performs well in practice).
  • It is found that spectral normalization in the generator can prevent the escalation of parameter magnitudes and avoid unusual gradients.
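
As a rough illustration of what "restricting the spectral norm of each layer" means, the sketch below divides a weight matrix by an estimate of its largest singular value, obtained by power iteration. The function name spectral_normalize, the iteration count, and the example matrix are assumptions made for this sketch. In actual GAN training the estimate is usually refreshed with very few iterations per step and the vector u is kept between steps; running many iterations on a fixed matrix here is only to keep the example standalone.

```python
import numpy as np

def spectral_normalize(W, n_iters=100, seed=0):
    """Return (W / sigma, sigma), where sigma estimates the spectral norm of W.

    Hypothetical helper for illustration: sigma is found by power iteration
    on a 2-D weight matrix W of shape (out_features, in_features).
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    v = np.zeros(W.shape[1])
    for _ in range(n_iters):
        # Alternate between right and left singular vector estimates.
        v = W.T @ u
        v = v / (np.linalg.norm(v) + 1e-12)
        u = W @ v
        u = u / (np.linalg.norm(u) + 1e-12)
    sigma = float(u @ W @ v)  # estimate of the largest singular value
    return W / sigma, sigma

# Example: a matrix whose largest singular value is 3 by construction.
W_demo = np.array([[3.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 0.0]])
W_sn, sigma_demo = spectral_normalize(W_demo)
```

After normalization, the largest singular value of W_sn is approximately 1, which is the constraint described in the bullets above.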

2.2. Two-Timescale Update Rule (TTUR)

  • Second, the two-timescale update rule (TTUR) is effective in addressing slow learning in regularized discriminators.
  • In previous work, regularization of the discriminator [16, 7] often slows down the GAN learning process. In practice, methods using regularized discriminators typically require multiple (e.g., 5) discriminator update steps per generator update step during training.
  • TTUR uses separate learning rates for the generator and the discriminator, specifically to compensate for the problem of slow learning in a regularized discriminator.
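
To make the idea of separate learning rates concrete, here is a minimal sketch of one 1:1 alternating update with the hinge loss from Section 1, where the discriminator and generator each take one gradient step at their own learning rate. Everything specific is an assumption for illustration: the toy 1-D models D(x) = w·x + b and G(z) = a·z + c, the function names, and the default values lr_d = 4e-4 and lr_g = 1e-4, which are not stated in this article.

```python
import numpy as np

def hinge_d_loss(d_real, d_fake):
    """Hinge loss for the discriminator, given its scores on real and fake samples."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))

def hinge_g_loss(d_fake):
    """Hinge-GAN loss for the generator: raise the discriminator score on fakes."""
    return -np.mean(d_fake)

def ttur_step(params, x_real, z, lr_d=4e-4, lr_g=1e-4):
    """One D step then one G step on a toy GAN, each with its own learning rate.

    params = (w, b, a, c) for D(x) = w*x + b and G(z) = a*z + c (hypothetical toy models).
    Gradients are written out by hand so the sketch only needs NumPy.
    """
    w, b, a, c = params
    # Discriminator step: gradient of hinge_d_loss with respect to w and b.
    fake = a * z + c
    real_active = (1.0 - (w * x_real + b)) > 0  # where the real-sample hinge is active
    fake_active = (1.0 + (w * fake + b)) > 0    # where the fake-sample hinge is active
    grad_w = np.mean(-x_real * real_active) + np.mean(fake * fake_active)
    grad_b = np.mean(-1.0 * real_active) + np.mean(1.0 * fake_active)
    w, b = w - lr_d * grad_w, b - lr_d * grad_b
    # Generator step (using the updated D): gradient of hinge_g_loss w.r.t. a and c.
    grad_a = -w * np.mean(z)
    grad_c = -w
    a, c = a - lr_g * grad_a, c - lr_g * grad_c
    return (w, b, a, c)

# Example usage with toy data.
rng = np.random.default_rng(0)
params = (0.0, 0.0, 1.0, 0.0)
for _ in range(10):
    params = ttur_step(params, x_real=rng.normal(2.0, 0.5, 64), z=rng.standard_normal(64))
```

The point of the sketch is only the structure: the update ratio stays 1:1, and the balance between the two players is tuned through the two learning rates rather than through extra discriminator steps.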

3. Experimental Results

3.1. Evaluating the Proposed Stabilization Techniques

Training curves for the baseline model and models with the proposed stabilization techniques
  • In the baseline model, SN is only utilized in the discriminator. When we train it with 1:1 balanced updates for the discriminator (D) and the generator (G), the training becomes very unstable. It exhibits mode collapse very early in training.
  • As shown in the middle sub-figures, adding SN to both the generator and the discriminator greatly stabilizes the model “SN on G/D”. However, sample quality does not improve monotonically during training. For example, the image quality as measured by Fréchet Inception distance (FID) and Inception score (IS) starts to drop at the 260k-th iteration.
  • When applying the imbalanced learning rates to train the discriminator and the generator, the quality of images generated by the model “SN on G/D+TTUR” improves monotonically during the whole training process.
128×128 examples randomly generated by the baseline model and our models “SN on G/D” and “SN on G/D+TTUR”

Example images randomly generated by this model at different iterations can be found in the above figure.

3.2. Self-Attention Mechanism

Comparison of Self-Attention and Residual block on GANs
  • Several SAGAN models are built by adding the self-attention mechanism to different stages of the generator and discriminator.

SAGAN models with the self-attention mechanism at the middle-to-high level feature maps (e.g., feat32 and feat64) achieve better performance than the models with the self-attention mechanism at the low level feature maps (e.g., feat8 and feat16).

  • For example, the FID of the model “SAGAN, feat8” is improved from 22.98 to 18.28 by “SAGAN, feat32”.

The attention mechanism gives more power to both generator and discriminator to directly model the long-range dependencies in the feature maps.

  • Compared with residual blocks with the same number of parameters, the self-attention blocks also achieve better results. For example, the training is not stable when the self-attention block is replaced with the residual block in 8×8 feature maps, which leads to a significant decrease in performance (e.g., FID increases from 22.98 to 42.13).
  • This comparison demonstrates that the performance improvement given by using SAGAN is not simply due to an increase in model depth and capacity.
Visualization of attention maps
  • In each cell, the first image shows three representative query locations with color coded dots.

For example, in the top-left cell, the red point attends mostly to the body of the bird around it, whereas the green point learns to attend to the other side of the image. In this way, the image has a consistent background.

  • Similarly, the blue point allocates the attention to the whole tail of the bird to make the generated part coherent. Those long-range dependencies could not be captured by convolutions with local receptive fields.
  • As shown in the top-right cell, SAGAN is able to draw dogs with clearly separated legs. The blue query point shows that attention helps to get the structure of the joint area correct.
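
To make the attention maps above concrete, here is a minimal NumPy sketch of the self-attention computation described in Section 1, assuming the feature map is flattened to N = H×W locations so that the 1×1 convolutions become plain matrix multiplications. The function name self_attention, the toy shapes, the reduced channel count C_bar, and the random weights are illustrative assumptions, not the authors' implementation. Each row beta[j] is the kind of map a visualization like the one above would show for query location j.

```python
import numpy as np

def self_attention(x, W_f, W_g, W_h, gamma):
    """Self-attention over a flattened feature map.

    x: (C, N) features at N = H*W locations.
    W_f, W_g: (C_bar, C) and W_h: (C, C), standing in for 1x1 convolutions.
    Returns (y, beta), where beta[j, i] is how much location j attends to location i.
    """
    f = W_f @ x                               # (C_bar, N)
    g = W_g @ x                               # (C_bar, N)
    h = W_h @ x                               # (C, N)
    s = f.T @ g                               # s[i, j] = f(x_i)^T g(x_j)
    s = s - s.max(axis=0, keepdims=True)      # subtract max for a stable softmax
    e = np.exp(s)
    beta = (e / e.sum(axis=0, keepdims=True)).T  # softmax over i; rows of beta sum to 1
    o = h @ beta.T                            # o[:, j] = sum_i beta[j, i] * h[:, i]
    y = gamma * o + x                         # scaled attention output added to the input
    return y, beta

# Example: an 8-channel 4x4 feature map with random (untrained) weights.
H, W, C, C_bar = 4, 4, 8, 1
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H * W))
W_f = rng.standard_normal((C_bar, C))
W_g = rng.standard_normal((C_bar, C))
W_h = rng.standard_normal((C, C))
y, beta = self_attention(x, W_f, W_g, W_h, gamma=0.5)

# Attention map for the query location at row 1, column 2 of the feature map.
query = 1 * W + 2
attention_map = beta[query].reshape(H, W)
```

Because every query location has its own full H×W map, a query can place weight on locations anywhere in the image, which is what lets the model relate distant regions such as the two sides of the background.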

3.3. SOTA Comparison

Comparison of the proposed SAGAN with state-of-the-art GAN models [19, 17] for class conditional image generation on ImageNet

As shown in the above table, the proposed SAGAN achieves the best Inception score and FID.

SAGAN can better approximate the original image distribution by using the self-attention module to model the global dependencies between image regions.

128×128 example images generated by SAGAN for different classes. Each row shows samples from one class
  • The above figure shows some sample images generated by SAGAN.


[2019 ICML] [SAGAN]
Self-Attention Generative Adversarial Networks

Generative Adversarial Network (GAN)

Image Synthesis: 2014 [GAN] [CGAN] 2015 [LAPGAN] 2016 [AAE] [DCGAN] [CoGAN] [VAE-GAN] [InfoGAN] 2017 [SimGAN] [BiGAN] [ALI] [LSGAN] [EBGAN] 2019 [SAGAN]
Image-to-image Translation: 2017 [Pix2Pix] [UNIT] [CycleGAN] 2018 [MUNIT]
Super Resolution: 2017 [SRGAN & SRResNet] [EnhanceNet] 2018 [ESRGAN]
Blur Detection: 2019 [DMENet]
Camera Tampering Detection: 2019 [Mantini’s VISAPP’19]
Video Coding: 2018 [VC-LAPGAN] 2020 [Zhu TMM’20] 2021 [Zhong ELECGJ’21]
