Brief Review — SaGAN: Generative Adversarial Network with Spatial Attention for Face Attribute Editing

SaGAN, Focusing The Editing Place Using Mask

Sik-Ho Tsang
Sep 3, 2023
Face Attribute Editing

Generative Adversarial Network with Spatial Attention for Face Attribute Editing
SaGAN, by Key Lab of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology; University of Chinese Academy of Sciences; and CAS Center for Excellence in Brain Science and Intelligence Technology
2018 ECCV, Over 150 Citations (Sik-Ho Tsang @ Medium)

Generative Adversarial Network (GAN)
Image Synthesis: 2014 … 2019 [SAGAN]

  • A spatial attention mechanism is introduced into the GAN framework (referred to as SaGAN) so that only the attribute-specific region is altered and the rest of the image is kept unchanged. SaGAN consists of a generator and a discriminator:
  • The generator contains an attribute manipulation network (AMN) to edit the face image, and a spatial attention network (SAN) to localize the attribute-specific region, which restricts the alteration by AMN to that region.
  • The discriminator endeavors to distinguish the generated images from the real ones, and to classify the face attribute.


  1. SaGAN Discriminator
  2. SaGAN Generator
  3. Results

1. SaGAN Discriminator

Generative Adversarial Network with Spatial Attention (SaGAN)
  • The discriminator D, as the adversary of the generator, has two objectives: to distinguish the generated images from the real ones, and to classify the attributes of the generated and real images.
  • Both classifiers use a Softmax at the end.
  • The real/fake classifier is optimized with a standard cross-entropy loss, where I is the real image and Î is the generated image.
  • Similarly, the attribute classifier is optimized with a standard cross-entropy loss, where c^g is the ground-truth attribute label of the real image I.
  • Finally, the overall loss function for the discriminator D combines these two losses.
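The two discriminator objectives described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names, the `REAL`/`FAKE` class indices, and the equal weighting of the two terms are assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Standard cross-entropy: -log softmax(logits)[target_index]."""
    return -math.log(softmax(logits)[target_index])

REAL, FAKE = 0, 1  # hypothetical class indices for the real/fake head

def discriminator_loss(src_logits_real, src_logits_fake, cls_logits_real, true_attr):
    """Loss for one real image I and one generated image I_hat.

    src_logits_real / src_logits_fake: 2-way real/fake logits of D for I and I_hat.
    cls_logits_real: attribute logits of D for the real image I.
    true_attr: ground-truth attribute label c^g of I, as a class index.
    """
    # Real/fake term: I should be classified as real, I_hat as fake.
    loss_src = cross_entropy(src_logits_real, REAL) + cross_entropy(src_logits_fake, FAKE)
    # Attribute term: computed on the real image against its label c^g.
    loss_cls = cross_entropy(cls_logits_real, true_attr)
    # Assumption: the two terms are summed with equal weight.
    return loss_src + loss_cls, loss_src, loss_cls

# Toy logits, made up for illustration.
total, l_src, l_cls = discriminator_loss([2.0, -1.0], [-1.5, 1.0], [0.3, 1.2], true_attr=1)
```

A discriminator that outputs uninformative logits (all zeros) pays log 2 per binary decision, which gives a handy sanity check for any real implementation.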

2. SaGAN Generator

  • The generator G endeavors to translate an input face image I into an edited face image Î conditioned on an attribute value c: Î = G(I, c).
  • G contains two modules, an attribute manipulation network (AMN) denoted as F_m and a spatial attention network (SAN) denoted as F_a. AMN focuses on how to manipulate and SAN focuses on where to manipulate.
  • AMN takes a face image I and an attribute value c as input, and outputs an edited face image I_a: I_a = F_m(I, c).
  • SAN takes the face image I as input, and predicts a spatial attention mask b, which is used to restrict the alteration by AMN to this region: b = F_a(I).
  • Ideally, b is 1 in the attribute-specific region and 0 in the rest of the image. In practice, any pixel of the predicted mask with a value larger than 0 is regarded as part of the attribute-specific region.
  • Guided by the attention mask, in the final edited face image Î, the attribute-specific regions are manipulated towards the target attribute while the remaining regions stay the same: Î = G(I, c) = I_a · b + I · (1 − b).
  • Firstly, an adversarial loss is designed to confuse the real/fake classifier, following most GAN-based methods.
  • Secondly, to make Î carry the target attribute c, an attribute classification loss is designed to enforce that the attribute predicted for Î by the attribute classifier approximates the target value c.
  • Last but not least, to keep the attribute-irrelevant region unchanged, a reconstruction loss with two terms is employed, similar to CycleGAN and StarGAN, where c^g is the original attribute of the input image I.
  • The first term is the dual reconstruction loss using back translation: Î is translated back with the original attribute c^g, and the result is expected to be the same as the original image I.
  • The second term is the identity reconstruction loss, which guarantees that an input image I is not modified when edited by its own attribute label c^g.
  • The overall objective function to optimize G combines the adversarial, attribute classification and reconstruction losses.
  • For training stability, the WGAN-GP technique is used for the real/fake losses of D and G, i.e. Eq. (1) and Eq. (8) in the paper.
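For readers unfamiliar with WGAN-GP: it replaces the log-likelihood real/fake terms with critic scores plus a gradient penalty. The block below is a sketch of that general form written in this post's notation. The symbols D_src (the real/fake output of D), λ_gp (the penalty weight) and Ĩ (a random interpolation between a real and a generated image) are introduced here for illustration, and the paper's exact notation may differ.

```latex
\begin{aligned}
\mathcal{L}^{D}_{src} &= -\,\mathbb{E}_{I}\big[D_{src}(I)\big]
  + \mathbb{E}_{\hat{I}}\big[D_{src}(\hat{I})\big]
  + \lambda_{gp}\,\mathbb{E}_{\tilde{I}}\Big[\big(\lVert \nabla_{\tilde{I}} D_{src}(\tilde{I}) \rVert_{2} - 1\big)^{2}\Big], \\
\mathcal{L}^{G}_{src} &= -\,\mathbb{E}_{\hat{I}}\big[D_{src}(\hat{I})\big], \\
\tilde{I} &= \epsilon\, I + (1 - \epsilon)\,\hat{I}, \qquad \epsilon \sim U[0, 1].
\end{aligned}
```

The penalty term pushes the gradient norm of the critic towards 1 along lines between real and generated samples, which is what stabilizes training compared with the plain cross-entropy form.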
SaGAN Generator & Discriminator Model Architecture
  • The model architecture details of SaGAN Generator and Discriminator are shown above.
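To make the generator's data flow concrete, here is a minimal sketch of the mask-guided blend and the two-term reconstruction loss described in this section. Images are flat lists of pixel values to keep it dependency-free. The toy AMN and SAN, the L1 distance, and the weights `lam1` and `lam2` are assumptions for illustration only; the real networks are convolutional and learned.

```python
def blend(edited, original, mask):
    """I_hat = I_a * b + I * (1 - b), applied per pixel."""
    return [a * b + i * (1.0 - b) for a, i, b in zip(edited, original, mask)]

def make_generator(amn, san):
    """Compose G from an AMN F_m(I, c) and a SAN F_a(I)."""
    def generator(image, attr):
        edited = amn(image, attr)  # I_a: how to manipulate
        mask = san(image)          # b: where to manipulate
        return blend(edited, image, mask)
    return generator

def l1_distance(x, y):
    """Mean absolute difference between two images (assumed distance)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def reconstruction_loss(generator, image, target_attr, source_attr, lam1=1.0, lam2=1.0):
    """Dual + identity reconstruction; lam1 and lam2 are hypothetical weights."""
    # Dual reconstruction: edit towards c, then translate back with c^g.
    back_translated = generator(generator(image, target_attr), source_attr)
    # Identity reconstruction: editing with the image's own label c^g.
    identity = generator(image, source_attr)
    return lam1 * l1_distance(image, back_translated) + lam2 * l1_distance(image, identity)

# Toy stand-ins (hypothetical): the "attribute" brightens or darkens pixels,
# and the mask marks only the first two pixels as attribute-specific.
def toy_amn(image, attr):
    return [min(1.0, p + 0.5) if attr == 1 else max(0.0, p - 0.5) for p in image]

def toy_san(image):
    return [1.0, 1.0, 0.0, 0.0]

G = make_generator(toy_amn, toy_san)
face = [0.2, 0.4, 0.6, 0.8]
edited = G(face, 1)  # only the first two pixels change
```

Because the mask is 0 on the last two pixels, they pass through untouched no matter what the AMN outputs there. In this toy setup the identity term of the loss is nonzero, which is the kind of unwanted change that training is meant to penalize.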

3. Results

3.1. Visual Quality

The spatial attention masks mainly concentrate on the attribute-specific regions, and those attribute-irrelevant regions are successfully suppressed.

3.2. Visual Quality Comparison

Compared with CycleGAN and StarGAN, ResGAN and SaGAN preserve most attribute-irrelevant regions unchanged, which is preferable.

  • However, ResGAN produces some artifacts in the attribute-specific regions, especially for the eyeglasses attribute.

By contrast, SaGAN achieves favorable manipulation of the attribute-specific region and also preserves the remaining irrelevant regions unchanged.

The reason is that the generator of SaGAN contains the SAN, which explicitly detects the attribute-specific region.

  • All methods inevitably change the gender of the input face, as no_beard is correlated with gender (e.g. no woman has a beard).

Even so, SaGAN modifies the images modestly.

  • As can be seen, CycleGAN, StarGAN and ResGAN all degenerate on this dataset, producing distorted results.

SaGAN performs almost as well as on CelebA, illustrating the robustness of the proposed method.

3.3. Face Recognition

  • For each attribute, a single training sample is augmented into three samples, e.g. the original face image and the two face images with eyeglasses added and removed, respectively.
  • ResNet-18 is used as the face recognition model. On both datasets, the performance is reported in terms of ROC curves.
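The one-to-three augmentation described above can be sketched as follows. Here `edit_fn` is a hypothetical stand-in for a trained attribute editor such as the SaGAN generator, and the string "images" exist only to make the example runnable.

```python
def augment_with_attribute(image, identity, edit_fn):
    """Turn one training sample into three for a single binary attribute.

    edit_fn(image, value) is a hypothetical editor: value=1 adds the
    attribute, value=0 removes it. The identity label is kept, since
    attribute editing should not change who the person is.
    """
    return [
        (image, identity),              # original face
        (edit_fn(image, 1), identity),  # attribute added, e.g. eyeglasses on
        (edit_fn(image, 0), identity),  # attribute removed, e.g. eyeglasses off
    ]

def augment_dataset(samples, edit_fn):
    """Apply the three-way augmentation to every (image, identity) pair."""
    augmented = []
    for image, identity in samples:
        augmented.extend(augment_with_attribute(image, identity, edit_fn))
    return augmented

# Placeholder editor that just tags the image name (illustration only).
def tag_edit(image, value):
    return f"{image}+attr{value}"

samples = [("img_a", 7), ("img_b", 9)]
augmented = augment_dataset(samples, tag_edit)
```

The augmented set is three times the size of the original, and every edited image inherits the identity label of its source, which is why accurate editing of only the attribute region matters for this use case.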

CelebA (Left): As can be observed, for each attribute, the model with data augmentation performs better than the baseline model without data augmentation, as augmenting with accurately attribute-edited images from SaGAN enriches the variations of the training data, leading to a more robust model.

LFW (Right): For all face attributes except smile, the models with data augmentation perform much better than the baseline model without data augmentation.
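As a reminder of what an ROC curve measures here, the sketch below computes (false positive rate, true positive rate) points from similarity scores. It assumes a verification-style protocol in which face pairs are scored and labeled as same or different identity; that protocol, the function name, and the toy numbers are assumptions for illustration, not details taken from the paper.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct score threshold.

    scores: similarity of each face pair (higher = more likely same person).
    labels: 1 if the pair shows the same identity, 0 otherwise.
    Requires at least one positive and one negative pair.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Pairs accepted as "same person" at this threshold.
        accepted = [label for score, label in zip(scores, labels) if score >= threshold]
        true_pos = sum(accepted)
        false_pos = len(accepted) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points

# Toy scores, made up for illustration.
curve = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

A model whose curve sits higher at the same false positive rate accepts more genuine pairs, which is the sense in which the augmented models are "better" in the comparison above.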


