Brief Review — Generative Semantic Manipulation with Mask-Contrasting GAN

Mask Contrast-GAN, Outperforms CoGAN, BiGAN, CycleGAN

Sik-Ho Tsang
4 min read · Sep 11, 2023
Visualizations: Source and Target Have Different Shapes
More Visualizations

Generative Semantic Manipulation with Mask-Contrasting GAN
Mask Contrast-GAN, by Carnegie Mellon University, Sun Yat-sen University
2018 ECCV, Over 50 Citations (Sik-Ho Tsang @ Medium)

Generative Adversarial Network (GAN)
Image-to-image Translation: 2017 [Pix2Pix] [UNIT] [CycleGAN] 2018 [MUNIT] [StarGAN] [pix2pixHD] [SaGAN]
==== My Other Paper Readings Are Also Over Here ====

  • A contrasting GAN (contrast-GAN) with a novel adversarial contrasting objective is proposed, which can perform all types of semantic translations with a single category-conditional generator.
  • Distance comparisons between samples are used in the training objective, enforcing the manipulated data to be semantically closer to the real data of the target category than the input data is.


  1. Contrast-GAN
  2. Mask Contrast-GAN
  3. Results

1. Contrast-GAN

Contrast-GAN Overview
  • The feature representation of the manipulated result y′ should be closer to those of the real data {y} in the target domain Y than to that of x in the input domain X, under the object semantic cy.
  • The generator aims to minimize the contrasting distance Q(·):
  • where fx, fy and fy’ are the feature embeddings for different images x, y and y’ respectively.
  • The discriminator aims to maximize the contrasting distance:
  • To further reduce the space of possible mapping functions, the cycle-consistency loss from CycleGAN is also used, which constrains the mappings (induced by the generator G) between two object semantics to be inverses of each other:
  • Therefore, the full objective is computed:
  • so that G tries to minimize this objective against a set of adversarial discriminators {Dcy} that try to maximize it:
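The equations above appear only as images in the original post. As a rough sketch of the idea (a hypothetical softmax-contrast formulation of my own, not the paper's exact equation), the contrasting distance Q(·) and the cycle-consistency term can be written as:

```python
import numpy as np

def contrasting_distance(f_yp, f_x, f_ys):
    """Sketch of Q(.): the manipulated feature f_yp should be closer to the
    real target-domain features f_ys than to the input feature f_x.
    (Hypothetical form; the paper's exact equation is an image in the post.)"""
    d_target = min(np.sum((f_yp - f_y) ** 2) for f_y in f_ys)  # distance to target domain
    d_input = np.sum((f_yp - f_x) ** 2)                        # distance to the input
    # Loss is small when d_target << d_input, i.e. the result looks like Y, not X.
    return -np.log(np.exp(-d_target) / (np.exp(-d_target) + np.exp(-d_input)))

def cycle_consistency(x, x_reconstructed):
    """CycleGAN-style L1 cycle loss: mapping to cy and back should recover x."""
    return np.abs(x - x_reconstructed).mean()
```

The generator minimizes the contrast term (pulling the manipulated features toward the target domain), while the discriminator maximizes it; the cycle loss is added on top to regularize the mapping.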

2. Mask Contrast-GAN

Mask Contrast-GAN
  • An image has objects and background. A mask is needed to crop out the object to be manipulated.
  • Object mask is obtained from the dataset, such as MS COCO segmentation mask.
  • With the object mask obtained, a masking operation and a subsequent spatial cropping operation are performed. The background image is obtained by applying the inverse mask to the input image.
  • Then, an encoder-decoder architecture is used, taking the target category cy as an additional input.
  • The target category cy is encoded as a one-hot vector and passed through a linear layer to obtain a 64-dimensional feature embedding, which is then replicated spatially.
  • The manipulated region is warped back to the original image resolution and combined with the background image via an additive operation to obtain the final manipulated image.
  • Both the local discriminators {Dcy} defined in the proposed contrast-GAN and a global image discriminator DI are used.
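The mask-crop-manipulate-paste pipeline above can be sketched as follows (helper names and the identity-style interfaces are my own assumptions; the actual model uses learned convolutional encoder-decoder networks):

```python
import numpy as np

def manipulate_with_mask(image, mask, category_onehot, W_embed, generator):
    """Sketch of the Mask Contrast-GAN forward pass.
    image: (H, W, C) float array; mask: (H, W) binary array;
    W_embed: (num_categories, 64) linear-layer weights (hypothetical);
    generator: callable standing in for the learned encoder-decoder."""
    # 1. Masking: keep the object region; background uses the inverse mask.
    foreground = image * mask[..., None]
    background = image * (1.0 - mask[..., None])
    # 2. Category conditioning: one-hot -> linear layer -> 64-d embedding,
    #    replicated spatially over all pixel locations.
    embed = category_onehot @ W_embed                           # (64,)
    embed_map = np.broadcast_to(embed, image.shape[:2] + (64,)) # (H, W, 64)
    # 3. The encoder-decoder generator manipulates the masked object region,
    #    conditioned on the replicated category embedding.
    manipulated = generator(foreground, embed_map)
    # 4. Paste back: manipulated object + untouched background, additively.
    return manipulated * mask[..., None] + background
```

With an identity generator, the composite reproduces the input image exactly, which is a quick sanity check that the mask/inverse-mask decomposition and additive recombination are consistent.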

3. Results

3.1. FCN Scores

Labels to Photos
Photos to Labels

In both cases, the proposed contrast-GAN with its new adversarial contrasting objective outperforms the state-of-the-art methods on unpaired image-to-image translation.

3.2. Human Perception Test

Human Perception Test

The method substantially outperforms the baseline on all tasks.

3.3. Qualitative Results

  • Mask is also applied onto CycleGAN, which is treated as a baseline.
  • The baseline method often translates only very low-level information (e.g., color changes) and fails to edit the shapes and key characteristics (e.g., structure) that truly convey a specific high-level object semantic.

However, the proposed contrast-GAN performs subtle yet critical changes to object shapes and textures to satisfy the target semantic while preserving the original object characteristics.

  • Fig. 7: The original GAN networks often render the whole image with the target texture, ignoring the particular image content at different locations/regions.

Fig. 6 & Fig. 7: Mask Contrastive GAN has the promising capability of manipulating object semantics while retaining original shapes, viewpoints, and interactions with the background.


