Review — pix2pixHD: High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

pix2pixHD, Higher Resolution Than pix2pix

Sik-Ho Tsang
4 min read · Aug 18, 2023
Example Results

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
pix2pixHD, by NVIDIA Corporation and UC Berkeley
2018 CVPR, Over 3800 Citations (Sik-Ho Tsang @ Medium)

Image-to-image Translation: 2017 [Pix2Pix] [UNIT] [CycleGAN] 2018 [MUNIT]

  • pix2pixHD is proposed, which has a novel adversarial loss, as well as new multi-scale generator and discriminator architectures.
  • Object instance segmentation information is incorporated, which enables object manipulations such as removing/adding objects and changing the object category.


  1. pix2pixHD
  2. Results

1. pix2pixHD

pix2pixHD Generator

Coarse-to-fine generator: The generator is decomposed into two sub-networks: G1 and G2. G1 is used as the global generator network and G2 is used as the local enhancer network.

  • G1 operates at a resolution of 1024×512, and G2 outputs an image with 4× as many pixels as G1's output (2× along each dimension, i.e., 2048×1024).

1.1. Global Generator Network

G1 is built on the architecture proposed in Perceptual Loss, which has proven successful for neural style transfer on images up to 512×512.

  • It consists of 3 components: a convolutional front-end G(F)1, a set of residual blocks G(R)1, and a transposed convolutional back-end G(B)1.
  • A semantic label map of resolution 1024×512 is passed through the 3 components sequentially to output an image of resolution 1024×512.

1.2. Local Generator Network

  • G2 also consists of 3 components: a convolutional front-end G(F)2, a set of residual blocks G(R)2, and a transposed convolutional back-end G(B)2. The resolution of the input label map to G2 is 2048×1024.

Unlike in the global generator network, the input to the residual blocks G(R)2 is the element-wise sum of two feature maps: the output feature map of G(F)2, and the last feature map of the back-end of the global generator network G(B)1. This helps integrate the global information from G1 into G2.
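The fusion step can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions: average pooling stands in for the learned stride-2 front-end G(F)2, random arrays stand in for real feature maps, and the channel count and spatial sizes are made up. The helper names `downsample2x` and `fuse_features` are hypothetical.

```python
import numpy as np

def downsample2x(x):
    """Average-pool a (C, H, W) array by 2 along H and W.

    Assumption: this stands in for the learned stride-2 convolution
    in G2's front-end; it is not the paper's actual layer.
    """
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def fuse_features(g2_frontend_out, g1_last_features):
    """Element-wise sum that feeds G2's residual blocks."""
    if g2_frontend_out.shape != g1_last_features.shape:
        raise ValueError("feature maps must have identical shapes")
    return g2_frontend_out + g1_last_features

rng = np.random.default_rng(0)
channels = 4
# Toy sizes: G2 sees a 16x32 input, G1 works at half that resolution (8x16).
g2_input_features = rng.standard_normal((channels, 16, 32))
g1_last = rng.standard_normal((channels, 8, 16))

g2_front = downsample2x(g2_input_features)  # now 8x16, matching G1
fused = fuse_features(g2_front, g1_last)
print(fused.shape)  # (4, 8, 16)
```

The shape check makes the key constraint explicit: the sum is only defined once G2's front-end has brought its features down to the resolution at which G1 operates.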

1.3. Improved Adversarial Loss

A feature matching loss LFM based on the discriminator is introduced.
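For reference, the loss can be written as follows. This is reconstructed from the paper's definition as I understand it; the symbols s (input label map), x (corresponding real image), D_k (one of the discriminators) and D_k^{(i)} (its i-th layer feature extractor) are notation from the paper, not from the text above.

```latex
\mathcal{L}_{FM}(G, D_k) = \mathbb{E}_{(s,x)} \sum_{i=1}^{T} \frac{1}{N_i}
  \left[ \left\| D_k^{(i)}(s, x) - D_k^{(i)}(s, G(s)) \right\|_1 \right]
```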

  • This loss stabilizes the training as the generator has to produce natural statistics at multiple scales.
  • In this loss, T is the total number of discriminator layers and Ni denotes the number of elements in layer i.
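A minimal NumPy sketch of this idea for a single training pair and a single discriminator: per layer, take the L1 distance between the discriminator features of the real and the synthesized image, divide by the number of elements, and sum over layers. The function name and the toy feature arrays are hypothetical, and the expectation over training pairs is left out.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Sum over layers of the mean absolute feature difference.

    real_feats / fake_feats: lists of arrays, one per discriminator layer,
    extracted from the real and the synthesized image respectively.
    Dividing by the element count plays the role of 1 / N_i.
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("need the same number of layers for both inputs")
    loss = 0.0
    for real, fake in zip(real_feats, fake_feats):
        loss += np.abs(real - fake).sum() / real.size
    return loss

# Toy "discriminator features" for T = 2 layers.
real = [np.zeros((2, 4, 4)), np.ones((4, 2, 2))]
fake = [np.ones((2, 4, 4)), np.ones((4, 2, 2))]
print(feature_matching_loss(real, fake))  # 1.0
```

In the toy example the first layer differs by 1 everywhere and the second layer matches exactly, so the loss is 1.0 + 0.0.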

The full objective combines both GAN loss and feature matching loss:
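As I recall it from the paper, with three discriminators D_1, D_2, D_3 operating at different image scales and a weight λ that balances the two terms (both are notation from the paper, not from the text above), the objective reads:

```latex
\min_G \left( \left( \max_{D_1, D_2, D_3} \sum_{k=1,2,3} \mathcal{L}_{GAN}(G, D_k) \right)
  + \lambda \sum_{k=1,2,3} \mathcal{L}_{FM}(G, D_k) \right)
```

Note that the discriminators only maximize the GAN term; the feature matching term is minimized by the generator, with the discriminators acting as feature extractors.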

1.4. Using Instance Maps

  • A semantic label map is an image in which each pixel value represents the object class of that pixel. This map does not differentiate between objects of the same category, whereas an instance map gives each individual object its own ID.

The most critical information the instance map provides, which is not available in the semantic label map, is the object boundary. This is especially useful for street scenes, since many parked cars or walking pedestrians are often next to one another.
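A boundary map can be derived from the instance map with a simple neighbor comparison. The sketch below assumes the rule I recall from the paper: a pixel is marked 1 if its instance ID differs from any of its 4-neighbors, and 0 otherwise. The function name and the toy instance map are hypothetical.

```python
import numpy as np

def instance_boundary_map(instance_ids):
    """Return a 0/1 map marking pixels whose ID differs from any 4-neighbor."""
    ids = np.asarray(instance_ids)
    boundary = np.zeros(ids.shape, dtype=bool)
    # Compare each pixel with its left/right neighbors.
    diff_h = ids[:, 1:] != ids[:, :-1]
    boundary[:, 1:] |= diff_h
    boundary[:, :-1] |= diff_h
    # Compare each pixel with its upper/lower neighbors.
    diff_v = ids[1:, :] != ids[:-1, :]
    boundary[1:, :] |= diff_v
    boundary[:-1, :] |= diff_v
    return boundary.astype(np.uint8)

# Two "cars" (IDs 1 and 2) side by side share the same semantic class,
# but the instance map still reveals the border between them.
instances = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
])
print(instance_boundary_map(instances))
# [[0 1 1 0]
#  [0 1 1 0]]
```

A semantic label map of the same scene would be constant (every pixel is "car"), so this border could not be recovered from it.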

Instance-Wise Features
  • Additional low-dimensional feature channels are used as input to the generator network, alongside the label map.

To generate the low-dimensional features, an encoder network E is used to find a low-dimensional feature vector that corresponds to the ground truth target for each instance.

  • An instance-wise average pooling layer is added to the output of the encoder to compute the average feature for the object instance. The average feature is then broadcast to all the pixel locations of the instance.
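The pooling-and-broadcast step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name, the (C, H, W) layout and the toy values are assumptions.

```python
import numpy as np

def instance_average_pool(features, instance_ids):
    """Replace each pixel's feature with the mean over its instance.

    features: (C, H, W) encoder output.
    instance_ids: (H, W) integer instance map.
    """
    pooled = np.empty_like(features, dtype=float)
    for inst in np.unique(instance_ids):
        mask = instance_ids == inst            # (H, W) boolean mask
        mean = features[:, mask].mean(axis=1)  # (C,) average feature
        pooled[:, mask] = mean[:, None]        # broadcast to every pixel
    return pooled

# One feature channel, two instances (IDs 1 and 2).
features = np.array([[[1.0, 3.0, 10.0],
                      [5.0, 7.0, 20.0]]])
instance_ids = np.array([[1, 1, 2],
                         [1, 1, 2]])
pooled = instance_average_pool(features, instance_ids)
print(pooled[0].tolist())  # [[4.0, 4.0, 15.0], [4.0, 4.0, 15.0]]
```

After pooling, every pixel of an instance carries the same feature vector, so changing that one vector changes the appearance of the whole object.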

A model trained with instance boundary maps renders more photo-realistic object boundaries.

2. Results

2.1. Quantitative Results

  • Semantic segmentation is run on the synthesized images, and the predicted segments are compared against the input label map. The intuition is that if the synthesized images are realistic and correspond to the input label map, an off-the-shelf semantic segmentation model (PSPNet in this paper) should be able to predict the ground-truth labels.
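Two common ways to score such a comparison are pixel accuracy and mean intersection-over-union (IoU), which, as far as I recall, are the metrics the paper reports. The sketch below computes both on toy label maps; the function names and data are hypothetical, and the segmentation model itself is not included.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the target label."""
    return float((pred == target).mean())

def mean_iou(pred, target, num_classes):
    """Average per-class IoU, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# target: the input label map; pred: labels predicted on the synthesized image.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])
print(pixel_accuracy(pred, target))  # 0.875
print(mean_iou(pred, target, num_classes=2))
```

Here one of eight pixels is mislabeled, giving an accuracy of 7/8; the per-class IoUs are 3/4 and 4/5, so the mean IoU is about 0.775.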

pix2pixHD obtains better segmentation scores than the compared methods.

Single vs Multi-Scale Discriminators

Using multi-scale discriminators helps produce higher quality results as well as stabilize the adversarial training.
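As I understand the paper, the multi-scale setup uses three discriminators with the same architecture, each looking at the image at a different scale (full, 1/2 and 1/4 resolution). Building those inputs can be sketched as below; plain 2×2 average pooling is an assumption for the downsampling step, and the helper names are hypothetical.

```python
import numpy as np

def downsample2x(img):
    """Average-pool an (H, W, C) image by 2 along height and width."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def image_pyramid(img, num_scales=3):
    """Return [full, 1/2, 1/4, ...] resolution copies of img."""
    pyramid = [img]
    for _ in range(num_scales - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

# Toy stand-in for a real or synthesized image.
image = np.zeros((64, 128, 3))
for k, scaled in enumerate(image_pyramid(image), start=1):
    print(f"D{k} input shape: {scaled.shape}")
# D1 input shape: (64, 128, 3)
# D2 input shape: (32, 64, 3)
# D3 input shape: (16, 32, 3)
```

The discriminator on the coarsest image sees the largest effective receptive field and can judge global consistency, while the one at full resolution focuses on fine detail.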

2.2. Qualitative Results

Cityscapes, NYU, ADE20K, Helen Face Datasets

pix2pixHD obtains more realistic results than prior methods such as pix2pix.


