Review — CycleGAN: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (GAN)

Using Cycle Consistency Loss for Unpaired Image-to-Image Translation, Outperforms CoGAN, BiGAN, ALI & SimGAN

Sik-Ho Tsang
May 15, 2021
CycleGAN learns to automatically “translate” an image from one domain into the other

In this story, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, (CycleGAN), by Berkeley AI Research (BAIR) laboratory, UC Berkeley, is reviewed.

For many tasks, paired training data will not be available.

In this paper:

  • CycleGAN is designed to translate an image from a source domain X to a target domain Y in the absence of paired examples, i.e. it learns a mapping G: X→Y.
  • Because this mapping is highly under-constrained, it is coupled with an inverse mapping F: Y→X, and a cycle consistency loss is introduced to enforce F(G(x)) ≈ x (and vice versa).

This is a paper in 2017 ICCV with over 8300 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Paired & Unpaired Training Data
  2. Cycle Consistency
  3. CycleGAN
  4. Ablation Study
  5. Quantitative Evaluation
  6. Qualitative Results

1. Paired & Unpaired Training Data

Paired training data (left) and Unpaired training data (right)
  • Paired training data consists of training examples {x_i, y_i}, where a correspondence between x_i and y_i exists.
  • Most supervised learning is applied to paired training data. However, obtaining paired training data can be difficult and expensive.
  • Unpaired training data consists of a source set {x_i} (x_i ∈ X) and a target set {y_j} (y_j ∈ Y), with no information provided as to which x_i matches which y_j.

CycleGAN seeks to learn to translate between domains without paired input-output examples.
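
To make the unpaired setting concrete, here is a minimal data-loading sketch (PyTorch assumed; the folder paths and class name are hypothetical). Each sample draws x and y independently, so no correspondence between the two domains is ever assumed:

```python
# Minimal sketch of an unpaired two-domain dataset (hypothetical paths).
import random
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class UnpairedDataset(Dataset):
    def __init__(self, root_x, root_y, transform=None):
        self.files_x = sorted(Path(root_x).glob("*.jpg"))  # source domain X
        self.files_y = sorted(Path(root_y).glob("*.jpg"))  # target domain Y
        self.transform = transform

    def __len__(self):
        return max(len(self.files_x), len(self.files_y))

    def __getitem__(self, idx):
        x = Image.open(self.files_x[idx % len(self.files_x)]).convert("RGB")
        # Sample y at random, so x and y are never treated as a pair.
        y = Image.open(random.choice(self.files_y)).convert("RGB")
        if self.transform:
            x, y = self.transform(x), self.transform(y)
        return {"x": x, "y": y}
```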

2. Cycle Consistency

  • A mapping G: X→Y should be learnt such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y.
  • The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y.
  • Yet, infinitely many mappings G can induce the same output distribution, so this objective alone is difficult to optimize; standard procedures often lead to the well-known problem of mode collapse, where all inputs map to the same output image.
  • A further property is therefore exploited: translation should be “cycle consistent”.

Mathematically, if we have a translator G: X→Y and another translator F: Y→X, then G and F should be inverses of each other.

A cycle consistency loss is added that encourages F(G(x))≈x and G(F(y))≈y.

Combining this loss with adversarial losses on domains X and Y yields the full objective for unpaired image-to-image translation.

3. CycleGAN

(a) CycleGAN containing two mapping functions G and F, (b) forward cycle-consistency loss, (c) backward cycle-consistency loss

3.1. Adversarial Loss

  • For the mapping function G: X→Y and its discriminator D_Y, the adversarial loss is:

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))]

  • where G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y.
  • A similar adversarial loss is introduced for the mapping function F: Y→X and its discriminator D_X.
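
As a minimal sketch of this loss (PyTorch assumed; g, d_y, x, y are hypothetical stand-ins, and d_y is assumed to return raw logits), the non-saturating form could look like the following. Note that the paper later replaces this log-likelihood form with a least-squares loss (Section 3.6):

```python
import torch
import torch.nn.functional as F


def adversarial_losses(g, d_y, x, y):
    """Adversarial loss for one direction, e.g. G: X -> Y with critic D_Y."""
    fake_y = g(x)                       # translated sample G(x)
    real_logits = d_y(y)
    fake_logits = d_y(fake_y.detach())  # detach: D update must not touch G

    # D_Y: classify real y as 1 and translated G(x) as 0.
    loss_d = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits)
    ) + F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits)
    )

    # G: fool D_Y into classifying G(x) as real (non-saturating form).
    gen_logits = d_y(fake_y)
    loss_g = F.binary_cross_entropy_with_logits(
        gen_logits, torch.ones_like(gen_logits)
    )
    return loss_g, loss_d
```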

3.2. Cycle Consistency Loss

  • Adversarial losses alone cannot guarantee that the learned function maps an individual input x_i to a desired output y_i.
  • It is argued that the learned mapping functions should be cycle-consistent.
  • Forward cycle consistency: for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x→G(x)→F(G(x))≈x.
  • Similarly, backward cycle consistency: y→F(y)→G(F(y))≈y.
  • This behaviour is encouraged with an L1 cycle consistency loss:

L_cyc(G, F) = E_{x∼p_data(x)}[‖F(G(x)) − x‖₁] + E_{y∼p_data(y)}[‖G(F(y)) − y‖₁]
The input images x, output images G(x) and the reconstructed images F(G(x))
  • The reconstructed images F(G(x)) end up matching closely to the input images x.
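
A minimal sketch of this L1 cycle consistency loss (PyTorch assumed; g, f, x, y are hypothetical stand-ins for the two generators and a batch from each domain):

```python
import torch


def cycle_consistency_loss(g, f, x, y):
    # Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x.
    forward = torch.mean(torch.abs(f(g(x)) - x))
    # Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y.
    backward = torch.mean(torch.abs(g(f(y)) - y))
    return forward + backward
```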

3.3. Full Objective

  • The total loss is:

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ·L_cyc(G, F)

  • where λ controls the relative importance of the two objectives (λ = 10 in the paper).
  • The full objective is solved as a minimax problem:

G*, F* = arg min_{G,F} max_{D_X,D_Y} L(G, F, D_X, D_Y)
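
Putting the pieces together, a hedged sketch of the full objective, reusing the hypothetical adversarial_losses and cycle_consistency_loss helpers sketched above:

```python
LAMBDA = 10.0  # weight of the cycle consistency term (the paper uses 10)


def full_objective(g, f, d_x, d_y, x, y):
    # Adversarial terms for both mapping directions.
    loss_g_xy, loss_d_y = adversarial_losses(g, d_y, x, y)
    loss_f_yx, loss_d_x = adversarial_losses(f, d_x, y, x)

    # Shared cycle consistency term.
    loss_cyc = cycle_consistency_loss(g, f, x, y)

    # The generators minimize this combined loss; each discriminator
    # minimizes its own adversarial term.
    loss_generators = loss_g_xy + loss_f_yx + LAMBDA * loss_cyc
    return loss_generators, loss_d_x, loss_d_y
```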

3.4. Viewed As Autoencoder

  • CycleGAN can be viewed as training two “autoencoders”: one autoencoder F∘G: X→X is learnt jointly with another G∘F: Y→Y.
  • However, each has a special internal structure: it maps an image to itself via an intermediate representation that is a translation of the image into another domain.
  • Such a setup can also be seen as a special case of “adversarial autoencoders”, which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution.
  • In CycleGAN, the target distribution for the X→X autoencoder is that of domain Y.

3.5. Architecture

  • The architecture for generative networks from Johnson [23] is used, which has shown impressive results for neural style transfer and super resolution.
  • This network contains two stride-2 convolutions, several residual blocks, and two fractionally-strided convolutions with stride 1/2 .
  • 6 blocks are used for 128×128 images and 9 blocks are used for 256×256 and higher resolution training images.
  • Similar to Johnson’s, instance normalization is used.
  • For the discriminator networks, 70×70 PatchGANs [22, 30, 29] are used, which aim to classify whether 70×70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion.
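
To make the architecture concrete, here is a minimal PyTorch sketch of the two pieces described above: a residual block as used in the Johnson-style generator, and a 70×70 PatchGAN discriminator. The layer hyper-parameters follow commonly published CycleGAN implementations and are an assumption, not a verbatim copy of the authors’ code:

```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One of the 6/9 residual blocks in the Johnson-style generator."""

    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection


def patchgan_discriminator(in_channels=3):
    """70x70 PatchGAN: each output value judges one overlapping patch."""

    def block(c_in, c_out, stride, norm=True):
        layers = [nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(c_out))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return layers

    return nn.Sequential(
        *block(in_channels, 64, 2, norm=False),
        *block(64, 128, 2),
        *block(128, 256, 2),
        *block(256, 512, 1),
        # One logit per patch; fully convolutional, so any input size works.
        nn.Conv2d(512, 1, 4, stride=1, padding=1),
    )
```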

3.6 Implementation

  • For training stability, the negative log likelihood objective is replaced by a least-squares loss.
  • To train G, the following loss is minimized: E_{x∼p_data(x)}[(D_Y(G(x)) − 1)²].
  • To train D_Y, the following loss is minimized: E_{y∼p_data(y)}[(D_Y(y) − 1)²] + E_{x∼p_data(x)}[D_Y(G(x))²].
  • Discriminators are updated using a history of generated images rather than the ones produced by the latest generators. An image buffer is kept that stores the 50 previously created images.
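
A hedged sketch of such an image history buffer (the strategy from Shrivastava et al. adopted by CycleGAN; the class and method names here are illustrative). With probability 1/2 a newly generated image replaces a stored one and the old image is shown to the discriminator instead:

```python
import random

import torch


class ImageBuffer:
    """Pool of previously generated images shown to the discriminator."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self.images = []

    def query(self, batch):
        out = []
        for img in batch:
            img = img.detach().unsqueeze(0)  # no gradient through history
            if len(self.images) < self.capacity:
                self.images.append(img)
                out.append(img)
            elif random.random() < 0.5:
                # Return an old image and store the new one in its place.
                i = random.randrange(self.capacity)
                out.append(self.images[i])
                self.images[i] = img
            else:
                out.append(img)
        return torch.cat(out, dim=0)
```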

4. Ablation Study

FCN-scores for different CycleGAN variants on Cityscapes labels→photo
Classification performance of photo→labels for different CycleGAN variants
  • Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss.
  • CycleGAN with the cycle loss in only one direction is also tried; this often incurs training instability and causes mode collapse.
Different variants of CycleGAN for mapping labels↔photos, trained on Cityscapes
  • Both Cycle alone and GAN+backward fail to produce images similar to the target domain.
  • GAN alone and GAN+forward suffer from mode collapse, producing identical label maps regardless of the input photo.

5. Quantitative Evaluation

AMT “real vs fake” test on maps→aerial photos
  • “real vs fake” perceptual studies are run on Amazon Mechanical Turk (AMT), with 25 participants.
  • All the baselines almost never fooled participants.

CycleGAN can fool participants on around a quarter of trials.

FCN-scores for different methods, evaluated on Cityscapes labels→photo.
Classification performance of photo→labels for different methods on Cityscapes
  • The FCN predicts a label map for a generated photo, which is then compared against the ground-truth labels using standard semantic segmentation metrics.
  • It is noted that Pix2Pix works on paired data, which can be treated as upper-bound performance.

In both cases, CycleGAN again outperforms the baselines: e.g. CoGAN, BiGAN, ALI and SimGAN.

6. Qualitative Results

Different methods for mapping labels→photos trained on Cityscapes images.
Different methods for mapping aerial photos↔maps on Google Maps
Collection style transfer
Other translation problems
  • There are many more impressive results in the paper. Please feel free to read the paper if interested.

Reference

[2017 ICCV] [CycleGAN]
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN] [CoGAN] [SimGAN] [BiGAN] [ALI]
Image-to-image Translation [Pix2Pix] [UNIT] [CycleGAN]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]

My Other Previous Paper Readings
