Review — CycleGAN: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (GAN)

Using Cycle Consistency Loss for Unpaired Image-to-Image Translation, Outperforms CoGAN, BiGAN, ALI & SimGAN

Sik-Ho Tsang
May 15, 2021
CycleGAN learns to automatically “translate” an image from one domain into the other

In this story, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, (CycleGAN), by Berkeley AI Research (BAIR) laboratory, UC Berkeley, is reviewed.

For many tasks, paired training data will not be available.

In this paper:

  • CycleGAN is designed to translate an image from a source domain X to a target domain Y in the absence of paired examples, i.e. it learns a mapping G: X→Y.
  • Because this mapping is highly under-constrained, it is coupled with an inverse mapping F: Y→X, and a cycle consistency loss is introduced to enforce F(G(x)) ≈ x (and vice versa).

This is a paper in 2017 ICCV with over 8300 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Paired & Unpaired Training Data
  2. Cycle Consistency
  3. CycleGAN
  4. Ablation Study
  5. Quantitative Evaluation
  6. Qualitative Results

1. Paired & Unpaired Training Data

Paired training data (left) and Unpaired training data (right)
  • Paired training data consists of training examples {x_i, y_i}, where a correspondence between x_i and y_i exists.
  • Most supervised learning is applied to paired training data. However, obtaining paired training data can be difficult and expensive.
  • Unpaired training data consists of a source set {x_i} (x_i ∈ X) and a target set {y_j} (y_j ∈ Y), with no information provided as to which x_i matches which y_j.

CycleGAN seeks to learn to translate between domains without paired input-output examples.
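
To make the unpaired setting concrete, here is a minimal data-loading sketch (PyTorch assumed; the folder paths and class name are hypothetical). Each sample draws x and y independently, so no correspondence between the two domains is ever assumed:

```python
# Minimal sketch of an unpaired two-domain dataset (hypothetical paths).
import random
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class UnpairedDataset(Dataset):
    def __init__(self, root_x, root_y, transform=None):
        self.files_x = sorted(Path(root_x).glob("*.jpg"))  # source domain X
        self.files_y = sorted(Path(root_y).glob("*.jpg"))  # target domain Y
        self.transform = transform

    def __len__(self):
        return max(len(self.files_x), len(self.files_y))

    def __getitem__(self, idx):
        x = Image.open(self.files_x[idx % len(self.files_x)]).convert("RGB")
        # Sample y at random, so x and y are never treated as a pair.
        y = Image.open(random.choice(self.files_y)).convert("RGB")
        if self.transform:
            x, y = self.transform(x), self.transform(y)
        return {"x": x, "y": y}
```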

2. Cycle Consistency

  • A mapping G: X→Y should be learnt such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y.
  • The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y.
  • Yet, infinitely many mappings G can induce the same output distribution, so this objective alone is difficult to optimize; standard procedures often lead to the well-known problem of mode collapse, where all inputs map to the same output image.
  • A further property is therefore exploited: translation should be “cycle consistent”.

Mathematically, if we have a translator G: X→Y and another translator F: Y→X, then G and F should be inverses of each other.

A cycle consistency loss is added that encourages F(G(x))≈x and G(F(y))≈y.

Combining this loss with adversarial losses on domains X and Y yields the full objective for unpaired image-to-image translation.

3. CycleGAN

(a) CycleGAN containing two mapping functions G and F, (b) forward cycle-consistency loss, (c) backward cycle-consistency loss

3.1. Adversarial Loss

  • For the mapping function G: X→Y and its discriminator D_Y, the adversarial loss is:

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))]

  • where G tries to generate images G(x) that look similar to images from domain Y, while D_Y aims to distinguish between translated samples G(x) and real samples y.
  • A similar adversarial loss is introduced for the mapping function F: Y→X and its discriminator D_X.
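
As a minimal sketch of this loss (PyTorch assumed; g, d_y, x, y are hypothetical stand-ins, and d_y is assumed to return raw logits), the non-saturating form could look like the following. Note that the paper later replaces this log-likelihood form with a least-squares loss (Section 3.6):

```python
import torch
import torch.nn.functional as F


def adversarial_losses(g, d_y, x, y):
    """Adversarial loss for one direction, e.g. G: X -> Y with critic D_Y."""
    fake_y = g(x)                       # translated sample G(x)
    real_logits = d_y(y)
    fake_logits = d_y(fake_y.detach())  # detach: D update must not touch G

    # D_Y: classify real y as 1 and translated G(x) as 0.
    loss_d = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits)
    ) + F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits)
    )

    # G: fool D_Y into classifying G(x) as real (non-saturating form).
    gen_logits = d_y(fake_y)
    loss_g = F.binary_cross_entropy_with_logits(
        gen_logits, torch.ones_like(gen_logits)
    )
    return loss_g, loss_d
```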

3.2. Cycle Consistency Loss

  • Adversarial losses alone cannot guarantee that the learned function maps an individual input x_i to a desired output y_i.
  • It is argued that the learned mapping functions should be cycle-consistent.
  • Forward cycle consistency: for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x→G(x)→F(G(x))≈x.
  • Similarly, backward cycle consistency: y→F(y)→G(F(y))≈y.
  • This behaviour is encouraged with an L1 cycle consistency loss:

L_cyc(G, F) = E_{x∼p_data(x)}[‖F(G(x)) − x‖₁] + E_{y∼p_data(y)}[‖G(F(y)) − y‖₁]
The input images x, output images G(x) and the reconstructed images F(G(x))
  • The reconstructed images F(G(x)) end up matching closely to the input images x.
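
A minimal sketch of this L1 cycle consistency loss (PyTorch assumed; g, f, x, y are hypothetical stand-ins for the two generators and a batch from each domain):

```python
import torch


def cycle_consistency_loss(g, f, x, y):
    # Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x.
    forward = torch.mean(torch.abs(f(g(x)) - x))
    # Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y.
    backward = torch.mean(torch.abs(g(f(y)) - y))
    return forward + backward
```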

3.3. Full Objective

  • The total loss is:

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ·L_cyc(G, F)

  • where λ controls the relative importance of the two objectives (λ = 10 in the paper).
  • The full objective is solved as a minimax problem:

G*, F* = arg min_{G,F} max_{D_X,D_Y} L(G, F, D_X, D_Y)
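
Putting the pieces together, a hedged sketch of the full objective, reusing the hypothetical adversarial_losses and cycle_consistency_loss helpers sketched above:

```python
LAMBDA = 10.0  # weight of the cycle consistency term (the paper uses 10)


def full_objective(g, f, d_x, d_y, x, y):
    # Adversarial terms for both mapping directions.
    loss_g_xy, loss_d_y = adversarial_losses(g, d_y, x, y)
    loss_f_yx, loss_d_x = adversarial_losses(f, d_x, y, x)

    # Shared cycle consistency term.
    loss_cyc = cycle_consistency_loss(g, f, x, y)

    # The generators minimize this combined loss; each discriminator
    # minimizes its own adversarial term.
    loss_generators = loss_g_xy + loss_f_yx + LAMBDA * loss_cyc
    return loss_generators, loss_d_x, loss_d_y
```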

3.4. Viewed As Autoencoder

  • CycleGAN can be viewed as training two “autoencoders”: one autoencoder F∘G: X→X is learnt jointly with another G∘F: Y→Y.
  • However, each has a special internal structure: it maps an image to itself via an intermediate representation that is a translation of the image into another domain.
  • Such a setup can also be seen as a special case of “adversarial autoencoders”, which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution.
  • In CycleGAN, the target distribution for the X→X autoencoder is that of domain Y.

3.5. Architecture

  • The architecture for generative networks from Johnson [23] is used, which has shown impressive results for neural style transfer and super resolution.
  • This network contains two stride-2 convolutions, several residual blocks, and two fractionally-strided convolutions with stride 1/2 .
  • 6 blocks are used for 128×128 images and 9 blocks are used for 256×256 and higher resolution training images.
  • Similar to Johnson’s, instance normalization is used.
  • For the discriminator networks, 70×70 PatchGANs [22, 30, 29] are used, which aim to classify whether 70×70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion.
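
To make the architecture concrete, here is a minimal PyTorch sketch of the two pieces described above: a residual block as used in the Johnson-style generator, and a 70×70 PatchGAN discriminator. The layer hyper-parameters follow commonly published CycleGAN implementations and are an assumption, not a verbatim copy of the authors’ code:

```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    """One of the 6/9 residual blocks in the Johnson-style generator."""

    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, channels, kernel_size=3),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection


def patchgan_discriminator(in_channels=3):
    """70x70 PatchGAN: each output value judges one overlapping patch."""

    def block(c_in, c_out, stride, norm=True):
        layers = [nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(c_out))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return layers

    return nn.Sequential(
        *block(in_channels, 64, 2, norm=False),
        *block(64, 128, 2),
        *block(128, 256, 2),
        *block(256, 512, 1),
        # One logit per patch; fully convolutional, so any input size works.
        nn.Conv2d(512, 1, 4, stride=1, padding=1),
    )
```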

3.6 Implementation

  • For training stability, the negative log likelihood objective is replaced by a least-squares loss.
  • To train G, the following loss is minimized: E_{x∼p_data(x)}[(D_Y(G(x)) − 1)²].
  • To train D_Y, the following loss is minimized: E_{y∼p_data(y)}[(D_Y(y) − 1)²] + E_{x∼p_data(x)}[D_Y(G(x))²].
  • Discriminators are updated using a history of generated images rather than the ones produced by the latest generators. An image buffer is kept that stores the 50 previously created images.
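
A hedged sketch of such an image history buffer (the strategy from Shrivastava et al. adopted by CycleGAN; the class and method names here are illustrative). With probability 1/2 a newly generated image replaces a stored one and the old image is shown to the discriminator instead:

```python
import random

import torch


class ImageBuffer:
    """Pool of previously generated images shown to the discriminator."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self.images = []

    def query(self, batch):
        out = []
        for img in batch:
            img = img.detach().unsqueeze(0)  # no gradient through history
            if len(self.images) < self.capacity:
                self.images.append(img)
                out.append(img)
            elif random.random() < 0.5:
                # Return an old image and store the new one in its place.
                i = random.randrange(self.capacity)
                out.append(self.images[i])
                self.images[i] = img
            else:
                out.append(img)
        return torch.cat(out, dim=0)
```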

4. Ablation Study

FCN-scores for different CycleGAN variants on Cityscapes labels→photo
Classification performance of photo→labels for different CycleGAN variants
  • Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss.
  • CycleGAN with the cycle loss in only one direction is also tried; this often incurs training instability and causes mode collapse.
Different variants of CycleGAN for mapping labels↔photos, trained on Cityscapes
  • Both Cycle alone and GAN+backward fail to produce images similar to the target domain.
  • GAN alone and GAN+forward suffer from mode collapse, producing identical label maps regardless of the input photo.

5. Quantitative Evaluation

AMT “real vs fake” test on maps→aerial photos
  • “real vs fake” perceptual studies are run on Amazon Mechanical Turk (AMT), with 25 participants.
  • All the baselines almost never fooled participants.

CycleGAN can fool participants on around a quarter of trials.

FCN-scores for different methods, evaluated on Cityscapes labels→photo.
Classification performance of photo→labels for different methods on Cityscapes
  • The FCN predicts a label map for a generated photo, which is then compared against the ground-truth labels using standard semantic segmentation metrics.
  • It is noted that Pix2Pix works on paired data, which can be treated as upper-bound performance.

In both cases, CycleGAN again outperforms the baselines: e.g. CoGAN, BiGAN, ALI and SimGAN.

6. Qualitative Results

Different methods for mapping labels→photos trained on Cityscapes images.
Different methods for mapping aerial photos↔maps on Google Maps
Collection style transfer
Other translation problems
  • There are many more impressive results in the paper. Please feel free to read the paper if interested.

Reference

[2017 ICCV] [CycleGAN]
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN] [CoGAN] [SimGAN] [BiGAN] [ALI]
Image-to-image Translation [Pix2Pix] [UNIT] [CycleGAN]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Camera Tampering Detection [Mantini’s VISAPP’19]
Video Coding [VC-LAPGAN] [Zhu TMM’20] [Zhong ELECGJ’21]

My Other Previous Paper Readings
