Review — Let’s Get Dirty: GAN Based Data Augmentation for Camera Lens Soiling Detection in Autonomous Driving

DirtyGAN is Proposed to Generate Soiled Images Using VAE and CycleGAN

Sik-Ho Tsang
6 min read · Aug 11, 2021

In this story, Let’s Get Dirty: GAN Based Data Augmentation for Camera Lens Soiling Detection in Autonomous Driving (DirtyGAN), by Valeo, is reviewed. In this paper:

  • A novel GAN-based algorithm, DirtyGAN, is proposed for generating unseen soiling patterns, increasing the number of images available for training.
  • The soiling masks are automatically generated, which eliminates the manual annotation cost.

This is a paper in 2021 WACV. (Sik-Ho Tsang @ Medium)

Outline

  1. Motivations
  2. Baseline Approach Using CycleGAN
  3. Proposed DirtyGAN & Dirty Dataset
  4. Experimental Results

1. Motivations

1.1. The Need of GAN

Several examples from the WoodScape Dataset (the background is marked in black)
  • There is only limited prior work on general soiling detection.
  • Collecting a diverse soiling dataset is difficult because soiling is a relatively rare event.
  • Soiling annotation is tedious, and soiling boundaries are difficult to annotate precisely.
  • Recording soiled data is inconvenient.
  • A clean version of the same images is needed for a fair comparison.

1.2. Some Limitations of CycleGAN

  • The main problem of CycleGAN is that it modifies the whole image, which leads to undesired artifacts.
  • Due to GPU memory requirements and time constraints, the CycleGAN training uses rescaled images (1/4 of both width and height).

Thus, a better GAN-based approach is needed to generate soiled images.

2. Baseline Approach Using CycleGAN

2.1. Segmentation Network M

Clean: dark gray, Opaque: white, Semi-transparent: light gray
  • First, a soiling semantic segmentation network M is trained using the weak polygonal annotations of soiling, as shown in the image above.
  • Then, the per-pixel class outputs shown above (clean, semi-transparent, opaque) are obtained from the segmentation values; a minimal sketch follows this list.
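As an illustration of this step, here is a minimal PyTorch sketch of turning a 3-class soiling segmentation output into the per-pixel class map and grayscale visualization above; the class indices, tensor shapes, and gray levels are my assumptions, not the paper’s code:

```python
import torch

# Hypothetical output of the soiling segmentation network M for one image:
# logits over 3 classes (0 = clean, 1 = semi-transparent, 2 = opaque).
logits = torch.randn(1, 3, 128, 128)  # stand-in for M(image), shapes assumed

# Per-pixel class map via argmax over the class dimension.
pred = logits.argmax(dim=1)  # (1, 128, 128), values in {0, 1, 2}

# Grayscale visualization matching the figure: clean = dark gray,
# semi-transparent = light gray, opaque = white.
palette = torch.tensor([64, 192, 255], dtype=torch.uint8)
vis = palette[pred]  # (1, 128, 128) uint8 image
```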

2.2. Super Resolution Network U

  • After that, a super-resolution network U is trained to super-resolve (up-sample) the GAN-generated image back to the original image resolution (i.e., an up-scaling factor of 4); a sketch of such a network follows below.
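The paper does not detail U’s architecture, so the following is only a minimal sketch of a plausible 4× super-resolution network using sub-pixel convolution (PixelShuffle); all layer choices here are assumptions:

```python
import torch
import torch.nn as nn

class SubPixelSR(nn.Module):
    """Minimal 4x super-resolution sketch (the paper does not detail U's
    architecture; a sub-pixel convolution upsampler is assumed here)."""
    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, channels * scale**2, 3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a 4x larger grid
        )
    def forward(self, x):
        return self.net(x)

U = SubPixelSR()
low_res = torch.randn(1, 3, 240, 320)  # 1/4-scale GAN output
high_res = U(low_res)                  # (1, 3, 960, 1280)
```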

2.3. GAN Network CycleGAN

  • First, with the clean image I, CycleGAN is used to generate the soiled image Is.
  • Then, the trained segmentation network M is used to segment the soiled image just generated by CycleGAN, obtaining M(Is).
  • Gaussian smoothing γ is applied to the segmentation output M(Is), i.e., m = γ(M(Is)).
  • For the soiling mask m, 0 means background and 1 means soiling. The intermediate values are semi-transparent soiling.
  • The trained super-resolution network U is used to upsample both the generated image and the mask, since they were downsized.
  • (U is used for both the generated images and the smoothed segmentation masks. So what training set is used to train this U?)
  • Finally, the resulting artificially “soiled” image is obtained by a convex combination of the original image and the generated soiling via the segmented mask, as sketched below:
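From the definitions above, the composition is a per-pixel convex combination Î = (1 − m) ⊙ I + m ⊙ U(Is), with the smoothed mask m = γ(M(Is)) ∈ [0, 1] (also upsampled by U). Below is a minimal PyTorch sketch of this blending step; the function and variable names are mine, not the paper’s:

```python
import torch
import torchvision.transforms.functional as TF

def blend_soiling(clean, soiled_up, mask_up, sigma=5.0):
    """Per-pixel convex combination of the clean image and the
    generated soiling: out = (1 - m) * clean + m * soiled."""
    # Gaussian smoothing (the paper's gamma) applied to the soft mask.
    m = TF.gaussian_blur(mask_up, kernel_size=31, sigma=sigma)
    m = m.clamp(0.0, 1.0)  # 0 = background, 1 = soiling, in-between = semi-transparent
    return (1.0 - m) * clean + m * soiled_up

clean = torch.rand(1, 3, 960, 1280)      # original full-resolution image I
soiled_up = torch.rand(1, 3, 960, 1280)  # U(Is): upsampled CycleGAN output
mask_up = torch.rand(1, 1, 960, 1280)    # upsampled segmentation mask
out = blend_soiling(clean, soiled_up, mask_up)
```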

2.4. The Limitations of Baseline Approach

  • This simple pipeline has certain limitations.
  • The biggest one is that it cannot be expected to work smoothly for soiling types caused by water (e.g., raindrops, snow).
  • CycleGAN regenerates the whole image, affecting all pixels.
  • There is also no control over the soiling pattern production process.
  • There is no visual clue for where the soiling should be produced.

One solution is to formulate a CycleGAN-like approach that changes only those parts of the image corresponding to the soiling pattern and keeps the rest unchanged.

3. Proposed DirtyGAN & Dirty Dataset

  • Thus, DirtyGAN is proposed: a Variational AutoEncoder (VAE) guides the soiling pattern generation process, and CycleGAN is modified so that it applies only to the masked regions of the source and target domain images.

3.1. VAE

Variational AutoEncoder and the walk on the soiling manifold
  • By using the encoder of the trained VAE, we can obtain the projection of an actual sample from the dataset to a lower-dimensional representation.
  • If we select two samples z1 and z2 that are close on the soiling pattern manifold, we can obtain a novel sample z by taking their convex combination: z = αz1 + (1 − α)z2, where α ∈ [0, 1].
  • This intermediate representation z can be used to reconstruct the corresponding soiling pattern.
  • The benefit of using sampling from the learned VAE is that we could even use it to create animated masks, e.g., to mimic dynamic soiling effects, such as water drops in heavy rain.
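A minimal PyTorch sketch of this latent-space interpolation; the tiny VAE below is a hypothetical stand-in (the paper’s architecture is not reproduced here), and only the convex-combination mechanics match the description above:

```python
import torch
import torch.nn as nn

class MaskVAE(nn.Module):
    """Toy VAE over 64x64 soiling masks (architecture assumed)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 64 * 64), nn.Sigmoid(), nn.Unflatten(1, (1, 64, 64))
        )
    def encode(self, x):  # project a mask to the latent manifold (posterior mean)
        return self.mu(self.enc(x))
    def decode(self, z):
        return self.dec(z)

vae = MaskVAE()
m1, m2 = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)  # two real soiling masks
z1, z2 = vae.encode(m1), vae.encode(m2)

alpha = 0.5                          # any value in [0, 1]
z = alpha * z1 + (1.0 - alpha) * z2  # convex combination on the manifold
novel_mask = vae.decode(z)           # in-between soiling pattern

# Sweeping alpha from 0 to 1 yields a smooth sequence of masks, which is how
# animated soiling (e.g., water drops in heavy rain) could be mimicked.
```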

3.2. DirtyGAN

DirtyGAN Framework
  • After training the VAE, CycleGAN is restricted to the masked regions only: either the VAE-generated mask, for the “clean” → “soiled” translation, or the mask obtained from the soiling semantic segmentation network M. A minimal sketch of this masked composition follows.
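The sketch below illustrates the masked translation idea (the function, names, and the simple mask-based compositing are assumptions for illustration, not the paper’s code):

```python
import torch

def masked_translate(G, x, mask):
    """Apply a generator G only inside the masked region; the rest of
    the image is kept untouched."""
    y = G(x)                            # full generator pass
    return mask * y + (1.0 - mask) * x  # composite restricted to the mask

# `mask` is the VAE-generated mask for clean -> soiled, or the
# segmentation mask M(x) for the reverse soiled -> clean direction.
G = torch.nn.Identity()                 # stand-in for the CycleGAN generator
x = torch.rand(1, 3, 256, 256)
mask = (torch.rand(1, 1, 256, 256) > 0.5).float()
fake = masked_translate(G, x, mask)
```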

3.3. Dirty Dataset

Examples of the generated images from the Dirty WoodScape Dataset
  • The data generated by DirtyGAN, with updated annotations, will be released as a WoodScape Dataset companion under the name Dirty WoodScape Dataset.
  • The above figures show the generated images.
  • White pixels represent the soiling, while black pixels represent the background.
Examples of the generated images from the Dirty Cityscapes Dataset
  • The above images are generated dirty Cityscapes images.
  • However, at the time of writing, the Google Drive link provided on the dataset website shows only an empty directory.
  • (Hopefully the authors will release their dataset in the future. It is also unclear whether the Dirty Cityscapes Dataset will be shared publicly.)

4. Experimental Results

4.1. Data Augmentation for Soiling Detection Task

Comparison of Soiling Segmentation model trained on generated and real soiled images
  • The segmentation model used in the experiment has a ResNet50 encoder and an FCN8 decoder (a minimal stand-in is sketched after this list).
  • Binary cross-entropy was used as the loss function.
  • Accuracy is computed on a real test dataset with 2,000 images.
  • When a soiling segmentation network is trained on real soiling data only (8k images), 73.95% accuracy is obtained.
  • When trained using standard data augmentation techniques (flipping, contrast changes), 78.20% is achieved.
  • When trained with the images generated by DirtyGAN, the network’s performance increased to 91.71% on real soiling test data, an improvement of 17.76 percentage points, without the need for costly annotations and real soiling scene captures.
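For reference, a rough stand-in for this setup using torchvision; its FCN head on a ResNet50 backbone is close to, but not exactly, an FCN8 decoder, and the single-channel BCE setup and tensor shapes are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50

# ResNet50 encoder + FCN decoder (approximation of the paper's FCN8 head).
model = fcn_resnet50(weights=None, weights_backbone=None, num_classes=1)

images = torch.rand(2, 3, 480, 640)                      # dummy batch
targets = torch.randint(0, 2, (2, 1, 480, 640)).float()  # dummy soiling masks

logits = model(images)["out"]                   # (2, 1, 480, 640)
loss = nn.BCEWithLogitsLoss()(logits, targets)  # binary cross-entropy, as in the paper
loss.backward()
```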

4.2. Artificial Soiling Quality

  • Human participants, ranging from people with absolutely no domain knowledge to people working with soiling data daily, were asked to classify the presented images as either real or fake.
  • The non-expert participants were not able to distinguish real images from fakes.
  • The expert participants were sometimes able to spot a difference in the soiling pattern, and they occasionally noticed small artifacts, e.g., blurry texture.
  • In general, it can be said that the image quality of the generated artificial soiling is satisfactory when judged by human inspection in 95% of scenarios.

4.3. Degradation Effect on Semantic Segmentation

  • The dataset consists of 10k pairs of images (clean and synthetically soiled using the proposed baseline framework).
  • Train/Test: 80:20.
  • Two DeepLabv3+ are trained on the clean and soiled images, respectively.
mIoU (%) on our WoodScape dataset using DeepLabv3+
  • A segmentation model trained on clean images records 56.6% mIoU on clean test data but only 34.8% on soiled data, a drop of 21.8 percentage points.
  • This significant drop shows that soiling can cause severe degradation.
  • However, the model trained on the soiled images shows only a 4% degradation on clean test data compared to the baseline trained and evaluated on clean images.
mIoU (%) on our Cityscapes dataset using DeepLabv3+
  • A similar trend is observed on the Cityscapes dataset, as shown above.
  • As seen, using the DirtyGAN-generated dataset, the model can interpolate segmentation classes in the occluded soiled parts.

