Review — SoilingNet: Soiling Detection on Automotive Surround-View Cameras

Tile-Level Classification Soiling Detection Using SoilingNet; CycleGAN for Data Augmentation

Sik-Ho Tsang
Aug 4, 2021
Illustration of Soiled Camera
Soil Detected, Triggered Camera Cleaning System

In this story, SoilingNet: Soiling Detection on Automotive Surround-View Cameras, (SoilingNet), by Valeo, is reviewed.

  • Cameras are easily soiled. With soiling detection, the autonomous driving system can alert the driver, or a camera cleaning system can be triggered to clean the camera. With higher accuracy, less water is wasted on false alarms, so the water tank needs refilling less often.

In this paper:

  • A new dataset with multiple types of soiling, namely opaque and transparent, is created, which will be released publicly as part of the WoodScape dataset.
  • CycleGAN is used to generate more images for data augmentation.
  • Multi-Task CNN is used for soiling detection.

This is a paper in 2019 ITSC. (Sik-Ho Tsang @ Medium)


  1. Formal Definition
  2. Dataset Creation
  3. GAN Based Data Augmentation (CycleGAN)
  4. CNN Based Soiling Detection (Multi-Task CNN)
  5. Experimental Results

1. Formal Definition

1.1. Class Definition

  • The camera soiling detection task is defined as a multilabel classification problem.
  • Each image can be described by a binary indicator array, where zeros and ones correspond to the absence or presence of a specific soiling class, respectively.
  • The soiling classes we define are C = {opaque, transparent}.
  • Opaque: Regions covered by non-transparent material that prevents seeing what is behind.
  • Transparent: Regions that are visibly blurred or deformed relative to the expected appearance, although the colors of the original scene can still be distinguished, i.e. it is possible to see “behind”.
  • It may seem unnatural, but water soiling can also be “opaque”.
  • Given a single image I, we are interested in a classifier g : I → {0, 1}^|C|, where C = {opaque, transparent} denotes the set of class labels. Each output entry is binary and specifies whether the given type of soiling (i.e. opaque or transparent) is present (value 1) or not (value 0).
  • A vector c1 = [0, 0] then denotes a clean image, while c2 = [1, 1] denotes an image with both types of soiling present.
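
The binary indicator encoding above can be sketched in a few lines of Python. This is only an illustration: the names `CLASSES`, `encode` and `decode` are hypothetical, and the bit order (opaque first, transparent second) is assumed from the [1,0] / [0,1] labels used later in the dataset section.

```python
# Assumed class order: position 0 = opaque, position 1 = transparent.
CLASSES = ("opaque", "transparent")


def encode(present):
    """Return the binary indicator array for a set of soiling class names."""
    unknown = set(present) - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown soiling class: {sorted(unknown)}")
    return [1 if name in present else 0 for name in CLASSES]


def decode(indicator):
    """Return the set of soiling class names flagged in an indicator array."""
    return {name for name, bit in zip(CLASSES, indicator) if bit}


c1 = encode(set())                        # clean image
c2 = encode({"opaque", "transparent"})    # both soiling types present
print(c1, c2)
```

With this encoding, the four possible vectors correspond to clean, opaque only, transparent only, and opaque plus transparent.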

1.2. Hamming Distance as Evaluation Metric

  • Misclassification is quantified by a per-image error εcat(cGT, c), based on the Hamming distance between the binary indicator array of the manually annotated ground truth and that of the classifier's prediction.
  • Here, c and cGT denote the predicted and the ground-truth binary encoded categories, respectively, and ham(.,.) denotes the Hamming distance.
  • The overall error is measured as the average value of εcat(cGT, c) over the whole testing set.
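
As a rough illustration of this metric, the sketch below computes the Hamming distance between two indicator arrays and averages it over a test set. It assumes the per-image error is the plain, unnormalized Hamming distance; the paper's exact formula (for example, any normalization by the number of classes) is not reproduced here, and the function names are hypothetical.

```python
def ham(a, b):
    """Hamming distance: number of positions where two equal-length arrays differ."""
    if len(a) != len(b):
        raise ValueError("indicator arrays must have the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_error(ground_truth, predictions):
    """Average per-image Hamming error over a test set (assumed unnormalized)."""
    if not ground_truth or len(ground_truth) != len(predictions):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(ham(g, p) for g, p in zip(ground_truth, predictions))
    return total / len(ground_truth)


gt = [[0, 0], [1, 0], [1, 1]]
pred = [[0, 0], [0, 1], [1, 0]]
print(mean_error(gt, pred))  # per-image distances are 0, 2 and 1
```

Under this reading, confusing opaque-only with transparent-only costs 2, while missing one of two soiling types costs 1.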

(Although the Hamming distance is said to be used, it seems that the authors simply treat clean, opaque soiling only, transparent soiling only, and opaque+transparent soiling as 4 classes in the experiments. Please correct me if I am wrong.)

1.3. Tile-Level Classification

Tile level ground truth generation derived from polygons
  • The above definition of image level classification can be generalized to tile-level.
  • When the tile size is 1×1, it specializes to pixel-level labelling, i.e. a semantic segmentation task.
  • When the tile size equals the image height by the image width, it becomes an image classification task.
  • In a classical feature extraction plus classifier setting, each tile can be processed independently, but with deep learning models the global context is leveraged for the output of each tile.
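
To make the tile-level generalization concrete, here is a minimal sketch that derives tile labels for one soiling class from a pixel-level mask. It assumes the annotation polygons have already been rasterized into a 0/1 mask, and it labels a tile as soiled when a hypothetical `min_fraction` of its pixels is covered; the article does not state the rule the authors actually use.

```python
def tile_labels(mask, tile_h, tile_w, min_fraction=0.5):
    """Return a 2D list of 0/1 tile labels for a 2D 0/1 pixel mask.

    A tile is labelled 1 when at least `min_fraction` of its pixels are set
    (an assumed rule, chosen only for illustration).
    """
    height, width = len(mask), len(mask[0])
    if height % tile_h or width % tile_w:
        raise ValueError("tile size must divide the mask size")
    labels = []
    for top in range(0, height, tile_h):
        row = []
        for left in range(0, width, tile_w):
            covered = sum(
                mask[y][x]
                for y in range(top, top + tile_h)
                for x in range(left, left + tile_w)
            )
            row.append(1 if covered / (tile_h * tile_w) >= min_fraction else 0)
        labels.append(row)
    return labels


mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(tile_labels(mask, 2, 2))  # 2x2 tiles
print(tile_labels(mask, 1, 1))  # 1x1 tiles reproduce the pixel mask
print(tile_labels(mask, 4, 4))  # one tile covering the whole image
```

The two extreme tile sizes reproduce the special cases listed above: 1×1 tiles give a segmentation-style mask, and a single full-size tile gives one image-level label.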

2. Dataset Creation

Soiling annotation using polygons
  • 76,448 images are collected by extracting every 10th frame from short video recordings.
  • The annotations were created manually by clicking the points of a coarse polygonal segmentation of the soiling in each image, with additional information about the quality of the soiling (transparent or opaque).
  • The images are split into non-overlapping training, validation, and test sets in a ratio of 60/20/20.
  • A stratified sampling approach is used to retain the underlying class distributions across the splits. This was done independently for the whole-frame based approach and for the tile-based approach.
  • 45,868 and 15,291 images are used for training and evaluation, respectively.
  • Training set: 22,015 clean images ([0,0]), 10,704 opaque soiling images ([1,0]), 4,623 transparent soiling images ([0,1]), 7,526 opaque+transparent soiling images ([1,1]).
  • Testing set: 7,339 clean images ([0,0]), 3,902 opaque soiling images ([1,0]), 1,541 transparent soiling images ([0,1]), 2,509 opaque+transparent soiling images ([1,1]).
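
A stratified 60/20/20 split of the kind described above could look like the following sketch. It is not the authors' procedure: the function name, the per-class rounding and the fixed seed are assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split image indices into train/val/test while keeping class proportions.

    `labels` holds one hashable class label per image, e.g. (0, 0) or (1, 1).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy label distribution, not the real dataset.
labels = [(0, 0)] * 50 + [(1, 0)] * 30 + [(0, 1)] * 10 + [(1, 1)] * 10
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting each class separately is what keeps the class mix roughly the same in every split. As a side check on the figures above, the four training counts as listed add up to 44,868 and the four testing counts to 15,291.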

A subset of 5,000 images is provided to the community and can be downloaded as part of the WoodScape dataset.

3. GAN Based Data Augmentation (CycleGAN)

Left: original clean image; Right: generated “soiled” image
Left: original “soiled” image; Right: generated clean image

3.1. Motivation

  • Getting relevant data for the soiling classification task is very tedious.
  • The first problem is finding suitable conditions under which a soiling event is likely to occur at all. This problem can be solved by manually “soiling” the camera lens.
  • Another problem is the annotation of such imagery, which is extremely expensive and time-consuming.

3.2. CycleGAN

  • GAN, particularly CycleGAN, is used to alleviate the lack of relevant data.
  • The main problem with the CycleGAN experiment is the inability to produce variable output: there is no control over the generated image.
  • MUNIT is also tried. It separates the content of an image from its style, which makes it possible to control the image appearance by changing the style provided at the image generation step.
  • Unfortunately, MUNIT training failed to converge.

In the end, CycleGAN is used. The results are shown above.

  • (But there are no details about the experimental setup, e.g. how many images are used to train the CycleGAN, or how many generated images are used for training the CNN for soiling detection.)

4. CNN Based Soiling Detection (Multi-Task CNN)

Illustration of soiling integrated into multi-task object detection and segmentation network

4.1. Baseline Network

  • A low-power embedded platform with a computational processing power of 1 Tera OPS is used.
  • Soiling detection is an additional module on top of object detection and segmentation.
  • Existing CNN features in the system are leveraged by sharing the encoder for a soiling decoder in a multi-task network.
  • The details of the baseline multi-task network are in [13]. It comprises a simplified ResNet-10-like encoder and two decoders, namely a simplified YOLOv2 for object detection and a simplified FCN8 for segmentation.
  • The new soiling decoder is added as a third task.
  • The output of the soiling decoder is tiled soiling class output.
  • The number of tiles is set at training time.
  • (For more details about the network architecture, please refer to [13].)

4.2. Two Configurations: Tile-Level and Image-Level

  • There are two configurations: one in which the tile size is 64×64, and one in which the output is at image resolution.
  • The three losses were scalarized into one loss using a weighted average and the weights were optimized by hyper-parameter tuning using grid search.
  • As an ablation study to evaluate soiling detector on its own without multi-task network, the other decoders are removed.
  • The soiling decoder has two convolution layers and a final grid-level softsign layer, which produces a soiling-type prediction for each grid cell.
  • In the image-level soiling experiments, the decoder has the same convolution layers but with a higher stride, so that the spatial dimensions are reduced more quickly. Binary cross entropy per class is used as the loss function.
  • The input image resolution is 1280×800.
  • The number of annotation samples is much higher at image level.
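
The two loss-related points above (binary cross entropy per class, and a weighted average of the three task losses) can be sketched as follows. The numbers and weights are made up for illustration; the article does not report the weights found by the grid search, and the function names are hypothetical.

```python
import math


def bce_per_class(probs, targets, eps=1e-7):
    """Binary cross entropy for each class, given predicted probabilities."""
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return losses


def scalarize(task_losses, weights):
    """Combine per-task losses into one scalar with a weighted average."""
    total_weight = sum(weights[name] for name in task_losses)
    weighted = sum(weights[name] * loss for name, loss in task_losses.items())
    return weighted / total_weight


# Hypothetical prediction for [opaque, transparent] with ground truth [1, 0].
soiling_loss = sum(bce_per_class([0.9, 0.2], [1, 0])) / 2

# Hypothetical task losses and weights (not values from the paper).
losses = {"detection": 1.0, "segmentation": 2.0, "soiling": soiling_loss}
weights = {"detection": 1.0, "segmentation": 1.0, "soiling": 2.0}
print(scalarize(losses, weights))
```

A grid search over the weights would simply repeat training with different `weights` values and keep the combination that performs best on the validation set.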

5. Experimental Results

5.1. Image Level Classification

Summary of results of Image-level soiling classification
  • Both normalized and un-normalized confusion matrices are shared.
  • There are three sets of experiments — (1) training and testing only on front and rear cameras, (2) training on front/rear cameras and testing on all cameras and (3) training on all cameras and testing on all cameras:
  1. High detection rate was obtained for each class.
  2. There is a degradation in accuracy, indicating that the network doesn’t generalize from front/rear cameras to left/right cameras.
  3. The results show that when the same network is trained on all images, it performs well and slightly better than (1) as there is better regularization.
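
For readers unfamiliar with the difference between normalized and un-normalized confusion matrices, here is a small sketch. It follows the 4-class reading noted earlier (clean, opaque, transparent, both) and assumes row normalization, i.e. each row is divided by the number of ground-truth samples of that class; the paper may use a different convention, and the toy labels below are not results from the paper.

```python
LABELS = ("clean", "opaque", "transparent", "both")


def confusion_matrix(ground_truth, predictions, labels=LABELS):
    """Un-normalized counts: rows are ground truth, columns are predictions."""
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(ground_truth, predictions):
        matrix[index[g]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so that every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


gt = ["clean", "clean", "opaque", "both"]
pred = ["clean", "opaque", "opaque", "both"]
counts = confusion_matrix(gt, pred)
print(counts)
print(normalize_rows(counts))
```

The un-normalized matrix shows how many samples each class has, while the row-normalized one makes per-class detection rates directly comparable across classes of different size.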

5.2. Tile Level Classification

Tile-level soiling classification accuracy
  • Unlike high accuracy values of image level task, precision values obtained here show that there is a lot of room for further improvement.
  • Front and rear cameras have a higher probability of getting soiled, especially in difficult scenarios, and thus they have many more samples.
Evaluation of different training strategies for image-level soiling classification
  • ImageNet pre-trained encoder is worse than soiling single-task encoder by 4% in True Positive Rate (TPR) and 2% in False Positive Rate (FPR).
  • Multi-task encoder achieves close to single-task network encoder with a 1% reduction in TPR, however this comes at a much higher system level efficiency.
  • Multi-task encoder is further improved by 3% by making use of additional GAN-generated images.
  • There are practical challenges in this work.
  1. There is no spatial or geometric structure present in soiling pattern unlike other objects.
  2. In particular, transparent soiling is challenging to detect as it subtly changes the appearance of the scene.
  3. Motion-blurred areas, which commonly occur in high-speed scenes, are often classified as transparent soiling.
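
The TPR and FPR figures used in the comparison of training strategies can be computed as in the sketch below. It assumes a binary soiled (1) versus clean (0) decision per image, which is a simplification of the multi-class setting; the sample labels are invented for illustration.

```python
def tpr_fpr(ground_truth, predictions):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    pairs = list(zip(ground_truth, predictions))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


gt = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(tpr_fpr(gt, pred))  # 3 of 4 soiled images found, 1 of 4 clean images flagged
```

In the camera cleaning use case, a lower FPR means fewer unnecessary cleaning cycles, while a higher TPR means fewer missed soiling events.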


