Review — SoildNet: Soiling Degradation Detection in Autonomous Driving

SoildNet for Soiling Detection, Formed by Dynamic Group Convolution From ResNeXt, and Channel Reordering From ShuffleNet V1

Aug 1, 2021

In this story, SoildNet: Soiling Degradation Detection in Autonomous Driving (SoildNet), by Valeo India, is reviewed.

Camera sensors are extremely prone to soiling such as rain drops, snow, dust, sand, mud and so on.

In this paper:

SoildNet (Sand, snOw, raIn/dIrt, oiL, Dust/muD) is proposed. It uses dynamic group convolution and channel reordering, making it suitable for low-power embedded systems.
Soiling is detected at tile level of size 64×64 on 1280×768 input image.
Clean, opaque soiling and transparent soiling are classified.

This is a paper from a 2019 NeurIPS Workshop. (Sik-Ho Tsang @ Medium)

Outline

Soiling Types, Classes and Dataset
SoildNet: Network Architecture
Embedded Platform Constraints
Experimental Results

1. Soiling Types, Classes and Dataset

1.1. Types of Soiling

**Different types of soiling: (a) grass, (b) fog, (c) rain drops, (d) dirt, (e) splashes of mud, (f) splashes of mud in night**

The decline in vision can be due to adverse weather conditions, which covers soiling types such as snow, rain drops and fog.
Or it can be due to other types that emerge regardless of bad weather, such as mud, grass, oil, dust and sand.

1.2. Classes of Soiling

Soiling is divided into three categories: clean, opaque and transparent.
Clean: The tile has a completely free view.
Opaque: A tile is marked as opaque when the vision is totally blocked.
Transparent: Due to the uneven distribution of the soiling objects on the camera lens, the tile does not lose visibility completely.

In this study, an input image of resolution 1280×768 is annotated per tile of size 64×64, and each tile is assigned one soiling class. A single input image can therefore contain all soiling categories across its tiles.

1.3. Dataset

It is more critical to predict a clean tile correctly, because a high number of false positives leads to cleaning an already clean camera more often than necessary.
In this experiment, a total of 144,053 sample images are used, of which 70,000 are pure clean images, meaning that all tiles are soiling free. The higher number of clean samples helps the model learn better discriminative features for the clean class, so the model tends to be biased towards clean.
4 cameras are used in the setup: Front View (FV), Rear View (RV), Mirror View Left (MVL) and Mirror View Right (MVR).
The distribution of samples across cameras is as follows: FV: 36,259; RV: 36,160; MVR: 35,435; MVL: 36,199.
A tile of size 64×64 on an input resolution of 1280×768 gives 20 tiles along the width and 12 tiles along the height, so a single sample contains 20×12 = 240 tiles in total.
The tile based class distributions are — Clean: 25,459,238; Opaque: 6,341,435; Transparent: 2,772,047.
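The tile arithmetic above can be checked with a few lines of Python. This is only a sanity check on the figures quoted in this section; the variable names are chosen here for illustration and do not come from the paper.

```python
# Tile-grid arithmetic using only the figures quoted in the text.
IMAGE_W, IMAGE_H = 1280, 768   # input resolution in pixels
TILE = 64                      # tile edge length in pixels
NUM_IMAGES = 144_053           # total number of sample images

tiles_x = IMAGE_W // TILE      # tiles along the width
tiles_y = IMAGE_H // TILE      # tiles along the height
tiles_per_image = tiles_x * tiles_y
total_tiles = NUM_IMAGES * tiles_per_image

# Tile counts per class as reported above.
class_tiles = {
    "clean": 25_459_238,
    "opaque": 6_341_435,
    "transparent": 2_772_047,
}

# Share of each class among all tiles, in percent.
class_share = {
    name: 100.0 * count / total_tiles for name, count in class_tiles.items()
}

print(tiles_x, tiles_y, tiles_per_image)
print(total_tiles, sum(class_tiles.values()))
for name, share in class_share.items():
    print(f"{name}: {share:.1f}%")
```

The three class counts add up exactly to 144,053 × 240 tiles, and clean tiles make up roughly three quarters of them, which is consistent with the bias towards the clean class described above.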

**Camera view-wise presence of all three classes**

The above table shows the classes are adequately distributed in all four camera views.
The dataset is subdivided into training, validation and test sets with partition ratios of 60%, 20% and 20% respectively.
All the samples are fisheye and in YUV420.

2. SoildNet: Network Architecture

**Proposed networks, Net-1 (Leftmost), Net-2, Net-3, Net-4, and SoildNet (rightmost). Conv: convolution, G: group size, S: stride size**

2.1. Net-1 (Baseline)

**(a) Common BGR input data layer, (b) Optimized YUV4:2:0 data layer. (Figure from** **YUVMultiNet**)

First, a base network (Net-1) is designed to take a YUV420 input image.
The Y and UV planes differ in dimension. Through convolution operations, both sets of feature maps (Y and UV) are brought down to the same dimension and concatenated to make the network single stream, as shown in (b) above.
(This idea is from YUVMultiNet: Real-time YUV multi-task CNN for autonomous driving.)
Net-1 is further refactored into 4 other networks (Net-2, Net-3, Net-4 and SoildNet).
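To make the dimension mismatch concrete, here is a minimal shape calculation. In YUV 4:2:0 each chroma plane has half the luma resolution along both axes. The kernel size, strides, padding and channel counts below are assumptions made for illustration only; the review does not state the actual Net-1 layer settings.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


H, W = 768, 1280                 # luma (Y) plane
uv_h, uv_w = H // 2, W // 2      # each chroma plane in 4:2:0 is half-size per axis

# Assumed layers (hypothetical): 3x3 kernel, padding 1,
# stride 2 on the Y branch and stride 1 on the UV branch.
y_out = (conv_out(H, 3, 2, 1), conv_out(W, 3, 2, 1))
uv_out = (conv_out(uv_h, 3, 1, 1), conv_out(uv_w, 3, 1, 1))

# Assumed channel counts (hypothetical) for the two branches.
y_channels, uv_channels = 16, 16

# Once the spatial sizes agree, the branches can be concatenated along channels.
assert y_out == uv_out
merged_shape = (y_channels + uv_channels, *y_out)
print(y_out, uv_out, merged_shape)
```

Under these assumptions both branches end at 384×640, so a channel-wise concatenation yields a single stream.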

2.2. Group Convolution (ResNeXt)

**Left:** **ResNet** **Block, Right:** **ResNeXt** **with 32 Group Convolutions (Figure from** **ResNeXt**)

The idea of group convolution is to perform the convolution operation group-wise. It was first introduced in AlexNet, where the main intention was to distribute the operations across two GPUs.
Later, in ResNeXt, this idea was used to boost accuracy.
The current work extends this concept by adding group convolution in all convolution layers (Net-2, Net-3, Net-4, SoildNet).
A static number of groups is used in Net-2 and Net-3, but group convolution was found to be not very effective for networks with low depth.
A dynamic group size (Net-4, SoildNet) is also used, which reduces the number of trainable parameters by more than a factor of two (Net-3 vs. Net-4).
Adding group convolution at all layers of the network brings another challenge, insufficient feature blending. This is overcome through channel reordering.
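The saving from group convolution follows from a simple count: with g groups, each output channel only sees C_in / g input channels, so weights and multiply-accumulates both drop by a factor of g. The sketch below illustrates this. The layer widths and group counts are hypothetical and are not the Net-3, Net-4 or SoildNet configurations, which appear only in the figure.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Weight count of a k x k convolution (bias omitted)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out


def conv_macs(c_in: int, c_out: int, k: int, out_h: int, out_w: int,
              groups: int = 1) -> int:
    """Multiply-accumulates for one forward pass of the layer."""
    return conv_params(c_in, c_out, k, groups) * out_h * out_w


# Hypothetical 3-layer stack: (input channels, output channels, groups).
# A different group count per layer mimics the "dynamic group size" idea.
layers = [(16, 32, 2), (32, 64, 4), (64, 128, 8)]

dense_total = sum(conv_params(ci, co, 3) for ci, co, _ in layers)
grouped_total = sum(conv_params(ci, co, 3, g) for ci, co, g in layers)

print(dense_total, grouped_total, dense_total / grouped_total)
```

Deeper layers usually hold most of the weights, so giving them larger group counts removes more parameters than a single static group count applied everywhere.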

2.3. Channel Reordering (ShuffleNet V1)

**Channel Reordering/Shuffle, Performed for 3 Group Convolutions (Figure from** **ShuffleNet V1**)

The concept of channel reordering is inspired by ShuffleNet V1.
When performing group convolution, the feature information is confined within each group.
To blend features across groups, the feature maps are shuffled in an ordered (not random) way, which ensures that each group in the next layer contains at least one feature map from every group of the previous layer.
Generally, a convolution layer with kernel size 1×1 can be added to blend the features across channels, but this increases the number of trainable parameters.
Net-4 contains a different number of groups at each layer.
SoildNet contains the same number of groups as Net-4, but it also includes the channel reordering method.
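The ordered shuffle can be written as "reshape to (groups, n), transpose, flatten", as in ShuffleNet V1. Below is a minimal pure-Python sketch that operates on a list of channel indices instead of real feature maps; the function name is chosen here for illustration.

```python
def channel_shuffle(channels: list, groups: int) -> list:
    """Reorder channels so every new group draws from every old group.

    Equivalent to reshaping the channel axis to (groups, n),
    transposing to (n, groups) and flattening again.
    """
    assert len(channels) % groups == 0
    n = len(channels) // groups  # channels per group
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


# 9 channels in 3 groups: old groups are [0,1,2], [3,4,5], [6,7,8].
shuffled = channel_shuffle(list(range(9)), groups=3)
new_groups = [shuffled[i:i + 3] for i in range(0, 9, 3)]
print(shuffled)
print(new_groups)
```

After the shuffle, each new group holds one channel from each of the three old groups. The operation is a fixed permutation with no weights, whereas a 1×1 convolution over C channels would add C×C weights.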

2.4. Analysis of SoildNet

**Analysis of computation complexity of all network proposals including** **ResNet-10**

The total number of operations in a network is expressed in GMACS (Giga Multiply Accumulate Operations per Second).
Relative to the baseline network, group convolution reduces the network parameters by more than 90% in both variants of SoildNet (with and without channel reordering).
The model size is reduced by a factor of more than 7.
The GMACS of SoildNet (and Net-4) is considerably lower than that of the baseline and the other network schemes.
The channel reordering technique has no influence on the number of network parameters, and no impact on the model size either.
A batch normalization layer is added between each convolution layer and its ReLU activation.
Categorical cross entropy and categorical accuracy are used as the loss and the metric respectively.
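For readers unfamiliar with the loss and metric named above, here is a minimal pure-Python sketch of categorical cross entropy and categorical accuracy for a three-class tile classifier. It uses the standard textbook definitions, not the authors' training code, and the example logits are made up.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, one_hot, eps=1e-12):
    """Loss for one tile: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))


def categorical_accuracy(batch_probs, batch_one_hot):
    """Fraction of tiles whose arg-max prediction matches the label."""
    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)

    hits = sum(argmax(p) == argmax(t)
               for p, t in zip(batch_probs, batch_one_hot))
    return hits / len(batch_probs)


# Class order assumed here: (clean, opaque, transparent).
probs = softmax([2.0, 0.5, 0.1])   # made-up logits for one tile
label = [1, 0, 0]                  # ground truth: clean
print(categorical_cross_entropy(probs, label))
```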

3. Embedded Platform Constraints

Most embedded SoCs use 16-bit fixed-point operations, so the data are quantized when a model trained on a GPU is deployed on the target device.
In this study, the SoC has a throughput of 1 TOPS and is capable of supporting 400 GMACS.
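As a rough illustration of what 16-bit fixed-point quantization does to a value, the sketch below rounds floats to a signed 16-bit integer with a fixed number of fractional bits. The choice of 8 fractional bits, the rounding mode and the saturation behaviour are all assumptions; the review does not describe the quantization scheme used on the target SoC.

```python
FRAC_BITS = 8                      # assumed number of fractional bits
SCALE = 1 << FRAC_BITS             # 2**8 = 256 steps per unit
INT16_MIN, INT16_MAX = -32768, 32767


def quantize(x: float) -> int:
    """Map a float to a signed 16-bit fixed-point integer, with saturation."""
    q = round(x * SCALE)
    return max(INT16_MIN, min(INT16_MAX, q))


def dequantize(q: int) -> float:
    """Map the fixed-point integer back to a float."""
    return q / SCALE


for value in (0.5, 0.1, -3.14159, 1000.0):
    q = quantize(value)
    print(value, q, dequantize(q))
```

Values inside the representable range come back with an error of at most half a step (1/512 here), while values outside it saturate, which is one reason accuracy can shift between the GPU model and the deployed one.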

3.1. ResNet-10

The reported GMACS of ResNet-10 is far too high to be considered for an embedded platform.
A model larger than 5 MB is questionable, especially when a higher FPS (Frames Per Second) is targeted.
The memory budget increases heavily for networks containing residual connections.
Thus, ResNet-10 is not considered further.

3.2. Group Convolution (ResNeXt)

In a GPU implementation, group convolution involves first slicing the input feature maps into a number of groups, then executing the convolution operation on each group, and finally concatenating the output feature maps of all groups.
This extra overhead does not exist in the embedded environment.
To perform group-wise convolution there, the memory addresses of the feature maps simply need to be passed in a group-wise fashion.

3.3. Channel Reordering (ShuffleNet V1)

On a GPU, channel reordering needs extra effort, since the feature maps must again be sliced so that the resulting feature maps are in the desired order.
However, this effort is completely neutralized on the hardware, since the feature maps are handled only through memory locations.
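The argument that both operations reduce to address handling on the target can be illustrated with a toy model. It assumes a flat, channel-major feature buffer in which every channel occupies one contiguous plane; the real memory layout of the SoC is not described in the review, and both function names are hypothetical.

```python
def group_offsets(num_channels: int, groups: int, plane_size: int,
                  base: int = 0) -> list:
    """Start address of every channel plane, listed group by group."""
    per_group = num_channels // groups
    return [
        [base + (g * per_group + c) * plane_size for c in range(per_group)]
        for g in range(groups)
    ]


def reordered_offsets(num_channels: int, groups: int, plane_size: int,
                      base: int = 0) -> list:
    """Addresses of the same planes, visited in channel-shuffled order."""
    per_group = num_channels // groups
    order = [g * per_group + i
             for i in range(per_group) for g in range(groups)]
    return [base + ch * plane_size for ch in order]


# Toy buffer: 6 channels, 3 groups, 4 elements per channel plane.
print(group_offsets(6, 3, plane_size=4))
print(reordered_offsets(6, 3, plane_size=4))
```

In this model neither grouping nor reordering moves any data. Only the list of addresses handed to the next layer changes, which is the sense in which the slicing and concatenation overhead of a GPU implementation disappears.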

4. Experimental Results

4.1. Quantitative Results

**Comparison of class-wise accuracy for tile level soiling degradation detection**

A few standard metrics are considered: TPR (True Positive Rate), TNR (True Negative Rate), FPR (False Positive Rate), FNR (False Negative Rate) and FDR (False Discovery Rate).
A good network should aim for higher values of TPR and TNR, and lower values of FPR, FNR and FDR.
The recipe of dynamic group convolution with channel reordering makes the network robust, learning better discriminative features for all classes equally.
SoildNet gains about 10% in TPR for the transparent class over the base network (Net-1), without degrading the performance on the other classes.
Net-2 shows promising performance for the clean class, but it fails to provide reasonable accuracy for the transparent class.
The transparent class has a comparatively low TPR across all networks, since it is often confused with the clean class.
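The five rates listed above can be derived per class from a confusion matrix by treating that class as positive and the other two as negative. The sketch below assumes this one-vs-rest reading, which the review does not spell out, and the confusion matrix values are made up for illustration; they are not results from the paper.

```python
def one_vs_rest_rates(confusion, cls):
    """Rates for class `cls`; confusion[i][j] counts tiles of true class i
    predicted as class j. Assumes the class occurs and is predicted at
    least once, otherwise a division by zero is raised."""
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[i][cls] for i in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return {
        "TPR": tp / (tp + fn),   # recall of the class
        "TNR": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "FDR": fp / (fp + tp),   # share of wrong positive predictions
    }


# Made-up counts, class order (clean, opaque, transparent).
confusion = [
    [90, 2, 8],
    [3, 95, 2],
    [20, 5, 75],
]
transparent = one_vs_rest_rates(confusion, cls=2)
print(transparent)
```

In this toy matrix most transparent errors land in the clean column, which mirrors the confusion between the two classes noted above and shows up as a lower TPR for transparent.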

The above table further summarizes the results by computing the average of class wise accuracy for each metric.
SoildNet outperforms other networks on 4 metrics.

4.2. Qualitative Results

**Examples of 64×64 tile based soiling degradation detection,** Color codes: **Green** — Clean, **Cyan** — Opaque, **Blue** — Transparent.

Reference

[2019 NeurIPS Workshop] [SoildNet]
SoildNet: Soiling Degradation Detection in Autonomous Driving

Soiling Detection / Desoiling

[Desoiling Dataset] [SoildNet]