Review — Ensemble-based Semi-supervised Learning to Improve Noisy Soiling Annotations in Autonomous Driving

Using Pseudo-Labels by Multiple Models to Refine the Manual Annotations, Generate Refined Labels with Less Noise

Left: Original Image, Middle: Manual Annotation, Right: Ensemble Refined Annotation

In this story, Ensemble-based Semi-supervised Learning to Improve Noisy Soiling Annotations in Autonomous Driving, by Independent Researcher, and Valeo Vision Systems, is reviewed. In this paper:

  • A pseudo-label-driven ensemble model is introduced to handle noisy annotations; it can quickly spot problematic annotations.
  • Significant improvement is achieved using the refined annotations.

This is a paper in 2021 ITSC. (Sik-Ho Tsang @ Medium) As mentioned in the paper, the idea is inspired by “A new ensemble learning framework for 3d biomedical image segmentation,” which is published in 2019 AAAI.


  1. Woodscape Dataset & Annotations
  2. Learning an Ensemble Classifier Using Pseudo-Label
  3. Experimental Results

1. Woodscape Dataset & Annotations

  • The soiling task was first formally defined in prior work based on the WoodScape dataset.
  • WoodScape provides segmentation annotations: per-pixel annotation of a set of four classes, namely C ∈ {opaque, semi-transparent, transparent, clean}.
  • Class color coding: Green — transparent, blue — semi-transparent, red — opaque, and original image color — clean.
  • The class opaque means that in that particular region, it is impossible to see through.
  • Semi-transparent and transparent are blurry regions where background colors are visible with diminished texture. The difference between the semi-transparent and transparent classes is the ability to recognize the object through the blur.
  • The last class is clean which indicates no soiling present.
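
To make the color coding above concrete, here is a minimal sketch of how a per-pixel class mask could be rendered over an image. The class indices and the helper name colorize are assumptions made for illustration; the source only names the four classes and their colors.

```python
# Class indices are an assumption for this sketch; the source only names the classes.
CLEAN, TRANSPARENT, SEMI_TRANSPARENT, OPAQUE = 0, 1, 2, 3

# Overlay colors (R, G, B) following the color coding listed above.
CLASS_COLORS = {
    TRANSPARENT: (0, 255, 0),       # green
    SEMI_TRANSPARENT: (0, 0, 255),  # blue
    OPAQUE: (255, 0, 0),            # red
}


def colorize(image, mask):
    """Return a copy of `image` where soiled pixels take their class color.

    image: H x W nested list of (R, G, B) tuples.
    mask:  H x W nested list of class indices.
    Clean pixels keep the original image color.
    """
    return [
        [CLASS_COLORS.get(cls, pixel) for pixel, cls in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


image = [[(10, 20, 30), (40, 50, 60)],
         [(70, 80, 90), (100, 110, 120)]]
mask = [[CLEAN, OPAQUE],
        [TRANSPARENT, SEMI_TRANSPARENT]]
overlay = colorize(image, mask)
```

Only the top-left pixel is clean here, so it is the only one that keeps its original color in overlay.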

But the annotation is noisy and inaccurate, as shown in the figure at the top.

2. Learning an Ensemble Classifier Using Pseudo-Label

2.1. Proposed Architecture Overview to Refine Annotations

Proposed Architecture
  • A two-stage approach is proposed for learning an ensemble classifier using the pseudo-labels (PLs).
  • The ensemble classifier is a neural network, consisting of two separate encoders and one decoder.
  • As shown above, the first encoder processes the input image, while the second encoder processes all channel-wise concatenated pseudo-labels to propagate the information from them.
  • The output of both encoders is concatenated and passed to the decoder which produces refined soiling segmentation annotation.
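
As a rough illustration of the second encoder's input, the sketch below concatenates several pseudo-label maps channel-wise. The one-hot encoding of each map is an assumption; the source only says the pseudo-labels are concatenated channel-wise. The function names are hypothetical.

```python
NUM_CLASSES = 4  # opaque, semi-transparent, transparent, clean


def one_hot(label_map, num_classes=NUM_CLASSES):
    """Turn an H x W map of class indices into num_classes binary H x W channels."""
    return [
        [[1 if cls == c else 0 for cls in row] for row in label_map]
        for c in range(num_classes)
    ]


def concat_pseudo_labels(pseudo_labels, num_classes=NUM_CLASSES):
    """Channel-wise concatenation: K label maps -> (K * num_classes) x H x W.

    One-hot encoding is an assumption of this sketch, not stated in the source.
    """
    channels = []
    for label_map in pseudo_labels:
        channels.extend(one_hot(label_map, num_classes))
    return channels


# Three toy 2 x 2 pseudo-labels (the reviewed setup uses nine).
pls = [
    [[0, 1], [2, 3]],
    [[0, 1], [3, 3]],
    [[0, 0], [2, 3]],
]
stacked = concat_pseudo_labels(pls)
```

Under this assumed encoding, nine pseudo-labels with four classes would give a 36-channel input to the second encoder.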

2.2. Nine Pseudo-Label (PL) Architectures

  • 9 PLs are used, in total.
  • 1: First one is the original manual annotation, which is quite rough, and error-prone.
  • 2–4: Next, 3 PLs are formed by classification via DeepLabv3+ with a ResNet-50 backbone, using 3 different inference approaches: multi-scale prediction, sliding window prediction and holistic prediction.
  • 5–7: Another 3 PLs are built analogously from the classification output of DeepLabv3+, this time using the GoogLeNet / Inception-v1 backbone.
  • 8–9: The last 2 PLs are the outputs of an in-house simple soiling segmentation network called MaskSegNet: the first takes a quarter of the original image resolution as input, and the second takes half.
  • All networks were trained using the non-public part of the WoodScape soiling data.

2.3. Two-Stage Ensemble Learning Scheme

Two-stage Ensemble Learning
  • It consists of two consecutive stages.
  • In the first stage, one of the 9 pseudo-labels is randomly selected as the correct label and the ensemble network is trained to minimize the cross-entropy loss.
  • The second stage uses the same architecture as the first stage, but this time the nearest neighbor pseudo label to the prediction of the first stage classifier is selected as the correct label. The second stage network is also trained to minimize the cross-entropy loss.
  • The ensemble network architecture is outlined in Section 2.1.
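
The target selection in the two stages can be sketched as follows. The distance used to find the "nearest neighbor" pseudo-label is an assumption here (the number of disagreeing pixels); the source does not specify the metric. All function names are hypothetical, and training itself is left out.

```python
import random


def pixel_disagreement(a, b):
    """Number of pixels where two H x W label maps differ (assumed distance)."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def stage1_target(pseudo_labels, rng):
    """Stage 1: pick one pseudo-label at random as the training target."""
    return rng.choice(pseudo_labels)


def stage2_target(pseudo_labels, stage1_prediction):
    """Stage 2: pick the pseudo-label nearest to the stage-1 prediction."""
    return min(pseudo_labels,
               key=lambda pl: pixel_disagreement(pl, stage1_prediction))


# Three toy 2 x 2 pseudo-labels (the reviewed setup uses nine).
pls = [
    [[0, 0], [0, 0]],
    [[0, 1], [2, 3]],
    [[3, 3], [3, 3]],
]
rng = random.Random(0)
t1 = stage1_target(pls, rng)   # random target for stage 1

prediction = [[0, 1], [2, 2]]  # stand-in for the stage-1 network output
t2 = stage2_target(pls, prediction)
```

In this toy case the second pseudo-label differs from the prediction in only one pixel, so it becomes the stage-2 target. In both stages, the chosen target would then be used with a cross-entropy loss.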

3. Experimental Results

3.1. Settings

  • 20K samples in total are used.
  • They were split into train/val/test datasets in the ratio of 6:2:2.
  • To evaluate the refined (i.e. ensemble generated) annotations, two PSPNet models with ResNet-50 backbone were trained.
  • One model was trained with the manual annotations and the second one with the refined annotations.
  • The networks are trained using the train/val sets and evaluated using the test set.
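
For reference, the split sizes implied by these settings can be worked out as below, assuming "20K" means exactly 20,000 samples. The helper split_counts is hypothetical.

```python
def split_counts(total, ratios):
    """Split `total` samples according to integer `ratios`, e.g. (6, 2, 2)."""
    parts = sum(ratios)
    counts = [total * r // parts for r in ratios]
    counts[0] += total - sum(counts)  # give any rounding remainder to the first split
    return counts


# Assumes exactly 20,000 samples; the source only says "20K".
train, val, test = split_counts(20_000, (6, 2, 2))
```

Under that assumption, the train, val and test sets hold 12,000, 4,000 and 4,000 samples.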

3.2. Quantitative Results

Quantitative results of the two models trained with manual or ensemble annotations.
  • Model trained on manual annotation: The mean IoU score for this model on manual annotations is 65.26%. While the model performs well on the clean and opaque classes, it has relatively low scores for the transparent and semi-transparent classes. This is understandable, as the transparent and semi-transparent classes are hard to annotate.
  • Model trained on ensemble annotation: The model trained on ensemble annotations improved by +5.43% in mean IoU score.
  • Comparison of results computed on the Intersection set: Intersection test set labels consist of the pixels that share the same class in both the manual and the ensemble-generated annotations.
  • While the IoU for the class “clean” stays the same, the IoU for the three other classes increased significantly for the model trained on the ensemble-generated annotations (transparent: +4.55%, semi-transparent: +5.39% and opaque: +2.09%).
  • Along with the IoU KPI, the overall accuracy increased as well (+3.73%).
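
The Intersection set and the IoU metric can be sketched as below. Two things are assumptions of this sketch rather than details from the source: pixels where the two annotations disagree are excluded through an IGNORE marker, and the mean is taken over the classes that actually occur.

```python
IGNORE = -1  # hypothetical marker for pixels excluded from evaluation


def intersection_labels(manual, ensemble):
    """Keep a pixel's class only where manual and ensemble annotations agree."""
    return [
        [m if m == e else IGNORE for m, e in zip(row_m, row_e)]
        for row_m, row_e in zip(manual, ensemble)
    ]


def per_class_iou(pred, target, num_classes=4):
    """IoU per class over pixels whose target is not IGNORE."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            if t == IGNORE:
                continue
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                union[p] += 1  # false positive for the predicted class
                union[t] += 1  # false negative for the target class
    return [i / u if u else None for i, u in zip(inter, union)]


def mean_iou(ious):
    """Average IoU over classes that appear in prediction or target."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


manual = [[0, 0, 1], [2, 3, 3]]
ensemble = [[0, 1, 1], [2, 2, 3]]
target = intersection_labels(manual, ensemble)

pred = [[0, 1, 1], [2, 3, 0]]  # stand-in for a model prediction
ious = per_class_iou(pred, target)
miou = mean_iou(ious)
```

In this toy example two of the six pixels are dropped because the annotations disagree, and the remaining four pixels give a mean IoU of 0.625.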

Hence, the conclusion that the ensemble generated annotations contain less label noise compared to the manual annotations is proven via the evaluation of the model performance on segmentation tasks.

3.3. Visualization Results

Left: Original Image, Middle: Manual Annotation, Right: Ensemble Refined Annotation

3.4. Manual Visual Inspection

Manual visual inspection score on test set
  • Manual inspection was done as an AB test by a team of 3 image annotation experts with the knowledge of the soiling detection task.
  • Each reviewer visually evaluated 280 images and had to select which annotation was better in their opinion. The number of images manually inspected covers 20% of the hold-out test set.

It can be observed that in 42.70% of the cases, the ensemble annotations were considered better, whereas the manual annotations were considered better in only 14.59% of the cases.

It is important to note that the 42.7% of the annotations that were found similar also include clean images.

In summary, the paper applies the idea of “A new ensemble learning framework for 3d biomedical image segmentation” from 3D biomedical image segmentation to the soiling problem in autonomous driving.


