Review — Budget-aware Semi-Supervised Semantic and Instance Segmentation

Pseudo Labeling for Semi-Supervised Image Segmentation

Mar 15, 2023

Budget-aware Semi-Supervised Semantic and Instance Segmentation,
Bellver CVPRW’19, by Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya,
2019 CVPRW, Over 20 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Semantic Segmentation, Instance Segmentation

Generally, the annotation burden is mitigated by labeling datasets with weaker forms of supervision, e.g. image-level labels or bounding boxes. Another option is a semi-supervised setting.
In this paper, with a very simple pipeline, semi-supervised methods outperform weakly-supervised ones by a wide margin at low annotation budgets (reduced labeling cost), for both semantic and instance segmentation.

Outline

Annotation Cost
Pseudo-Labeling for Semantic/Instance Segmentation
Semantic Segmentation Results
Instance Segmentation Results
Training with Heterogeneous Annotations (W-RSIS)

1. Annotation Cost

**Average annotation cost per image when using different types of supervision.**

On the Pascal VOC dataset:

on average 1.5 class categories are present in each image,
on average there are 2.8 objects per image, and
there is a total of 20 class categories.

Four levels of supervision are considered on Pascal VOC: image-level labels, image-level labels + object counts, bounding boxes, and full supervision. The summary is as shown above.

1.1. Image-Level (IL)

According to [3], verifying the presence of a class in an image takes 1 second.
Then, the cost is t_IL = 20 classes/image × 1 s/class = 20 s/image.

1.2. Image-Level + Counts (IL+C)

According to [8], the counting increases the annotation time to 1.48s per class.
t_IL+C = t_IL + 1.5 classes/image × 1.48 s/class = 22.22 s/image.

1.3. Full Supervision (Full)

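As reported by [3], t_Full = 18.5 classes/image × 1 s/class + 2.8 masks/image × 79 s/mask = 239.7 s/image. Here 18.5 = 20 − 1.5 is the average number of classes not present in an image, whose absence still has to be verified.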

1.4. Bounding Boxes (BB)

Recent techniques [18] have cut the cost of annotating a bounding box to 7.0 s/box.
The cost of annotating a Pascal VOC image with bounding boxes is t_BB = 18.5 classes/image × 1 s/class + 2.8 bb/image × 7 s/bb = 38.1 s/image.
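The four per-image costs can be reproduced with a few lines of Python. This is only an arithmetic sketch using the figures quoted in this section. The constant and function names are made up for illustration, and reading the 18.5 term as 20 − 1.5 is an assumption.

```python
# Dataset statistics and per-action timings quoted in this section.
NUM_CLASSES = 20          # class categories in the dataset
CLASSES_PER_IMAGE = 1.5   # average categories present per image
OBJECTS_PER_IMAGE = 2.8   # average objects per image
T_VERIFY = 1.0            # seconds to verify one class [3]
T_COUNT = 1.48            # seconds per class when counting instances [8]
T_MASK = 79.0             # seconds per instance mask [3]
T_BOX = 7.0               # seconds per bounding box [18]


def annotation_costs():
    """Return the average annotation cost in seconds per image."""
    # Assumption: the 18.5 classes/image term equals 20 - 1.5.
    absent = NUM_CLASSES - CLASSES_PER_IMAGE
    t_il = NUM_CLASSES * T_VERIFY
    return {
        "IL": t_il,
        "IL+C": t_il + CLASSES_PER_IMAGE * T_COUNT,
        "BB": absent * T_VERIFY + OBJECTS_PER_IMAGE * T_BOX,
        "Full": absent * T_VERIFY + OBJECTS_PER_IMAGE * T_MASK,
    }


costs = annotation_costs()
for name, seconds in costs.items():
    print(f"{name:5s} {seconds:7.2f} s/image")
```

With these numbers, a fully annotated image costs roughly 12 times as much as an image-level label and about 6 times as much as bounding boxes.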

2. Pseudo-Labeling for Semantic/Instance Segmentation

**Semi-supervised training pipeline consists of two networks, an annotation network trained with strong supervision, and a segmentation network trained with the union of pseudo-annotations and strong-labeled samples.**

The pipeline consists of two different networks with 3 steps as shown above:

A first fully supervised model fθ is trained with strong-labeled samples from the ground truth (X, Y)={(x1, y1), …, (xN, yN)}, where N is the total number of strong samples.
The network fθ is an annotation network used to predict pseudo-labels Y′= {y′1, …, y′M} for M unlabeled samples X′={x′1, …, x′M}.
A second segmentation network gφ is trained with (X, Y) ∪ (X′, Y′), as depicted in the figure above.
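The three steps can be sketched as follows. This is a toy illustration of the data flow only, not the authors' code: the "network" is a hypothetical one-dimensional threshold classifier standing in for a real segmentation model, and all function names and data are invented for the example.

```python
from statistics import mean


def train(samples):
    """Fit a toy 'segmenter': a threshold midway between the class means.

    samples is a list of (pixels, mask) pairs, where pixels are floats
    and mask entries are 0 (background) or 1 (foreground).
    """
    fg = [p for pixels, mask in samples for p, m in zip(pixels, mask) if m == 1]
    bg = [p for pixels, mask in samples for p, m in zip(pixels, mask) if m == 0]
    threshold = (mean(fg) + mean(bg)) / 2
    return lambda pixels: [1 if p > threshold else 0 for p in pixels]


def pseudo_label_pipeline(strong, unlabeled):
    # Step 1: train the annotation network f_theta on strong labels (X, Y).
    f_theta = train(strong)
    # Step 2: predict pseudo-labels Y' for the unlabeled samples X'.
    pseudo = [(x, f_theta(x)) for x in unlabeled]
    # Step 3: train the segmentation network g_phi on (X, Y) U (X', Y').
    g_phi = train(strong + pseudo)
    return f_theta, g_phi, pseudo


strong = [([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])]                # N = 1
unlabeled = [[0.0, 0.3, 0.7, 1.0], [0.15, 0.85, 0.95, 0.05]]   # M = 2
f_theta, g_phi, pseudo = pseudo_label_pipeline(strong, unlabeled)
print(pseudo[1][1])
print(g_phi([0.4, 0.6]))
```

The point of the sketch is that gφ never sees any annotation beyond the N strong samples; everything else it trains on is produced by fθ.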

3. Semantic Segmentation

fθ and gφ have the same architecture: DeepLabv3+ with an Xception-65 encoder.

**Performance of DeepLab-v3+ for the validation and test set of Pascal VOC 2012 with different supervision setups.**

Training DeepLabv3+ with both the strong labels Y and the pseudo-labels Y′ obtained with fθ gives a small improvement.

**Semantic segmentation performance of the annotation and segmentation networks for an increasing budget for the validation set of Pascal VOC.**

The number of strong-labeled training samples is varied over N ∈ {100, 200, 400, 800, 1464}.

For a given budget, the mIoU of gφN is always higher than that of fθN alone, so the extra pseudo-labels improve the performance. This suggests that pseudo-annotations can increase the quality of the segmentation tool at no additional cost.
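To make "budget" concrete, each N can be converted into annotation time using the full-supervision cost of 239.7 s/image from Section 1.3. The sketch below is plain arithmetic with hypothetical names. It also assumes the pool of 10582 images implied by M = 10582 − N in the instance segmentation figure caption applies here too.

```python
T_FULL = 239.7      # seconds per fully annotated image (Section 1.3)
POOL_SIZE = 10582   # assumed N + M, from M = 10582 - N in a later caption


def budget_table(strong_counts):
    """For each N, return the pseudo-labeled count M and the annotation budget."""
    rows = []
    for n in strong_counts:
        seconds = n * T_FULL
        rows.append({
            "N": n,
            "M": POOL_SIZE - n,          # images that only get pseudo-labels
            "budget_s": seconds,
            "budget_h": seconds / 3600,  # same budget expressed in hours
        })
    return rows


rows = budget_table([100, 200, 400, 800, 1464])
for row in rows:
    print(f"N={row['N']:5d}  M={row['M']:5d}  "
          f"{row['budget_s']:10.1f} s  ({row['budget_h']:.1f} h)")
```

Since pseudo-labels are produced by fθ at no annotation cost, the budget depends only on N, which is why fθN and gφN can be compared at the same budget.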

**Semantic Segmentation comparison for the validation set of Pascal VOC with other semi-supervised (SS) and weakly-supervised (WS) methods, that use image-level labels (IL) or bounding box labels (BB).**

DeepLabv3+ with the proposed semi-supervised approach outperforms all previous methods (weakly or semi-supervised) at the same or lower annotation budgets, setting a new state of the art of 79.41 mIoU for semi-supervised segmentation while using strong supervision only.
The proposed approach outperforms all weakly-supervised approaches when matching the annotation cost.

**Visualization of Pascal VOC validation set for the annotation fθN (A-) and segmentation networks gφN (S-), depending on the number of strong labels used N ∈ {200, 400, 800}.**

As expected, gφN obtains better segmentation results than its counterpart fθN.

4. Instance Segmentation

**Performance of RSIS for the validation set of Pascal VOC 2012 with different supervision setups.**

The recurrent architecture for instance segmentation RSIS [26] is used for both fθ and gφ.

A similar behavior to the semantic segmentation case is observed.

**Instance segmentation performance of the annotation and segmentation networks for an increasing budget for the validation set of Pascal VOC.**

The performance gap between fθN and gφN is more significant for the instance segmentation task.

When matching the annotation cost, the proposed semi-supervised approach reaches significantly better performance.

**Visualization of Pascal VOC validation set for the instance segmentation network gφN (S-) with N ∈ {100, 200, 400, 800, 1464} and M = 10582 − N.**

The higher N is, the better the network distinguishes between different instances.

5. Training with Heterogeneous Annotations (W-RSIS)

**RSIS architecture in the first row, and W-RSIS architecture in the second.**

Let Z be the IL+C labels for the strongly-annotated subset (X, Y, Z), and Z′ the IL+C labels for the weakly-annotated subset (X′, Z′).
To exploit the weak labels Z′, fθ now receives (X, Z) as input during training and is optimized to predict Y. At inference, it takes (X′, Z′) and produces the pseudo-labels Y′.
fθ is W-RSIS, which is a modified RSIS [26].
RSIS [26] consists of an encoder-decoder architecture. The encoder is a ResNet-101, and the decoder is formed by a set of stacked ConvLSTM layers.
At each time step, the decoder predicts a binary mask and a class category for one object of the image. The architecture also has a stop branch that indicates whether all objects have been covered.

For W-RSIS, the main difference is that, besides the features extracted by the encoder, the decoder receives at each time step a one-hot encoding of a class category (e.g. sheep), one for each of the instances in the image.
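As a rough illustration of this extra input, the IL+C labels of an image can be expanded into one one-hot vector per instance, i.e. one per decoder time step. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, and ordering the instances by class index is an assumption made only for the example.

```python
# The 20 Pascal VOC object categories.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def ilc_to_one_hot_sequence(counts):
    """Expand IL+C labels {class name: count} into per-time-step one-hot vectors.

    Assumption: instances are emitted in class-index order; the ordering
    actually used to train W-RSIS is not specified here.
    """
    sequence = []
    for index, name in enumerate(VOC_CLASSES):
        for _ in range(counts.get(name, 0)):
            one_hot = [0] * len(VOC_CLASSES)
            one_hot[index] = 1
            sequence.append(one_hot)
    return sequence


# An image labeled with two sheep and one dog yields three time steps.
steps = ilc_to_one_hot_sequence({"sheep": 2, "dog": 1})
for t, vec in enumerate(steps):
    print(t, VOC_CLASSES[vec.index(1)])
```

The length of the sequence also tells the decoder how many objects to expect, which is the information carried by the counts in IL+C.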

**(a) Ablation study of IL+C as inputs with the Pascal validation set. (b) Ablation study of different losses with the Pascal validation set.**

The ablation study above shows how the proposed W-RSIS architecture makes the most of the information contained in the IL+C weak labels.

**Comparison of RSIS annotation network.**

W-RSIS generates better annotations compared to RSIS at different annotation budgets.

**Comparison of pseudo-annotations obtained by RSIS (first row) and W-RSIS (second row) with N = 800.**

The knowledge about the category of the pseudo-annotation provided by the class label facilitates the task, resulting in better quality masks.