Review — Budget-aware Semi-Supervised Semantic and Instance Segmentation

Pseudo Labeling for Semi-Supervised Image Segmentation

Sik-Ho Tsang
6 min read · Mar 15, 2023

Budget-aware Semi-Supervised Semantic and Instance Segmentation,
Bellver CVPRW’19, by Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya,
2019 CVPRW, Over 20 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Semantic Segmentation, Instance Segmentation


  • Generally, the annotation burden is mitigated by labeling datasets with weaker forms of supervision, e.g. image-level labels or bounding boxes. Another option is the semi-supervised setting.
  • In this paper, a very simple pipeline shows that, at low annotation budgets (reduced labeling cost), semi-supervised methods outperform weakly-supervised ones by a wide margin for both semantic and instance segmentation.


  1. Annotation Cost
  2. Pseudo-Labeling for Semantic/Instance Segmentation
  3. Semantic Segmentation Results
  4. Instance Segmentation Results
  5. Training with Heterogeneous Annotations (W-RSIS)

1. Annotation Cost

Average annotation cost per image when using different types of supervision.
  • On the Pascal VOC dataset:
  1. on average 1.5 class categories are present in each image,
  2. on average there are 2.8 objects per image, and
  3. there is a total of 20 class categories.
  • Four levels of supervision are considered: image-level labels, image-level labels + object counts, bounding boxes, and full supervision. The summary is shown above.

1.1. Image-Level (IL)

  • According to [3], verifying the presence of a class in an image takes 1 second.
  • The cost is therefore t_IL = 20 classes/image × 1 s/class = 20 s/image.

1.2. Image-Level + Counts (IL+C)

  • According to [8], counting the instances takes an additional 1.48 s for each class present in the image.
  • t_IL+C = t_IL + 1.5 classes/image × 1.48 s/class = 22.22 s/image.

1.3. Full Supervision (Full)

  • As reported by [3], t_Full = 18.5 classes/image × 1 s/class + 2.8 masks/image × 79 s/mask = 239.7 s/image.

1.4. Bounding Boxes (BB)

  • Recent techniques [18] have cut the cost of annotating a bounding box to 7.0 s/box.
  • The cost of annotating a Pascal VOC image with bounding boxes is t_BB = 18.5 classes/image × 1 s/class + 2.8 bb/image × 7 s/bb = 38.1 s/image.
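The four per-image costs above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this section; the constant and function names are my own, and reading 18.5 as the 20 − 1.5 classes that are absent from an average image is my interpretation of the formulas.

```python
# Figures quoted in this section (Pascal VOC statistics and per-item timings).
NUM_CLASSES = 20          # class categories in the dataset
CLASSES_PER_IMAGE = 1.5   # average number of classes present per image
OBJECTS_PER_IMAGE = 2.8   # average number of objects per image
T_VERIFY = 1.0            # seconds to verify the presence of one class [3]
T_COUNT = 1.48            # seconds of counting per present class [8]
T_MASK = 79.0             # seconds to draw one mask [3]
T_BOX = 7.0               # seconds to draw one bounding box [18]


def annotation_costs():
    """Return the average annotation cost in seconds per image."""
    # Assumption: 18.5 is the number of classes NOT present on average.
    absent = NUM_CLASSES - CLASSES_PER_IMAGE
    t_il = NUM_CLASSES * T_VERIFY
    t_il_c = t_il + CLASSES_PER_IMAGE * T_COUNT
    t_full = absent * T_VERIFY + OBJECTS_PER_IMAGE * T_MASK
    t_bb = absent * T_VERIFY + OBJECTS_PER_IMAGE * T_BOX
    return {"IL": t_il, "IL+C": t_il_c, "BB": t_bb, "Full": t_full}


for name, seconds in annotation_costs().items():
    print(f"{name}: {seconds:.2f} s/image")
```

With these numbers, a full mask annotation costs roughly twelve times as much as an image-level label, which is why the choice of supervision matters so much at a fixed budget.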

2. Pseudo-Labeling for Semantic/Instance Segmentation

The semi-supervised training pipeline consists of two networks: an annotation network trained with strong supervision, and a segmentation network trained with the union of pseudo-annotations and strong-labeled samples.
  • The pipeline consists of two different networks with 3 steps as shown above:
  1. A first fully supervised model fθ is trained with strong-labeled samples from the ground truth (X, Y) = {(x1, y1), …, (xN, yN)}, where N is the total number of strong samples.
  2. The network fθ is then used as an annotation network to predict pseudo-labels Y′ = {y′1, …, y′M} for M unlabeled samples X′ = {x′1, …, x′M}.
  3. A second segmentation network gφ is trained with (X, Y) ∪ (X′, Y′), as depicted in the figure above.
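The three steps can be sketched in a few lines. To keep the example runnable without a deep learning framework, the "network" here is a toy nearest-centroid classifier on scalar inputs; `train` and `pseudo_label_pipeline` are hypothetical names, and in the paper both models are segmentation networks, not this stand-in.

```python
from statistics import mean


def train(samples, labels):
    """Toy stand-in for training: store the mean input per class."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    centroids = {y: mean(xs) for y, xs in by_class.items()}

    def model(x):
        # Predict the class whose centroid is closest to x.
        return min(centroids, key=lambda y: abs(x - centroids[y]))

    return model


def pseudo_label_pipeline(X, Y, X_unlabeled):
    # Step 1: train the annotation network f_theta on the strong labels.
    f_theta = train(X, Y)
    # Step 2: predict pseudo-labels Y' for the unlabeled samples X'.
    Y_pseudo = [f_theta(x) for x in X_unlabeled]
    # Step 3: train the segmentation network g_phi on (X, Y) U (X', Y').
    g_phi = train(X + X_unlabeled, Y + Y_pseudo)
    return f_theta, Y_pseudo, g_phi


f_theta, Y_pseudo, g_phi = pseudo_label_pipeline(
    X=[0.0, 1.0, 9.0, 10.0], Y=[0, 0, 1, 1], X_unlabeled=[2.0, 8.0]
)
print(Y_pseudo, g_phi(4.0))
```

Note that the pseudo-labels are used as-is; this sketch applies no confidence filtering, which matches the "very simple pipeline" described above as far as the review states it.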

3. Semantic Segmentation Results

  • fθ and gφ have the same architecture: DeepLabv3+ with an Xception-65 encoder.
Performance of DeepLab-v3+ for the validation and test set of Pascal VOC 2012 with different supervision setups.

Training DeepLabv3+ with both the strong labels Y and the pseudo-labels Y′ obtained with fθ yields a small improvement.

Semantic segmentation performance of the annotation and segmentation networks for an increasing budget for the validation set of Pascal VOC.
  • The number of strong-labeled training samples is varied: N ∈ {100, 200, 400, 800, 1464}.

Given a certain budget, the mIoU of gφN is always higher than that of fθN alone, so the extra pseudo-labels improve performance. This suggests that pseudo-annotations can increase the quality of the segmentation at no additional annotation cost.
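For readers unfamiliar with the metric, mean intersection over union (mIoU) averages the per-class IoU between predicted and ground-truth pixel labels. The sketch below uses the standard definition on flat lists of class indices; the function name is my own, and it is a simplification, since benchmark evaluation typically accumulates counts over the whole dataset and ignores void pixels.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are flat sequences of integer class indices,
    one entry per pixel.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU = 1/2, class 1: IoU = 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```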

Semantic Segmentation comparison for the validation set of Pascal VOC with other semi-supervised (SS) and weakly-supervised (WS) methods, that use image-level labels (IL) or bounding box labels (BB).

DeepLabv3+ with the proposed semi-supervised approach outperforms all previous methods (weakly or semi-supervised) at the same or lower annotation budgets, setting a new state of the art of 79.41 mIoU for semi-supervised segmentation while using strong supervision only.

The proposed approach outperforms all weakly-supervised approaches when matching the annotation cost.

Visualization of the Pascal VOC validation set for the annotation networks fθN (A-N) and segmentation networks gφN (S-N), depending on the number of strong labels used, N ∈ {200, 400, 800}.

As expected, gφN obtains better segmentation results than its counterpart fθN.

4. Instance Segmentation Results

Performance of RSIS for the validation set of Pascal VOC 2012 with different supervision setups.
  • The recurrent architecture for instance segmentation, RSIS [26], is used for both fθ and gφ.

A similar behavior to the semantic segmentation case is observed.

Instance segmentation performance of the annotation and segmentation networks for an increasing budget for the validation set of Pascal VOC.

The performance gap between fθN and gφN is more significant for the instance segmentation task.

When matching the annotation cost, the proposed semi-supervised approach reaches significantly better performance.

Visualization of the Pascal VOC validation set for the instance segmentation network gφN (S-N) with N ∈ {100, 200, 400, 800, 1464} and M = 10582 − N.

The higher the N, the better the network distinguishes between different instances.

5. Training with Heterogeneous Annotations (W-RSIS)

RSIS architecture in the first row, and W-RSIS architecture in the second.
  • Let Z be the IL+C labels for the strongly-annotated subset (X, Y, Z), and Z′ the IL+C labels for the weakly-annotated subset (X′, Z′).
  • To exploit the weak labels Z′, the annotation network fθ now receives (X, Z) as input during training and is optimized to predict Y.
  • fθ is W-RSIS, a modified version of RSIS [26].
  • RSIS [26] is an encoder-decoder architecture. The encoder is a ResNet-101, and the decoder is formed by a set of stacked ConvLSTMs.
  • At each time step, the decoder predicts a binary mask and a class category for one object of the image. The architecture also has a stop branch that indicates whether all objects have been covered.
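The recurrent decoding with a stop branch can be pictured as a simple loop. This is a schematic sketch, not the RSIS implementation: `step_fn` stands in for one decoder time step, and the stop threshold, the step limit, and the choice to check the stop signal before keeping the prediction are all assumptions of mine.

```python
def decode_instances(step_fn, features, max_steps=10, stop_threshold=0.5):
    """Run a recurrent decoder until its stop branch fires.

    step_fn(features, state) is a hypothetical callable returning
    (mask, class_id, stop_prob, new_state) for one time step.
    """
    state = None
    instances = []
    for _ in range(max_steps):
        mask, class_id, stop_prob, state = step_fn(features, state)
        if stop_prob > stop_threshold:
            break  # the stop branch signals that all objects are covered
        instances.append((mask, class_id))
    return instances


def make_toy_step(num_objects):
    """Toy decoder step that emits num_objects dummy instances, then stops."""
    def step(features, state):
        t = 0 if state is None else state
        stop_prob = 1.0 if t >= num_objects else 0.0
        mask = [[t]]  # placeholder for a binary mask
        return mask, t, stop_prob, t + 1
    return step


print(len(decode_instances(make_toy_step(3), features=None)))
```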

For W-RSIS, the main difference is that, besides the features extracted by the encoder, the decoder receives at each time step a one-hot encoding of a class category (e.g.: sheep) representing each of the instances of the image.
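One way to picture this conditioning is to expand the IL+C labels (class names with counts) into one one-hot vector per instance and feed one vector per time step. The sketch below is illustrative only: the helper names, the ordering of instances by class index, and the concatenation with a flat feature vector are my assumptions, not details given in the text.

```python
# The 20 Pascal VOC object categories.
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def one_hot(class_name):
    """One-hot encoding of a class over the 20 categories."""
    vec = [0.0] * len(VOC_CLASSES)
    vec[VOC_CLASSES.index(class_name)] = 1.0
    return vec


def class_sequence(counts):
    """Expand IL+C labels {class: count} into one vector per instance."""
    seq = []
    for name in VOC_CLASSES:  # assumption: instances ordered by class index
        seq.extend(one_hot(name) for _ in range(counts.get(name, 0)))
    return seq


def decoder_inputs(features, counts):
    """Assumed conditioning: append the class vector to the features."""
    return [features + vec for vec in class_sequence(counts)]


steps = decoder_inputs([0.1, 0.2], {"sheep": 2, "dog": 1})
print(len(steps), len(steps[0]))
```

Because the counts fix the number of instances, the sequence length also tells the decoder how many time steps are needed for a given image.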

(a) Ablation study of IL+C as inputs with the Pascal validation set. (b) Ablation study of different losses with the Pascal validation set.

The ablation study above shows how the proposed W-RSIS architecture maximizes the information contained in the IL+C weak labels.

Comparison of the RSIS and W-RSIS annotation networks.

W-RSIS generates better annotations compared to RSIS at different annotation budgets.

Comparison of pseudo-annotations obtained by RSIS (first row) and W-RSIS (second row) with N = 800.

The knowledge about the category of the pseudo-annotation provided by the class label facilitates the task, resulting in better quality masks.

Results of the segmentation network when the annotation network changes (RSIS vs. W-RSIS) at different fixed annotation budgets (in days).

For example, at a very low annotation budget (0.55 days), W-RSIS outperforms RSIS.
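To relate a budget in days to numbers of images, the per-image costs from Section 1 can be summed over the strong and weak subsets. This is a rough sketch under two assumptions of mine: a "day" means 24 hours of annotation time, and the total cost is N × t_Full + M × t_IL+C. Under these assumptions, 200 fully annotated images come out at about 0.55 days, which is consistent with the budget quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # assumption: a day is 24 h of annotation time
T_FULL = 239.7   # s/image for full supervision (Section 1.3)
T_IL_C = 22.22   # s/image for image-level labels + counts (Section 1.2)


def budget_days(n_strong, m_weak=0):
    """Annotation budget in days for N strong and M weak (IL+C) images."""
    return (n_strong * T_FULL + m_weak * T_IL_C) / SECONDS_PER_DAY


def max_weak_samples(budget, n_strong):
    """Largest M of IL+C images that still fits in the budget (in days)."""
    remaining = budget * SECONDS_PER_DAY - n_strong * T_FULL
    if remaining < 0:
        raise ValueError("budget too small for the requested strong labels")
    return int(remaining // T_IL_C)


print(f"{budget_days(200):.2f} days for 200 strong images")
print(max_weak_samples(0.55, n_strong=100), "IL+C images fit next to 100 strong ones")
```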

Visualization of Pascal VOC validation set for the instance segmentation network gφ when the annotation network is W-RSIS.

Qualitative results are shown for W-RSIS for different numbers of weak-labeled samples.


