Review — Budget-aware Semi-Supervised Semantic and Instance Segmentation
Pseudo Labeling for Semi-Supervised Image Segmentation
Budget-aware Semi-Supervised Semantic and Instance Segmentation,
Bellver CVPRW’19, by Barcelona Supercomputing Center, and Universitat Politècnica de Catalunya,
2019 CVPRW, Over 20 Citations (Sik-Ho Tsang @ Medium)
Semi-Supervised Learning, Semantic Segmentation, Instance Segmentation
- Generally, the annotation burden is mitigated by labeling datasets with weaker forms of supervision, e.g. image-level labels or bounding boxes. Another option is a semi-supervised setting.
- In this paper, a very simple pipeline shows that, at low annotation budgets (reduced labeling cost), semi-supervised methods outperform weakly-supervised ones by a wide margin for both semantic and instance segmentation.
Outline
- Annotation Cost
- Pseudo-Labeling for Semantic/Instance Segmentation
- Semantic Segmentation Results
- Instance Segmentation Results
- Training with Heterogeneous Annotations (W-RSIS)
1. Annotation Cost
- On the Pascal VOC dataset:
- on average 1.5 class categories are present in each image,
- on average there are 2.8 objects per image, and
- there is a total of 20 class categories.
- Four levels of supervision are considered on the Pascal VOC dataset: image-level labels, image-level labels + object counts, bounding boxes, and full supervision. The summary is as shown above.
1.1. Image-Level (IL)
- According to [3], verifying the presence of a class in an image takes 1 second.
- The cost is then t_IL = 20 classes/image × 1 s/class = 20 s/image.
1.2. Image-Level + Counts (IL+C)
- According to [8], counting adds 1.48 s of annotation time per class present in the image.
- t_IL+C = t_IL + 1.5 classes/image × 1.48 s/class = 22.22 s/image.
1.3. Full Supervision (Full)
- As reported by [3], t_Full = 18.5 classes/image × 1 s/class + 2.8 masks/image × 79 s/mask = 239.7 s/image.
1.4. Bounding Boxes (BB)
- Recent techniques [18] have cut the cost of annotating a bounding box to 7.0 s/box.
- The cost of annotating a Pascal VOC image with bounding boxes is t_BB = 18.5 classes/image × 1 s/class + 2.8 bb/image × 7 s/bb = 38.1 s/image.
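The four per-image costs above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are made up for this example, and reading 18.5 as the 20 − 1.5 classes that are absent from an average image is an assumption about how the figure was obtained. `budget_days` is a hypothetical helper that converts a number of annotated images into annotation days.

```python
# Per-image annotation cost model, using the figures quoted above.
NUM_CLASSES = 20          # class categories in the dataset
CLASSES_PER_IMAGE = 1.5   # average categories present per image
OBJECTS_PER_IMAGE = 2.8   # average object instances per image

T_VERIFY = 1.0   # seconds to verify one class [3]
T_COUNT = 1.48   # seconds per present class for counting [8]
T_MASK = 79.0    # seconds per instance mask [3]
T_BOX = 7.0      # seconds per bounding box [18]


def annotation_costs():
    """Return the cost in seconds per image for each supervision level."""
    # Assumption: 18.5 is the average number of absent classes (20 - 1.5).
    absent = NUM_CLASSES - CLASSES_PER_IMAGE
    t_il = NUM_CLASSES * T_VERIFY
    t_il_c = t_il + CLASSES_PER_IMAGE * T_COUNT
    t_full = absent * T_VERIFY + OBJECTS_PER_IMAGE * T_MASK
    t_bb = absent * T_VERIFY + OBJECTS_PER_IMAGE * T_BOX
    return {"IL": t_il, "IL+C": t_il_c, "Full": t_full, "BB": t_bb}


def budget_days(num_images, seconds_per_image):
    """Convert an annotation workload into days (86400 seconds per day)."""
    return num_images * seconds_per_image / 86400.0


if __name__ == "__main__":
    for level, seconds in annotation_costs().items():
        print(f"{level}: {seconds:.2f} s/image")
```

With these numbers, a full-supervision label costs roughly twelve times as much as an image-level label, which is why the comparison between methods is made at matched budgets rather than at matched numbers of images.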
2. Pseudo-Labeling for Semantic/Instance Segmentation
- The pipeline consists of two different networks with 3 steps as shown above:
- A first fully supervised model fθ is trained with strong-labeled samples from the ground truth (X, Y)={(x1, y1), …, (xN, yN)}, where N is the total number of strong samples.
- The network fθ is an annotation network used to predict pseudo-labels Y′= {y′1, …, y′M} for M unlabeled samples X′={x′1, …, x′M}.
- A second segmentation network gφ is trained with (X, Y) ∪ (X′, Y′), as depicted in the figure above.
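The three steps can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the "networks" here are hypothetical nearest-centroid pixel classifiers standing in for fθ and gφ, images are flat lists of intensities, and the names `train`, `predict`, and `pseudo_label_pipeline` are invented for this example. Only the data flow (train on strong labels, predict pseudo-labels, retrain on the union) follows the description above.

```python
def train(images, masks, num_classes):
    """Toy stand-in for a segmentation network: one mean intensity per class."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for image, mask in zip(images, masks):
        for pixel, label in zip(image, mask):
            sums[label] += pixel
            counts[label] += 1
    # Classes never seen during training get no centroid (None).
    return [sums[c] / counts[c] if counts[c] else None
            for c in range(num_classes)]


def predict(model, image):
    """Label each pixel with the class whose centroid is closest."""
    classes = [c for c, centroid in enumerate(model) if centroid is not None]
    return [min(classes, key=lambda c: abs(pixel - model[c]))
            for pixel in image]


def pseudo_label_pipeline(strong_images, strong_masks, unlabeled_images,
                          num_classes):
    # Step 1: train the annotation network f_theta on the strong pairs (X, Y).
    f_theta = train(strong_images, strong_masks, num_classes)
    # Step 2: use f_theta to predict pseudo-labels Y' for the unlabeled X'.
    pseudo_masks = [predict(f_theta, image) for image in unlabeled_images]
    # Step 3: train the segmentation network g_phi on (X, Y) union (X', Y').
    g_phi = train(strong_images + unlabeled_images,
                  strong_masks + pseudo_masks, num_classes)
    return f_theta, pseudo_masks, g_phi


if __name__ == "__main__":
    X = [[0.1, 0.2, 0.9, 0.8]]        # one strong-labeled image
    Y = [[0, 0, 1, 1]]                # its ground-truth mask
    X_unlabeled = [[0.0, 0.3, 1.0, 0.7]]
    _, Y_pseudo, g_phi = pseudo_label_pipeline(X, Y, X_unlabeled, 2)
    print(Y_pseudo, g_phi)
```

Keeping fθ and gφ as separate models means the annotation network can differ from the final one, which is what the heterogeneous-annotation variant later relies on.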
3. Semantic Segmentation
- fθ and gφ share the same architecture: DeepLabv3+ with an Xception-65 encoder.
Training DeepLabv3+ with both the strong labels Y and the pseudo-labels Y′ obtained with fθ yields a small improvement.
- The number of strong-labeled training samples is varied over N ∈ {100, 200, 400, 800, 1464}.
Given a certain budget, the mIoU of gφN is always higher than that obtained with fθN alone, so the extra pseudo-labels improve performance. This suggests that pseudo-annotations can increase the quality of the segmentation tool at no additional cost.
DeepLabv3+ with the proposed semi-supervised approach outperforms all previous methods (weakly or semi-supervised) at the same or lower annotation budgets, setting a new state of the art of 79.41 mIoU for semi-supervised segmentation, using strong supervision only.
The proposed approach outperforms all weakly-supervised approaches when matching the annotation cost.
As expected, gφN obtains better segmentation results than its counterpart fθN.
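For readers unfamiliar with the metric used in these comparisons, here is a minimal sketch of mean intersection-over-union (mIoU). It assumes masks are flat lists of integer class labels, accumulates intersections and unions over the whole set of images, and skips classes that appear in neither the predictions nor the ground truth; these choices, and the function name `mean_iou`, are assumptions of this example rather than details taken from the paper. The function returns a fraction in [0, 1], whereas the results above are quoted as percentages.

```python
def mean_iou(pred_masks, true_masks, num_classes):
    """Mean over classes of |pred AND true| / |pred OR true|, in [0, 1]."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred, true in zip(pred_masks, true_masks):
        for p, t in zip(pred, true):
            if p == t:
                # Correct pixel: counts once in both intersection and union.
                intersection[t] += 1
                union[t] += 1
            else:
                # Wrong pixel: false positive for p, false negative for t.
                union[p] += 1
                union[t] += 1
    # Assumption: classes absent from both prediction and truth are skipped.
    ious = [intersection[c] / union[c]
            for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    prediction = [[0, 0, 1, 1]]
    ground_truth = [[0, 1, 1, 1]]
    print(mean_iou(prediction, ground_truth, num_classes=2))
```

In the usage example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so a single mislabeled pixel out of four already pulls the mean well below 1.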
4. Instance Segmentation
- The recurrent architecture for instance segmentation RSIS [26] is used for both fθ and gφ.
A similar behavior to the semantic segmentation case is observed.
The performance gap between fθN and gφN is more significant for the instance segmentation task.
When matching the annotation cost, the proposed semi-supervised approach reaches significantly better performance.
The higher the N, the better the network distinguishes between different instances.
5. Training with Heterogeneous Annotations (W-RSIS)
- Let Z be the IL+C labels for the strongly-annotated subset (X, Y, Z), and Z′ the IL+C labels for the weakly-annotated subset (X′, Z′).
- To exploit the weak labels Z′, fθ now receives (X, Z) as input during training and is optimized to predict Y. At inference, it receives (X′, Z′) and predicts the pseudo-labels Y′.
- fθ is W-RSIS, which is a modified RSIS [26].
- RSIS [26] is an encoder-decoder architecture. The encoder is a ResNet-101, and the decoder is formed by a set of stacked ConvLSTM layers.
- At each time step, the decoder predicts a binary mask and a class category for one object in the image. The architecture also has a stop branch that indicates whether all objects have been covered.
For W-RSIS, the main difference is that, besides the features extracted by the encoder, the decoder receives at each time step a one-hot encoding of a class category (e.g. sheep), with one such encoding for each instance in the image.
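The conditioning described above can be illustrated by expanding IL+C labels into one one-hot vector per instance, i.e. one per decoder time step. This is a sketch under stated assumptions: the ordering of instances (by class index), the simple concatenation of the one-hot code to the encoder features, and the flat-list representation of features are all choices made for this example, and the function names are hypothetical. The source only states that the decoder receives a one-hot class encoding at each time step.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def class_sequence(counts, class_names):
    """Expand IL+C labels (class name -> count) to one one-hot per instance."""
    sequence = []
    # Assumption: instances are ordered by class index.
    for index, name in enumerate(class_names):
        for _ in range(counts.get(name, 0)):
            sequence.append(one_hot(index, len(class_names)))
    return sequence


def decoder_inputs(encoder_features, counts, class_names):
    """Per-time-step input: encoder features plus the class one-hot.

    Assumption: features and code are joined by plain concatenation.
    """
    return [encoder_features + code
            for code in class_sequence(counts, class_names)]


if __name__ == "__main__":
    classes = ["dog", "sheep"]
    il_c_labels = {"sheep": 2, "dog": 1}   # image-level labels + counts
    for step, x in enumerate(decoder_inputs([0.5, 0.25], il_c_labels, classes)):
        print(step, x)
```

Because the counts fix the number of time steps in advance, the decoder in this sketch knows both how many masks to produce and which category each mask belongs to.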
The ablation study above shows how the proposed W-RSIS architecture maximizes the information contained in the IL+C weak labels.
W-RSIS generates better annotations compared to RSIS at different annotation budgets.
The knowledge about the category of the pseudo-annotation provided by the class label facilitates the task, resulting in better quality masks.
For example, at a very low annotation budget (0.55 days), W-RSIS outperforms RSIS.
Qualitative results are shown for W-RSIS for different numbers of weak-labeled samples.