Review — Segmenting Transparent Objects in the Wild
Trans10K & TransLab, Transparent Object Segmentation Dataset & Approach, are Proposed
Segmenting Transparent Objects in the Wild,
Trans10K & TransLab, by The University of Hong Kong, SenseTime Research, Nanjing University, and The University of Adelaide,
2020 ECCV, Over 40 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation
- Trans10K, a large transparent object segmentation dataset, is proposed.
- TransLab, a transparent object segmentation approach, is proposed, which exploits boundary prediction together with a Boundary Attention Module (BAM).
Outline
- Trans10K Dataset
- TransLab Approach (Lab: “Look at boundary”)
- Experimental Results
1. Trans10K Dataset
1.1. Dataset
- The Trans10K dataset contains 10,428 images, with two categories of transparent objects:
- Transparent things such as cups, bottles and glass; locating these things makes it easier for robots to grasp objects.
- Transparent stuff such as windows, glass walls and glass doors; segmenting such stuff helps robots learn to avoid obstacles and keep from hitting them.
- For labeling, the background is labeled 0, transparent things are labeled 1, and transparent stuff is labeled 2.
1.2. Dataset Comparison
- There are 5000, 1000 and 4428 images for train, validation and test, respectively.
- The images are manually harvested from the internet, from image libraries such as Google OpenImage, and from the authors' own data captured with phone cameras.
- The images are diverse, covering different scales, born-digital images, glass with perspective distortion, crowded scenes, and so on.
Trans10K is the largest real dataset focusing on transparent object segmentation in the wild, over 10× larger than existing real transparent object datasets.
- As shown above, Trans10K has more diverse scenes and more challenging viewpoints, categories, and occlusions than TransCut and TOM-Net.
1.3. Easy & Hard Cases
- The validation set and test set are further divided into two parts, easy and hard, according to difficulty.
1.4. More Statistics
- (a): The distribution of area ratio of connected components in each image.
- (b): The number of connected components of things and stuff in each image.
- (c): The distribution of the image resolution.
- (d): The distribution of the object location of the whole dataset.
2. TransLab Approach (Lab: “Look at boundary”)
2.1. (a) Whole Pipeline
- ResNet50 is the backbone.
- TransLab is composed of two parallel streams: a regular stream and a boundary stream.
- The regular stream is for transparent object segmentation while the boundary stream is for boundary prediction.
- The boundary ground-truth is a binary map in which boundary pixels, thickened to 8 pixels, are labeled 1 (a generation sketch is given after this list).
The boundary is easier to observe than the content because it tends to have high contrast at the edge, which is consistent with human visual perception.
- Thus, the predicted boundary map can be used as a clue to attend to the regular stream, which is implemented by the Boundary Attention Module (BAM).
- In each stream, an Atrous Spatial Pyramid Pooling (ASPP) module, as in DeepLabv3, is used to enlarge the receptive field.
- Finally, a simple decoder is designed to utilize both the high-level feature (C4) and the low-level features (C1 and C2).
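The thickness-8 boundary ground truth mentioned above can be derived from the segmentation mask with standard morphology. Below is a minimal sketch assuming OpenCV; the paper does not spell out the exact procedure, so the morphological-gradient-plus-dilation recipe here is an assumption:

```python
import cv2
import numpy as np

def make_boundary_gt(mask: np.ndarray, thickness: int = 8) -> np.ndarray:
    """Derive a binary boundary map (boundary pixels = 1) from a
    segmentation mask (0 = background, 1 = things, 2 = stuff).

    NOTE: the exact generation procedure is not given in the paper;
    morphological gradient + dilation is one plausible implementation.
    """
    binary = (mask > 0).astype(np.uint8)
    # 1-pixel-wide object contours via a morphological gradient.
    edges = cv2.morphologyEx(binary, cv2.MORPH_GRADIENT,
                             np.ones((3, 3), np.uint8))
    # Thicken the contours to the target boundary width.
    return cv2.dilate(edges, np.ones((thickness, thickness), np.uint8))
```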
2.2. (b) Boundary Attention Module (BAM)
- BAM takes the feature map of the regular stream and the predicted boundary map as input, then performs boundary attention using the boundary map.
- The two feature maps before and after boundary attention are concatenated and followed by a channel attention block.
- Finally, it outputs the refined feature map.
- BAM can be applied repeatedly to the high-level and low-level features of the regular stream, such as C1, C2 and C4. It is demonstrated that the more times boundary attention is applied, the better the segmentation results.
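As a concrete illustration, here is a minimal PyTorch sketch of a BAM-style block. It assumes boundary attention is an element-wise multiplication by the boundary probability map and that the channel attention block is SE-style; these design details and the layer sizes are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryAttentionModule(nn.Module):
    """Sketch of BAM: attend a regular-stream feature map with the
    predicted boundary map, concatenate the features before/after
    attention, then reweight with (assumed SE-style) channel attention."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, boundary: torch.Tensor) -> torch.Tensor:
        # Resize the 1-channel boundary probability map to the feature size.
        b = F.interpolate(boundary, size=feat.shape[2:], mode="bilinear",
                          align_corners=False)
        attended = feat * b                        # boundary attention
        fused = self.fuse(torch.cat([feat, attended], dim=1))
        return fused * self.channel_attn(fused)    # refined feature map
```

In this reading, the same predicted boundary map is resized and reused at each of C1, C2 and C4.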
2.3. (c) Decoder
- The inputs of the decoder are C1, C2 and C4 (after ASPP).
- C4 and C2 are first fused by up-sampling C4 and applying a 3×3 convolution. The fused feature map is further up-sampled and fused with C1 in the same way.
- In this way, high-level and low-level feature maps are jointly fused, which benefits semantic segmentation.
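A minimal PyTorch sketch of such a decoder follows; concatenation-based fusion, the channel widths, and the final 1×1 classifier are assumptions, since the paper's exact configuration is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDecoder(nn.Module):
    """Sketch of the decoder: progressively up-sample the high-level
    feature (C4 after ASPP) and fuse it with C2, then C1, via 3x3
    convolutions. Concat-based fusion and widths are assumptions."""

    def __init__(self, c1_ch: int, c2_ch: int, c4_ch: int,
                 num_classes: int = 3, width: int = 256):
        super().__init__()
        self.fuse_c2 = nn.Conv2d(c4_ch + c2_ch, width, 3, padding=1)
        self.fuse_c1 = nn.Conv2d(width + c1_ch, width, 3, padding=1)
        self.classifier = nn.Conv2d(width, num_classes, 1)

    def forward(self, c1, c2, c4):
        x = F.interpolate(c4, size=c2.shape[2:], mode="bilinear",
                          align_corners=False)
        x = F.relu(self.fuse_c2(torch.cat([x, c2], dim=1)))  # high + mid level
        x = F.interpolate(x, size=c1.shape[2:], mode="bilinear",
                          align_corners=False)
        x = F.relu(self.fuse_c1(torch.cat([x, c1], dim=1)))  # + low level
        return self.classifier(x)                            # per-pixel logits
```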
2.4. Loss Function
- The total loss is: L = Ls + λ·Lb,
- where Ls and Lb represent the losses for the segmentation and the boundary, and λ=5.
- Ls is cross-entropy (CE) loss.
- Lb can be binary cross-entropy (BCE) loss, focal loss, or dice loss.
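In code, the joint objective can be sketched as follows (dice loss is chosen for Lb here; BCE or focal loss would slot in the same way):

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor,
              eps: float = 1.0) -> torch.Tensor:
    """Dice loss for the binary boundary map (one option for Lb)."""
    p = torch.sigmoid(pred_logits)
    inter = (p * target).sum()
    return 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)

def translab_loss(seg_logits, seg_gt, boundary_logits, boundary_gt,
                  lam: float = 5.0) -> torch.Tensor:
    """Total loss L = Ls + lambda * Lb with lambda = 5.
    Ls is cross-entropy on the 3-class segmentation; Lb is the
    boundary loss (dice here; BCE or focal loss are alternatives)."""
    ls = F.cross_entropy(seg_logits, seg_gt)
    lb = dice_loss(boundary_logits, boundary_gt.float())
    return ls + lam * lb
```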
3. Experimental Results
3.1. Ablation Study
- DeepLabv3+ is used as the baseline.
- Merely adding the boundary loss as an auxiliary loss during training already improves segmentation performance.
- More boundary attention leads to better results.
In summary, locating boundaries is essential for transparent object segmentation.
3.2. SOTA Comparison
- Each model is trained on the Trans10K training set and evaluated on the Trans10K test set.
- Input images are of size 512×512. Single-scale training and testing are used.
TransLab outperforms all other methods on Trans10K in terms of all four metrics on both the easy/hard sets and the things/stuff categories.
- In particular, TransLab outperforms DeepLabv3+, the SOTA semantic segmentation method, by a large margin on all metrics for both things and stuff, especially on the hard set. For instance, it surpasses DeepLabv3+ by 3.97% on ‘Acc’ (hard set).
3.3. Visualization
TransLab clearly outperforms the others thanks to the boundary attention, especially in the yellow dashed regions.
Reference
[2020 ECCV] [Trans10K, TransLab]
Segmenting Transparent Objects in the Wild
1.5. Semantic Segmentation / Scene Parsing
2015 … 2020 [DRRN Zhang JNCA’20] [Trans10K, TransLab] 2021 [PVT, PVTv1] [SETR] 2022 [PVTv2]