Review — Segmenting Transparent Objects in the Wild with Transformer
Trans10K-v2 Dataset & Trans2Seg Segmentation Approach
Segmenting Transparent Objects in the Wild with Transformer,
Trans10K-v2 & Trans2Seg, by The University of Hong Kong, SenseTime Research, and Nanjing University
2021 IJCAI, Over 40 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer
- Trans10K-v1 only has two limited categories. In this paper, Trans10K-v2 is proposed, which has 11 fine-grained categories of transparent objects, making it more practical for real-world applications and more challenging.
- Trans2Seg is proposed, a novel Transformer-based segmentation pipeline, which provides a global receptive field and uses a set of learnable prototypes as the query, where each prototype learns the statistics of one category.
Outline
- Trans10K-v2 Dataset
- Trans2Seg Approach
- Experimental Results
1. Trans10K-v2 Dataset
- Trans10K-v2 dataset is based on Trans10K-v1 dataset.
- The images are further annotated with more fine-grained categories according to the functional usages of different objects.
- Trans10K-v2 dataset contains 10,428 images, with two main categories and 11 fine-grained categories: (1) Transparent Things containing cup, bottle, jar, bowl and eyeglass. (2) Transparent Stuff containing window, shelf, box, freezer, glass wall and glass door.
- The background is labeled as 0, and the 11 categories are labeled from 1 to 11 (see the sketch after this subsection).
- The sum of all the image numbers is larger than 10,428 since some images contain multiple categories of objects.
Trans10K-v2 is very challenging and has promising potential in both computer vision and robotics research.
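For concreteness, here is a minimal Python sketch of the label convention described above. The category names come from the list in this post, but their exact ordering (and the `TRANS10K_V2_CLASSES` name) is an illustrative assumption, not the official dataset code.

```python
# Hypothetical label map for Trans10K-v2 as described above:
# background = 0, the 11 fine-grained categories = 1..11.
# The exact ordering of categories is an assumption for illustration.
TRANS10K_V2_CLASSES = {
    0: "background",
    # Transparent Things
    1: "cup", 2: "bottle", 3: "jar", 4: "bowl", 5: "eyeglass",
    # Transparent Stuff
    6: "window", 7: "shelf", 8: "box", 9: "freezer",
    10: "glass wall", 11: "glass door",
}

assert len(TRANS10K_V2_CLASSES) == 12  # 11 categories + background
```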
2. Trans2Seg Approach
The overall Trans2Seg architecture contains a CNN backbone, an encoder-decoder Transformer, and a small convolutional head.
2.1. Input
- With an input image of size (H, W, 3), the CNN backbone generates an image feature map of size (H/16, W/16, C), where C is the feature dimension (a shape sketch is given below).
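A minimal PyTorch sketch of this step, illustrating only the tensor shapes: `ToyBackbone` is a stand-in for the paper's CNN backbone (a small conv stack with an assumed C = 256 output channels), not the actual network used in Trans2Seg.

```python
import torch
import torch.nn as nn

# A minimal stand-in for the CNN backbone (NOT the paper's network), only to
# illustrate the tensor shapes: image -> 1/4 feature ("Res2") -> 1/16 feature -> tokens.
class ToyBackbone(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),   # 1/4 resolution ("Res2"-like)
        )
        self.body = nn.Sequential(
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, out_channels, 3, stride=2, padding=1), nn.ReLU(),  # 1/16 resolution
        )

    def forward(self, x):
        res2 = self.stem(x)      # (B, 128, H/4, W/4), kept for the small conv head later
        feat = self.body(res2)   # (B, C, H/16, W/16)
        return res2, feat

img = torch.randn(2, 3, 512, 512)                  # (B, 3, H, W)
res2, feat = ToyBackbone()(img)
tokens = feat.flatten(2).transpose(1, 2)           # (B, H/16 * W/16, C), flattened for the Transformer
print(res2.shape, feat.shape, tokens.shape)
```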
2.2. Encoder
The encoder takes as input the summation of the flattened feature and the positional embedding, and outputs an encoded feature Fe of size (H/16, W/16, C).
- The encoder is composed of stacked encoder layers, each of which consists of a multi-head self-attention (MHA) module and a feed-forward network (FFN), as sketched below.
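A minimal sketch of the encoder stage using PyTorch's built-in Transformer encoder layer. The hidden size, head count and depth below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# Minimal sketch of the encoder stage: add a learnable positional embedding to the
# flattened backbone tokens, then apply stacked self-attention (MHA) + FFN layers.
B, HW, C = 2, 32 * 32, 256                        # e.g. a 512x512 image at 1/16 resolution
tokens = torch.randn(B, HW, C)                    # flattened backbone feature

pos_embed = nn.Parameter(torch.zeros(1, HW, C))   # learnable positional embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=C, nhead=4, dim_feedforward=1024, batch_first=True),
    num_layers=4,
)

Fe = encoder(tokens + pos_embed)                  # encoded feature Fe, still (B, HW, C)
print(Fe.shape)
```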
2.3. Decoder
The decoder lets the learned class prototypes of size (N, C) interact with the encoded feature, and generates an attention map of size (N, M, H/16, W/16), where N is the number of categories and M is the number of heads in MHA.
- Specifically, the Transformer decoder takes as input a set of learnable class prototype embeddings as the query, denoted by Ecls, and the encoded feature as the key and value, denoted by Fe.
- The class prototype embeddings are learned category prototypes, updated iteratively by a series of decoder layers through MHA.
- If the iterative update rule of one decoder layer is denoted by ⊙, then the class prototype in each decoder layer is:
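A plausible way to write this update (the layer index i and depth L are notational assumptions, not copied from the paper): Ecls^(i) = Ecls^(i−1) ⊙ Fe, for i = 1, …, L, where L is the number of decoder layers.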
- In the final decoder layer, the attention map is extracted and fed into the small convolutional head, as sketched below.
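The sketch below illustrates the core decoder idea with a single, simplified cross-attention layer: N learnable class prototypes act as queries against the encoded feature Fe, and the per-head attention maps of shape (N, M, H/16, W/16) are what get passed to the conv head. `PrototypeCrossAttention`, the head count, and the choice of 12 classes (11 categories + background) are illustrative assumptions, and the iterative prototype update across layers is omitted.

```python
import torch
import torch.nn as nn

# Simplified single-layer version of the prototype-query cross-attention,
# only to show how per-head attention maps of shape (N, M, H/16, W/16) arise.
class PrototypeCrossAttention(nn.Module):
    def __init__(self, num_classes=12, dim=256, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.cls_proto = nn.Parameter(torch.randn(num_classes, dim))  # Ecls: (N, C)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, Fe, hw):
        # Fe: (B, HW, C) encoded feature; hw = (H/16, W/16)
        B, HW, C = Fe.shape
        N, M, D = self.cls_proto.shape[0], self.num_heads, self.head_dim
        q = self.q_proj(self.cls_proto).reshape(N, M, D)              # queries from class prototypes
        k = self.k_proj(Fe).reshape(B, HW, M, D)                      # keys from encoded feature
        attn = torch.einsum("nmd,bsmd->bnms", q, k) / D ** 0.5        # (B, N, M, HW)
        attn = attn.softmax(dim=-1)
        return attn.reshape(B, N, M, *hw)                             # (B, N, M, H/16, W/16)

Fe = torch.randn(2, 32 * 32, 256)
attn_map = PrototypeCrossAttention()(Fe, (32, 32))
print(attn_map.shape)   # torch.Size([2, 12, 4, 32, 32])
```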
2.4. Small convolutional head
- This small convolutional head up-samples the attention map to (N, M, H/4, W/4), fuses it with the high-resolution feature map Res2, and outputs an attention map of size (N, H/4, W/4), as sketched below.
The final segmentation is obtained by a pixel-wise argmax operation on the output attention map.
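A minimal sketch of this head: each class's M per-head attention maps are upsampled, concatenated with the shared Res2 feature, and squeezed to a single score map per class, followed by the pixel-wise argmax. The channel sizes and the `SmallConvHead` structure are illustrative assumptions rather than the paper's exact head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the small conv head: upsample attention maps to 1/4 resolution,
# fuse with Res2, and produce one score map per class.
class SmallConvHead(nn.Module):
    def __init__(self, num_heads=4, res2_channels=128, hidden=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(num_heads + res2_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1),                                  # one score map per class
        )

    def forward(self, attn, res2):
        # attn: (B, N, M, H/16, W/16), res2: (B, C2, H/4, W/4)
        B, N, M = attn.shape[:3]
        H4, W4 = res2.shape[-2:]
        attn = F.interpolate(attn.flatten(1, 2), size=(H4, W4), mode="bilinear",
                             align_corners=False).reshape(B, N, M, H4, W4)
        res2 = res2.unsqueeze(1).expand(B, N, -1, H4, W4)             # share Res2 across classes
        x = torch.cat([attn, res2], dim=2).flatten(0, 1)              # (B*N, M+C2, H/4, W/4)
        return self.fuse(x).reshape(B, N, H4, W4)                     # (B, N, H/4, W/4)

attn = torch.randn(2, 12, 4, 32, 32)
res2 = torch.randn(2, 128, 128, 128)
logits = SmallConvHead()(attn, res2)
seg = logits.argmax(dim=1)            # pixel-wise argmax over classes
print(logits.shape, seg.shape)
```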
2.5. Differences from SETR and DETR
- Different from SETR, the decoder of Trans2Seg is also a Transformer.
- In DETR, the decoder's queries represent N learnable objects because DETR is designed for object detection. However, in Trans2Seg, the queries represent N learnable class prototypes, where each query represents one category.
3. Experimental Results
3.1. Ablation Studies
- FCN is used as the baseline. The FCN baseline without the Transformer encoder achieves 62.7% mIoU.
When adding the Transformer encoder, the mIoU directly improves by 6.1%, reaching 68.8% mIoU. With the Transformer decoder, the mIoU further boosts up to 72.1%, a 3.3% improvement.
As the size of the Transformer increases, the mIoU first increases and then decreases. It is argued that, without massive data for pre-training (e.g., BERT used large-scale NLP data), a larger Transformer is not necessarily better for the task in this paper.
3.2. SOTA Comparison
Trans2Seg achieves state-of-the-art 72.15% mIoU and 94.14% pixel accuracy (ACC), significantly outperforming other pure CNN-based methods. For example, Trans2Seg is 2.1% higher than TransLab in mIoU.
- Trans2Seg tends to perform much better on small objects, such as ‘bottle’ and ‘eyeglass’ (10.0% and 5.0% higher than the previous SOTA). The Transformer’s long-range attention benefits small transparent object segmentation.
3.3. Visualizations
Benefiting from the Transformer’s large receptive field and attention mechanism, Trans2Seg can distinguish the background and transparent objects of different categories much better than other methods.
Trans2Seg fails to segment transparent objects in some complex scenarios.
3.4. ADE20K: General Semantic Segmentation
Trans2Seg also works well on the general semantic segmentation task, even without carefully tuning its hyperparameters on the ADE20K dataset.
Reference
[2021 IJCAI] [Trans10K-v2 & Trans2Seg]
Segmenting Transparent Objects in the Wild with Transformer
1.5. Semantic Segmentation / Scene Parsing
2015 … 2020 [DRRN Zhang JNCA’20] [Trans10K, TransLab] 2021 [PVT, PVTv1] [SETR] [Trans10K-v2 & Trans2Seg] 2022 [PVTv2]