Review — Segmenting Transparent Objects in the Wild with Transformer
Trans10K-v2 Dataset & Trans2Seg Segmentation Approach
Segmenting Transparent Objects in the Wild with Transformer,
Trans10K-v2 & Trans2Seg, by The University of Hong Kong, SenseTime Research, and Nanjing University
2021 IJCAI, Over 40 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer
- Trans10K-v1 only has two limited categories. In this paper, Trans10K-v2 is proposed, which has 11 fine-grained categories of transparent objects, making it more practical for real-world applications and more challenging.
- Trans2Seg is proposed, a novel Transformer-based segmentation pipeline, which provides a global receptive field and uses a set of learnable prototypes as the query, where each prototype learns the statistics of one category.
Outline
- Trans10K-v2 Dataset
- Trans2Seg Approach
- Experimental Results
1. Trans10K-v2 Dataset
- Trans10K-v2 dataset is based on Trans10K-v1 dataset.
- The images are further annotated with more fine-grained categories according to the functional usages of different objects.
- Trans10K-v2 dataset contains 10,428 images, with two main categories and 11 fine-grained categories: (1) Transparent Things containing cup, bottle, jar, bowl and eyeglass. (2) Transparent Stuff containing window, shelf, box, freezer, glass wall and glass door.
- The background is labeled as 0, and the 11 categories are labeled from 1 to 11 (see the sketch after this subsection).
- The sum of all the image numbers is larger than 10,428 since some images contain multiple categories of objects.
Trans10K-v2 is very challenging and has promising potential in both computer vision and robotics research.
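For concreteness, here is a minimal Python sketch of the label convention described above. The category names come from the list in this post, but their exact ordering (and the `TRANS10K_V2_CLASSES` name) is an illustrative assumption, not the official dataset code.

```python
# Hypothetical label map for Trans10K-v2 as described above:
# background = 0, the 11 fine-grained categories = 1..11.
# The exact ordering of categories is an assumption for illustration.
TRANS10K_V2_CLASSES = {
    0: "background",
    # Transparent Things
    1: "cup", 2: "bottle", 3: "jar", 4: "bowl", 5: "eyeglass",
    # Transparent Stuff
    6: "window", 7: "shelf", 8: "box", 9: "freezer",
    10: "glass wall", 11: "glass door",
}

assert len(TRANS10K_V2_CLASSES) == 12  # 11 categories + background
```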
2. Trans2Seg Approach
The overall Trans2Seg architecture contains a CNN backbone, an encoder-decoder Transformer, and a small convolutional head.
2.1. Input
- With an input image of size (H, W, 3), the CNN backbone generates an image feature map of size (H/16, W/16, C), where C is the feature dimension (a shape sketch is given below).
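A minimal PyTorch sketch of this step, illustrating only the tensor shapes: `ToyBackbone` is a stand-in for the paper's CNN backbone (a small conv stack with an assumed C = 256 output channels), not the actual network used in Trans2Seg.

```python
import torch
import torch.nn as nn

# A minimal stand-in for the CNN backbone (NOT the paper's network), only to
# illustrate the tensor shapes: image -> 1/4 feature ("Res2") -> 1/16 feature -> tokens.
class ToyBackbone(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),   # 1/4 resolution ("Res2"-like)
        )
        self.body = nn.Sequential(
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, out_channels, 3, stride=2, padding=1), nn.ReLU(),  # 1/16 resolution
        )

    def forward(self, x):
        res2 = self.stem(x)      # (B, 128, H/4, W/4), kept for the small conv head later
        feat = self.body(res2)   # (B, C, H/16, W/16)
        return res2, feat

img = torch.randn(2, 3, 512, 512)                  # (B, 3, H, W)
res2, feat = ToyBackbone()(img)
tokens = feat.flatten(2).transpose(1, 2)           # (B, H/16 * W/16, C), flattened for the Transformer
print(res2.shape, feat.shape, tokens.shape)
```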
2.2. Encoder
The encoder takes as input the summation of the flattened feature and the positional embedding, and outputs an encoded feature Fe of size (H/16, W/16, C).
- The encoder is composed of stacked encoder layers, each of which consists of a multi-head self-attention (MHA) module and a feed-forward network (FFN), as sketched below.
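A minimal sketch of the encoder stage using PyTorch's built-in Transformer encoder layer. The hidden size, head count and depth below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# Minimal sketch of the encoder stage: add a learnable positional embedding to the
# flattened backbone tokens, then apply stacked self-attention (MHA) + FFN layers.
B, HW, C = 2, 32 * 32, 256                        # e.g. a 512x512 image at 1/16 resolution
tokens = torch.randn(B, HW, C)                    # flattened backbone feature

pos_embed = nn.Parameter(torch.zeros(1, HW, C))   # learnable positional embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=C, nhead=4, dim_feedforward=1024, batch_first=True),
    num_layers=4,
)

Fe = encoder(tokens + pos_embed)                  # encoded feature Fe, still (B, HW, C)
print(Fe.shape)
```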
2.3. Decoder
The decoder lets the learned class prototypes of size (N, C) interact with the encoded feature, and generates an attention map of size (N, M, H/16, W/16), where N is the number of categories and M is the number of heads in MHA.
- Specifically, the Transformer decoder takes as input a set of learnable class prototype embeddings as the query, denoted by Ecls, and the encoded feature as the key and value, denoted by Fe.
- The class prototype embeddings are learned category prototypes, updated iteratively by a series of decoder layers through MHA.
- If the iterative update rule of one decoder layer is denoted by ⊙, then the class prototype in each decoder layer is:
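A plausible way to write this update (the layer index i and depth L are notational assumptions, not copied from the paper): Ecls^(i) = Ecls^(i−1) ⊙ Fe, for i = 1, …, L, where L is the number of decoder layers.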
- In the final decoder layer, the attention map is extracted and fed into the small convolutional head, as sketched below.
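The sketch below illustrates the core decoder idea with a single, simplified cross-attention layer: N learnable class prototypes act as queries against the encoded feature Fe, and the per-head attention maps of shape (N, M, H/16, W/16) are what get passed to the conv head. `PrototypeCrossAttention`, the head count, and the choice of 12 classes (11 categories + background) are illustrative assumptions, and the iterative prototype update across layers is omitted.

```python
import torch
import torch.nn as nn

# Simplified single-layer version of the prototype-query cross-attention,
# only to show how per-head attention maps of shape (N, M, H/16, W/16) arise.
class PrototypeCrossAttention(nn.Module):
    def __init__(self, num_classes=12, dim=256, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.cls_proto = nn.Parameter(torch.randn(num_classes, dim))  # Ecls: (N, C)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, Fe, hw):
        # Fe: (B, HW, C) encoded feature; hw = (H/16, W/16)
        B, HW, C = Fe.shape
        N, M, D = self.cls_proto.shape[0], self.num_heads, self.head_dim
        q = self.q_proj(self.cls_proto).reshape(N, M, D)              # queries from class prototypes
        k = self.k_proj(Fe).reshape(B, HW, M, D)                      # keys from encoded feature
        attn = torch.einsum("nmd,bsmd->bnms", q, k) / D ** 0.5        # (B, N, M, HW)
        attn = attn.softmax(dim=-1)
        return attn.reshape(B, N, M, *hw)                             # (B, N, M, H/16, W/16)

Fe = torch.randn(2, 32 * 32, 256)
attn_map = PrototypeCrossAttention()(Fe, (32, 32))
print(attn_map.shape)   # torch.Size([2, 12, 4, 32, 32])
```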
2.4. Small convolutional head
- This small convolutional head up-samples the attention map to (N, M, H/4, W/4), fuses it with the high-resolution feature map Res2, and outputs an attention map of size (N, H/4, W/4), as sketched below.
The final segmentation is obtained by a pixel-wise argmax operation on the output attention map.
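A minimal sketch of this head: each class's M per-head attention maps are upsampled, concatenated with the shared Res2 feature, and squeezed to a single score map per class, followed by the pixel-wise argmax. The channel sizes and the `SmallConvHead` structure are illustrative assumptions rather than the paper's exact head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the small conv head: upsample attention maps to 1/4 resolution,
# fuse with Res2, and produce one score map per class.
class SmallConvHead(nn.Module):
    def __init__(self, num_heads=4, res2_channels=128, hidden=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(num_heads + res2_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 1),                                  # one score map per class
        )

    def forward(self, attn, res2):
        # attn: (B, N, M, H/16, W/16), res2: (B, C2, H/4, W/4)
        B, N, M = attn.shape[:3]
        H4, W4 = res2.shape[-2:]
        attn = F.interpolate(attn.flatten(1, 2), size=(H4, W4), mode="bilinear",
                             align_corners=False).reshape(B, N, M, H4, W4)
        res2 = res2.unsqueeze(1).expand(B, N, -1, H4, W4)             # share Res2 across classes
        x = torch.cat([attn, res2], dim=2).flatten(0, 1)              # (B*N, M+C2, H/4, W/4)
        return self.fuse(x).reshape(B, N, H4, W4)                     # (B, N, H/4, W/4)

attn = torch.randn(2, 12, 4, 32, 32)
res2 = torch.randn(2, 128, 128, 128)
logits = SmallConvHead()(attn, res2)
seg = logits.argmax(dim=1)            # pixel-wise argmax over classes
print(logits.shape, seg.shape)
```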
2.5. Differences from SETR and DETR
- Different from SETR, the decoder of Trans2Seg is also a Transformer.
- In DETR, the decoder's queries represent N learnable objects because DETR is designed for object detection. However, in Trans2Seg, the queries represent N learnable class prototypes, where each query represents one category.
3. Experimental Results
3.1. Ablation Studies
- FCN is used as the baseline. The FCN baseline without the Transformer encoder achieves 62.7% mIoU.
When adding the Transformer encoder, the mIoU directly improves by 6.1%, reaching 68.8% mIoU. With the Transformer decoder, the mIoU further boosts up to 72.1%, a 3.3% improvement.
As the size of the Transformer increases, the mIoU first increases and then decreases. It is argued that, without massive data for pre-training (e.g., BERT used large-scale NLP data), a larger Transformer is not necessarily better for the task in this paper.
3.2. SOTA Comparison
Trans2Seg achieves state-of-the-art 72.15% mIoU and 94.14% pixel accuracy (ACC), significantly outperforming other pure CNN-based methods. For example, Trans2Seg is 2.1% higher than TransLab in mIoU.
- Trans2Seg tends to perform much better on small objects, such as ‘bottle’ and ‘eyeglass’ (10.0% and 5.0% higher than the previous SOTA). The Transformer’s long-range attention benefits small transparent object segmentation.
3.3. Visualizations
Benefiting from the Transformer’s large receptive field and attention mechanism, Trans2Seg can distinguish the background and transparent objects of different categories much better than other methods.
Trans2Seg fails to segment transparent objects in some complex scenarios.
3.4. ADE20K: General Semantic Segmentation
Trans2Seg also works well on the general semantic segmentation task, even without carefully tuning its hyperparameters on the ADE20K dataset.
Reference
[2021 IJCAI] [Trans10K-v2 & Trans2Seg]
Segmenting Transparent Objects in the Wild with Transformer
1.5. Semantic Segmentation / Scene Parsing
2015 … 2020 [DRRN Zhang JNCA’20] [Trans10K, TransLab] 2021 [PVT, PVTv1] [SETR] [Trans10K-v2 & Trans2Seg] 2022 [PVTv2]