Review — SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

SETR, Ranks 1st Place in the Highly Competitive ADE20K Test Server Leaderboard

Nov 7, 2022


(a) Schematic illustration of the proposed SEgmentation TRansformer (SETR). SETR first splits an image into fixed-size patches, linearly embeds each of them, adds position embeddings, and feeds the resulting sequence of vectors to a standard Transformer encoder. To perform pixel-wise segmentation, different decoder designs are introduced: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers,
SETR, by Fudan University, University of Oxford, University of Surrey, Tencent Youtu Lab, and Facebook AI
2021 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer, Vision Transformer, ViT

Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged.
In this paper, the SEgmentation TRansformer (SETR) is proposed. It treats semantic segmentation as a sequence-to-sequence prediction task, using a pure Transformer to encode an image as a sequence of patches.

Outline

1. SEgmentation TRansformer (SETR) Input & Encoder
2. SETR Decoder Designs
3. Ablation Studies
4. Experimental Results

1. SEgmentation TRansformer (SETR) Input & Encoder

1.1. Input

A straightforward way to sequentialize an image is to flatten its pixel values into a 1D vector of size 3HW. However, because the complexity of the Transformer is quadratic in the sequence length, such a long sequence is impractical. Instead:

An image x of size H×W×3 is uniformly divided into a grid of H/16×W/16 patches, and this grid is then flattened into a sequence.
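To make the difference in sequence length concrete, here is a small arithmetic sketch. The 480×480 input size and the function name `sequence_lengths` are only illustrative assumptions, not values taken from the paper.

```python
def sequence_lengths(height, width, patch=16):
    """Return (pixel_sequence_length, patch_sequence_length) for an H x W x 3 image."""
    pixel_len = 3 * height * width                     # flatten every pixel value
    patch_len = (height // patch) * (width // patch)   # one token per 16x16 patch
    return pixel_len, patch_len

# Hypothetical input size, chosen only for illustration.
pixel_len, patch_len = sequence_lengths(480, 480)

# Self-attention compares every token with every other token,
# so the attention matrix has length**2 entries.
print("pixels :", pixel_len, "tokens ->", pixel_len ** 2, "attention entries")
print("patches:", patch_len, "tokens ->", patch_len ** 2, "attention entries")
```

With 16×16 patches the sequence shrinks by a factor of 3×16×16 = 768, and the attention matrix by the square of that factor.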

By further mapping each vectorized patch p into a latent C-dimensional embedding space using a linear projection function f: p→e, a 1D sequence of patch embeddings for an image x is obtained.
To encode the spatial information of the patches, a specific embedding p_i is learnt for every location i and added to e_i to form the final input sequence E={e_1+p_1, e_2+p_2, …, e_L+p_L}. This way, spatial information is kept despite the orderless nature of self-attention in Transformers.
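The input pipeline described above can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the function name `patch_embed`, the small image size, the embedding width C=32, and the random weights (which stand in for learned parameters) are all assumptions for illustration.

```python
import numpy as np

def patch_embed(image, proj, pos, patch=16):
    """Split an (H, W, 3) image into patches, project them, add position embeddings."""
    h, w, ch = image.shape
    gh, gw = h // patch, w // patch                     # grid of H/16 x W/16 patches
    # (gh, patch, gw, patch, ch) -> (gh, gw, patch, patch, ch) -> (L, patch*patch*ch)
    patches = image.reshape(gh, patch, gw, patch, ch).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * ch)
    e = patches @ proj                                  # linear projection f: p -> e, shape (L, C)
    return e + pos                                      # e_i + p_i for every location i

rng = np.random.default_rng(0)
H, W, C = 64, 96, 32                                    # illustrative sizes (assumption)
image = rng.standard_normal((H, W, 3))
proj = rng.standard_normal((16 * 16 * 3, C)) * 0.02     # stands in for the learned projection
pos = rng.standard_normal(((H // 16) * (W // 16), C)) * 0.02  # stands in for learned p_i
E = patch_embed(image, proj, pos)
print(E.shape)                                          # (L, C) with L = (H/16) * (W/16)
```

The sketch assumes H and W are divisible by the patch size; how other sizes are handled is not covered here.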

1.2. Transformer

(Please feel free to read Transformer if interested.)
The Transformer encoder consists of Le layers of multi-head self-attention (MSA) and Multilayer Perceptron (MLP) blocks.

Layer norm is applied before MSA and MLP blocks.

{Z1, Z2, …, ZLe} are the output features of the Le Transformer layers.
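As a rough sketch of what one such layer computes, the code below stacks a few pre-norm encoder layers and collects the per-layer features. It is deliberately simplified and makes several assumptions: a single attention head instead of multi-head attention, no biases or learnable layer-norm parameters, ReLU standing in for the MLP non-linearity, and random weights in place of trained ones. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(z, wq, wk, wv):
    q, k, v = z @ wq, z @ wk, z @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (L, L): every token attends to all tokens
    return weights @ v

def encoder_layer(z, p):
    # Layer norm is applied before the attention block and before the MLP block.
    z = z + self_attention(layer_norm(z), p["wq"], p["wk"], p["wv"])
    hidden = np.maximum(layer_norm(z) @ p["w1"], 0.0)   # ReLU used here as a stand-in
    return z + hidden @ p["w2"]

def init_params(rng, c, hidden):
    shapes = {"wq": (c, c), "wk": (c, c), "wv": (c, c), "w1": (c, hidden), "w2": (hidden, c)}
    return {name: rng.standard_normal(shape) * 0.02 for name, shape in shapes.items()}

rng = np.random.default_rng(0)
L, C, Le = 24, 32, 3                    # illustrative sizes, far smaller than T-Base/T-Large
z = rng.standard_normal((L, C))
features = []                           # plays the role of {Z1, ..., ZLe}
for _ in range(Le):
    z = encoder_layer(z, init_params(rng, C, 4 * C))
    features.append(z)
print(len(features), features[-1].shape)
```

Every layer keeps the (L, C) shape, which is why all Z features share the same resolution.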

2. SETR Decoder Designs

Three decoder designs are proposed. The naïve one serves as the baseline.

2.1. SETR-Naïve: Naïve upsampling (Naïve)

A simple 2-layer network with architecture: 1×1 conv + sync batch norm (w/ ReLU) + 1×1 conv. After that, the output is simply bilinearly upsampled to the full image resolution, followed by a classification layer with pixel-wise cross-entropy loss.
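The naïve head can be sketched in NumPy as below. This is an illustration under stated assumptions, not the authors' code: sync batch norm is omitted, the second 1×1 conv is assumed to map directly to the number of classes, the half-pixel sampling convention in `bilinear_upsample` is one common choice, and all sizes, weights, and labels are random placeholders.

```python
import numpy as np

def bilinear_upsample(x, scale):
    """Bilinearly upsample an (h, w, c) map by an integer factor."""
    h, w, _ = x.shape
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def naive_head(tokens, grid_hw, w1, w2, scale=16):
    gh, gw = grid_hw
    feat = tokens.reshape(gh, gw, -1)        # (H/16, W/16, C)
    feat = np.maximum(feat @ w1, 0.0)        # 1x1 conv is a per-location linear map, then ReLU
    logits = feat @ w2                       # second 1x1 conv -> per-class scores
    return bilinear_upsample(logits, scale)  # back to full resolution (H, W, K)

def pixel_cross_entropy(logits, labels):
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()                    # average over all pixels

rng = np.random.default_rng(0)
gh, gw, C, K = 2, 2, 8, 5                    # tiny illustrative sizes (assumption)
tokens = rng.standard_normal((gh * gw, C))
w1 = rng.standard_normal((C, C)) * 0.1
w2 = rng.standard_normal((C, K)) * 0.1
logits = naive_head(tokens, (gh, gw), w1, w2)
labels = rng.integers(0, K, size=logits.shape[:2])
loss = pixel_cross_entropy(logits, labels)
print(logits.shape, float(loss))
```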

2.2. SETR-PUP: Progressive UPsampling (PUP)

**Progressive upsampling (resulting in a variant called SETR-PUP)**

A progressive upsampling strategy that alternates conv layers and upsampling operations is proposed.
To mitigate the adverse effect of upsampling in one large step as much as possible, each upsampling operation is restricted to 2×. Hence, a total of 4 operations are needed to reach the full resolution from the H/16×W/16 encoder output.
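The count of four operations follows from 16 = 2⁴. A short sketch of the resolution schedule, assuming a hypothetical 512×512 input (the function name `pup_schedule` is also just for illustration):

```python
def pup_schedule(height, width, patch=16, factor=2):
    """List the feature-map sizes from the H/16 x W/16 encoder output up to H x W."""
    sizes = [(height // patch, width // patch)]
    while sizes[-1][0] < height:
        h, w = sizes[-1]
        sizes.append((h * factor, w * factor))   # one 2x upsampling operation
    return sizes

sizes = pup_schedule(512, 512)
num_ops = len(sizes) - 1                          # number of 2x upsampling operations
print(sizes)
print(num_ops)
```

The conv layers between the upsampling steps do not change the spatial size, so they are left out of this trace.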

2.3. SETR-MLA: Multi-Level feature Aggregation (MLA)

**Multi-level feature aggregation (a variant called SETR-MLA)**

MLA is similar in spirit to the feature pyramid network (FPN). However, the decoder is fundamentally different because the feature representations Zl of every SETR layer share the same resolution, without a pyramid shape.
The encoder’s feature Zl is first reshaped from a 2D shape of HW/256×C to a 3D feature map H/16×W/16×C.
A 3-layer (kernel size 1×1, 3×3, and 3×3) network is applied with the feature channels halved at the first and third layers respectively, and the spatial resolution upscaled 4× by bilinear operation after the third layer.
A top-down aggregation design is applied via element-wise addition after the first layer. An additional 3×3 conv is applied after the addition.
The features from all the streams are fused via channel-wise concatenation, and the fused feature is then bilinearly upsampled 4× to the full resolution.
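The shape bookkeeping of one MLA stream and of the fused output can be traced with a few lines of Python. This sketch only tracks tensor shapes. The 512×512 input, C=1024, and the use of 4 streams are assumptions for illustration, the top-down additions are omitted because they do not change shapes, and the final mapping from channels to classes is not shown.

```python
def mla_stream_trace(height, width, channels, patch=16):
    """Trace (h, w, c) through one MLA stream, following the steps listed above."""
    h, w, c = height // patch, width // patch, channels
    trace = [("reshape to H/16 x W/16 x C", (h, w, c))]
    c //= 2
    trace.append(("conv 1x1, channels halved", (h, w, c)))
    trace.append(("conv 3x3", (h, w, c)))
    c //= 2
    trace.append(("conv 3x3, channels halved", (h, w, c)))
    h, w = h * 4, w * 4
    trace.append(("bilinear upsample 4x", (h, w, c)))
    return trace

def mla_output_shapes(height, width, channels, num_streams):
    h, w, c = mla_stream_trace(height, width, channels)[-1][1]
    fused = (h, w, c * num_streams)            # channel-wise concatenation of all streams
    full = (h * 4, w * 4, fused[2])            # final 4x bilinear upsampling
    return fused, full

for step, shape in mla_stream_trace(512, 512, 1024):
    print(f"{step:30s} {shape}")
fused, full = mla_output_shapes(512, 512, 1024, num_streams=4)
print("fused:", fused, "full resolution:", full)
```

The two 4× upsampling steps together recover the 16× reduction introduced by the patch grid.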

3. Ablation Studies

3.1. Model Variants

**Configuration of Transformer backbone variants.**

Two variants of the encoder: “T-Base” and “T-Large” with 12 and 24 layers respectively. “T-Large” is used as the encoder for SETR-Naïve, SETR-PUP, and SETR-MLA. SETR-Naïve-Base is denoted as the model utilizing “T-Base” in SETR-Naïve.
Besides SETR-Naïve, SETR-PUP, and SETR-MLA, a hybrid baseline Hybrid is also designed for comparison, by using a ResNet-50 based FCN encoder and feeding its output feature into SETR. This Hybrid is a combination of ResNet-50 and SETR-Naïve-Base.
The Transformer encoders are initialized with the pre-trained weights provided by ViT or DeiT.

3.2. Different Pre-Training Strategies and Backbones

**Comparing SETR variants on different pre-training strategies and backbones.**

By progressively upsampling the feature maps, SETR-PUP achieves the best performance among all the variants on Cityscapes.

The variants using “T-Large” (e.g., SETR-MLA and SETR-Naïve) are superior to their “T-Base” counterparts.
Randomly initialized SETR-PUP only gives 42.27% mIoU on Cityscapes. Model pre-trained with DeiT on ImageNet-1K gives the best performance on Cityscapes, slightly better than the counterpart pre-trained with ViT on ImageNet-21K.
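Since all comparisons here are reported in mIoU (mean intersection over union), a minimal pure-Python sketch of the metric may help. It assumes flat lists of predicted and ground-truth class indices and skips classes that appear in neither. Benchmark evaluation scripts typically also handle ignored labels and accumulate counts over the whole dataset, which this sketch does not.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU = |pred ∩ target| / |pred ∪ target| over the classes present."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1      # pixel belongs to the predicted region of class p
            union[t] += 1      # and to the ground-truth region of class t
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

# Toy example with 3 classes and 6 pixels (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)
```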

3.3. Different Pre-Training Strategies

**Comparison to FCN with different pre-training with single-scale inference on the ADE20K val and Cityscapes val set.**

With ImageNet-21k pre-training, the FCN baseline shows a clear improvement over the variant pre-trained on ImageNet-1k.

However, SETR still outperforms its FCN counterparts by a large margin, which verifies that the advantage of the proposed approach largely comes from the sequence-to-sequence modeling strategy rather than from bigger pre-training data.

4. Experimental Results

4.1. ADE20K

**State-of-the-art comparison on the ADE20K dataset.**

SETR-MLA achieves superior mIoU of 48.64% with single-scale (SS) inference. When multi-scale inference is adopted, 50.28% mIoU is obtained.

The proposed method ranks 1st place in the highly competitive ADE20K test server leaderboard.

**Qualitative results on ADE20K: SETR (right column) vs. dilated FCN baseline (left column)**

4.2. Pascal Context

**State-of-the-art comparison on the Pascal Context dataset.**

The proposed SETR significantly outperforms the dilated FCN baseline, achieving mIoU of 54.40% (SETR-PUP) and 54.87% (SETR-MLA).

SETR-MLA further improves the performance to 55.83% when multi-scale (MS) inference is adopted.

**Qualitative results on Pascal Context: SETR (right column) vs. dilated FCN baseline (left column)**

4.3. Cityscapes

**State-of-the-art comparison on the Cityscapes validation set.**

**Comparison on the Cityscapes test set.**

SETR-PUP is superior to the FCN baselines and to FCN-plus-attention approaches such as Non-local and CCNet.
Even with a smaller image size, SETR is still superior to Axial-DeepLab when multi-scale inference is adopted on the Cityscapes validation set.

Using the fine set only, the SETR model (trained for 100k iterations) outperforms Axial-DeepLab-XL by a clear margin on the test set.

**Qualitative results on Cityscapes: SETR (right column) vs. dilated FCN baseline (left column)**

Reference

[2021 CVPR] [SETR]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

