Review — SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

SETR Ranks 1st on the Highly Competitive ADE20K Test Server Leaderboard

Sik-Ho Tsang
Nov 7, 2022
Schematic illustration of the proposed SEgmentation TRansformer (SETR) (a). SETR first splits an image into fixed-size patches, linearly embeds each of them, adds position embeddings, and feeds the resulting sequence of vectors to a standard Transformer encoder. To perform pixel-wise segmentation, different decoder designs are introduced: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers,
SETR, by Fudan University, University of Oxford, University of Surrey, Tencent Youtu Lab, and Facebook AI
2021 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer, Vision Transformer, ViT

  • Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged.
  • In this paper, SEgmentation TRansformer (SETR) is proposed. It treats semantic segmentation as a sequence-to-sequence prediction task, using a pure Transformer to encode an image as a sequence of patches.


  1. SEgmentation TRansformer (SETR) Input & Encoder
  2. SETR Decoder Designs
  3. Ablation Studies
  4. Experimental Results

1. SEgmentation TRansformer (SETR) Input & Encoder

SETR Overall Framework

1.1. Input

  • A straightforward way to sequentialize an image is to flatten its pixel values into a 1D vector of size 3HW. However, the cost of a Transformer grows quadratically with the sequence length, so such a long sequence is impractical. Instead:

An image x of size H×W×3 is uniformly divided into a grid of H/16×W/16 patches, and this grid is then flattened into a sequence.
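To make the saving concrete, here is a small arithmetic sketch comparing the two sequence lengths. The 480×480 image size and the helper name `sequence_lengths` are assumptions chosen for illustration only.

```python
def sequence_lengths(height, width, patch=16):
    """Return (pixel_level_length, patch_level_length) for an H x W x 3 image."""
    pixel_level = 3 * height * width                     # flatten every pixel value: 3HW
    patch_level = (height // patch) * (width // patch)   # one token per 16x16 patch
    return pixel_level, patch_level

# Illustrative image size (an assumption for this example).
H, W = 480, 480
pixel_level, patch_level = sequence_lengths(H, W)

# Self-attention compares every token with every other token,
# so its cost scales with (sequence length)^2.
cost_ratio = pixel_level ** 2 // patch_level ** 2

print(pixel_level, patch_level)   # 691200 900
print(cost_ratio)                 # 589824
```

With these numbers the patch-level sequence is 768 times shorter, which cuts the number of pairwise attention terms by a factor of 768², about 590 thousand.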

  • Each vectorized patch p is further mapped into a latent C-dimensional embedding space using a linear projection function f: p → e, which gives a 1D sequence of patch embeddings for the image x.
  • To encode the patch spatial information, a specific embedding p_i is learnt for every location i and added to e_i to form the final sequence input E = {e_1+p_1, e_2+p_2, …, e_L+p_L}. This way, spatial information is kept despite the orderless nature of self-attention.
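The input pipeline above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the helper `patchify`, the small image and embedding sizes, and the randomly initialized `W_proj` and `pos_embed` (stand-ins for parameters that would be learned) are all assumptions.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, 3) image into an (L, patch*patch*3) sequence of flattened patches."""
    H, W, ch = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch]            # drop any ragged border
    x = x.reshape(gh, patch, gw, patch, ch)        # (gh, p, gw, p, 3)
    x = x.transpose(0, 2, 1, 3, 4)                 # (gh, gw, p, p, 3)
    return x.reshape(gh * gw, patch * patch * ch)  # row-major over the patch grid

rng = np.random.default_rng(0)
H, W, C = 64, 96, 32                       # small illustrative sizes (assumptions)
image = rng.random((H, W, 3))

patches = patchify(image)                  # (L, 768) with L = (H/16) * (W/16)
L = patches.shape[0]

# Random stand-ins for parameters that would be learned during training.
W_proj = rng.normal(scale=0.02, size=(patches.shape[1], C))   # linear projection f
pos_embed = rng.normal(scale=0.02, size=(L, C))               # one p_i per location i

e = patches @ W_proj           # patch embeddings e_i
E = e + pos_embed              # encoder input E = {e_i + p_i}
print(E.shape)                 # (24, 32)
```

Each row of `E` is one token, so the encoder sees a sequence of length L = HW/256 regardless of what the patches contain.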

1.2. Transformer

  • (Please feel free to read Transformer if interested.)
  • The Transformer encoder consists of L_e layers of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks.
  • Layer norm is applied before MSA and MLP blocks.

{Z_1, Z_2, …, Z_Le} are the features of the Transformer layers.
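For readers who want the layer written out, one common way to express a pre-norm Transformer layer is given below. It follows the ViT convention; the residual connections and the intermediate symbol Z'^l are assumptions, since the description above only states that layer norm (LN) precedes the MSA and MLP blocks.

```latex
\begin{aligned}
Z'^{\,l} &= \mathrm{MSA}\big(\mathrm{LN}(Z^{l-1})\big) + Z^{l-1}, \\
Z^{l}    &= \mathrm{MLP}\big(\mathrm{LN}(Z'^{\,l})\big) + Z'^{\,l},
\qquad l = 1, \dots, L_e, \quad Z^{0} = E .
\end{aligned}
```

Every Z^l has the same shape as the input sequence E, which is why all layers share one resolution.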

2. SETR Decoder Designs

  • Three decoder designs are proposed. The naïve one serves as the baseline.

2.1. SETR-Naïve: Naïve upsampling (Naïve)

  • A simple 2-layer network with architecture: 1×1 conv + sync batch norm (w/ ReLU) + 1×1 conv. After that, the output is simply bilinearly upsampled to the full image resolution, followed by a classification layer with pixel-wise cross-entropy loss.
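A rough NumPy sketch of this naïve head is shown below. It is an illustration under several assumptions: batch norm and the loss are omitted, the second 1×1 conv is assumed to output the class scores directly, the weights are random stand-ins, the sizes are arbitrary, and `upsample_bilinear` is a hand-written corner-aligned interpolation rather than any particular library's routine.

```python
import numpy as np

def conv1x1(x, weight):
    """A 1x1 convolution on an (h, w, c_in) map is a per-position linear map."""
    return x @ weight                               # (h, w, c_out)

def upsample_bilinear(x, out_h, out_w):
    """Separable linear interpolation along rows, then columns (corner-aligned)."""
    h, w, _ = x.shape
    rows = np.linspace(0, h - 1, out_h)
    cols = np.linspace(0, w - 1, out_w)
    x = np.apply_along_axis(lambda v: np.interp(rows, np.arange(h), v), 0, x)
    x = np.apply_along_axis(lambda v: np.interp(cols, np.arange(w), v), 1, x)
    return x                                        # (out_h, out_w, channels)

rng = np.random.default_rng(0)
H, W, C, hidden, num_classes = 64, 64, 32, 16, 5    # illustrative sizes (assumptions)
gh, gw = H // 16, W // 16

Z = rng.normal(size=(gh * gw, C))                   # stand-in for the last encoder feature
feat = Z.reshape(gh, gw, C)                         # token sequence back to a 2D grid

# Random stand-ins for learned 1x1 conv weights; batch norm is omitted here.
W1 = rng.normal(size=(C, hidden))
W2 = rng.normal(size=(hidden, num_classes))

x = np.maximum(conv1x1(feat, W1), 0.0)              # 1x1 conv + ReLU
logits = conv1x1(x, W2)                             # 1x1 conv to class scores
logits = upsample_bilinear(logits, H, W)            # 16x upsampling in a single step
pred = logits.argmax(axis=-1)                       # per-pixel class labels
print(logits.shape, pred.shape)                     # (64, 64, 5) (64, 64)
```

The single 16× jump in `upsample_bilinear` is the step that the other two decoder designs try to soften.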

2.2. SETR-PUP: Progressive UPsampling (PUP)

Progressive upsampling (resulting in a variant called SETR-PUP)
  • A progressive upsampling strategy that alternates conv layers and upsampling operations, is proposed.
  • To maximally mitigate the adversarial effect of one-step upsampling, each upsampling operation is restricted to 2×. Hence, a total of 4 operations are needed to reach the full resolution from H/16×W/16.
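The count of 4 operations follows from 2⁴ = 16. A tiny sketch of the resolution schedule, with a hypothetical helper `pup_schedule` and an illustrative 512×512 image:

```python
def pup_schedule(height, width, patch=16, factor=2):
    """List the feature-map sizes visited when upsampling by `factor` per step."""
    h, w = height // patch, width // patch
    sizes = [(h, w)]
    while h < height:
        h, w = h * factor, w * factor
        sizes.append((h, w))
    return sizes

sizes = pup_schedule(512, 512)     # illustrative image size (an assumption)
print(sizes)                       # [(32, 32), (64, 64), (128, 128), (256, 256), (512, 512)]
print(len(sizes) - 1)              # 4
```

In SETR-PUP each of these steps is paired with conv layers, which this sketch leaves out.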

2.3. SETR-MLA: Multi-Level feature Aggregation (MLA)

Multi-level feature aggregation (a variant called SETR-MLA)
  • MLA is similar in spirit to the feature pyramid network (FPN). However, the decoder is fundamentally different because the feature representations Z_l of every SETR layer share the same resolution, without a pyramid shape.
  • The encoder’s feature Z_l is first reshaped from a 2D shape of HW/256×C to a 3D feature map of H/16×W/16×C.
  • A 3-layer network (kernel sizes 1×1, 3×3, and 3×3) is applied, with the feature channels halved at the first and third layers, and the spatial resolution is upscaled 4× by bilinear interpolation after the third layer.
  • A top-down aggregation design is applied via element-wise addition after the first layer. An additional 3×3 conv is applied after the addition.
  • The features from all the streams are then fused via channel-wise concatenation, and the result is bilinearly upsampled 4× to the full resolution.
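The shape bookkeeping of the MLA steps above can be traced with a short sketch. The helper `mla_shapes` is hypothetical, and the values of 4 streams, C = 1024, and a 512×512 image are assumptions that are not stated in this review. The top-down addition and the extra 3×3 conv are left out because they are assumed not to change the shape.

```python
def mla_shapes(height, width, channels, num_streams, patch=16):
    """Trace (h, w, c) feature shapes through one MLA stream and the final fusion."""
    h, w, c = height // patch, width // patch, channels
    trace = [("reshape", (h, w, c))]          # HW/256 x C -> H/16 x W/16 x C
    c //= 2
    trace.append(("conv 1x1", (h, w, c)))     # channels halved at the first layer
    trace.append(("conv 3x3", (h, w, c)))     # second layer keeps the channel count
    c //= 2
    trace.append(("conv 3x3", (h, w, c)))     # channels halved again at the third layer
    h, w = h * 4, w * 4
    trace.append(("upsample 4x", (h, w, c)))  # bilinear, now H/4 x W/4
    c *= num_streams
    trace.append(("concat", (h, w, c)))       # channel-wise concat over all streams
    h, w = h * 4, w * 4
    trace.append(("upsample 4x", (h, w, c)))  # bilinear, back to H x W
    return trace

# Assumed values for illustration only.
for name, shape in mla_shapes(512, 512, channels=1024, num_streams=4):
    print(f"{name:12s} {shape}")
```

The two 4× upsampling steps multiply to 16×, which exactly undoes the 16×16 patching at the input.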

3. Ablation Studies

3.1. Model Variants

Configuration of Transformer backbone variants.
  • Two variants of the encoder: “T-Base” and “T-Large” with 12 and 24 layers respectively. “T-Large” is used as the encoder for SETR-Naïve, SETR-PUP, and SETR-MLA. SETR-Naïve-Base is denoted as the model utilizing “T-Base” in SETR-Naïve.
  • Besides SETR-Naïve, SETR-PUP, and SETR-MLA, a hybrid baseline Hybrid is also designed for comparison, by using a ResNet-50 based FCN encoder and feeding its output feature into SETR. This Hybrid is a combination of ResNet-50 and SETR-Naïve-Base.
  • The encoders are initialized with the pre-trained weights provided by ViT or DeiT.

3.2. Different Pre-Training Strategies and Backbones

Comparing SETR variants on different pre-training strategies and backbones.

By progressively upsampling the feature maps, SETR-PUP achieves the best performance among all the variants on Cityscapes.

  • The variants using “T-Large” (e.g., SETR-MLA and SETR-Naïve) are superior to their “T-Base” counterparts.
  • A randomly initialized SETR-PUP only gives 42.27% mIoU on Cityscapes. The model pre-trained with DeiT on ImageNet-1K gives the best performance on Cityscapes, slightly better than the counterpart pre-trained with ViT on ImageNet-21K.

3.3. Different Pre-Training Strategies

Comparison to FCN with different pre-training with single-scale inference on the ADE20K val and Cityscapes val set.
  • With ImageNet-21k pre-training, the FCN baseline shows a clear improvement over the variant pre-trained on ImageNet-1k.

However, SETR still outperforms its FCN counterparts by a large margin, verifying that the advantage of the proposed approach largely comes from the sequence-to-sequence modeling strategy rather than from bigger pre-training data.

4. Experimental Results

4.1. ADE20K

State-of-the-art comparison on the ADE20K dataset.
  • SETR-MLA achieves superior mIoU of 48.64% with single-scale (SS) inference. When multi-scale inference is adopted, 50.28% mIoU is obtained.

The proposed method ranks 1st on the highly competitive ADE20K test server leaderboard.

Qualitative results on ADE20K: SETR (right column) vs. dilated FCN baseline (left column)

4.2. Pascal Context

State-of-the-art comparison on the Pascal Context dataset.

The proposed SETR significantly outperforms the dilated FCN baseline, achieving mIoU of 54.40% (SETR-PUP) and 54.87% (SETR-MLA).

  • SETR-MLA further improves the performance to 55.83% when multi-scale (MS) inference is adopted.
Qualitative results on Pascal Context: SETR (right column) vs. dilated FCN baseline (left column)

4.3. Cityscapes

State-of-the-art comparison on the Cityscapes validation set.
Comparison on the Cityscapes test set.

SETR-PUP is superior to the FCN baselines and to FCN-plus-attention approaches such as Non-local and CCNet [24].

SETR, using a smaller image size, is still superior to Axial-DeepLab when multi-scale inference is adopted on the Cityscapes validation set.

  • Using the fine set only, the SETR model (trained for 100k iterations) outperforms Axial-DeepLab-XL by a clear margin on the test set.
Qualitative results on Cityscapes: SETR (right column) vs. dilated FCN baseline (left column)


[2021 CVPR] [SETR]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
