Review — SETR: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
SETR Ranks 1st Place in the Highly Competitive ADE20K Test Server Leaderboard
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers,
SETR, by Fudan University, University of Oxford, University of Surrey, Tencent Youtu Lab, and Facebook AI
2021 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Semantic Segmentation, Transformer, Vision Transformer, ViT
- Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged.
- In this paper, SEgmentation TRansformer (SETR) is proposed, which treats semantic segmentation as a sequence-to-sequence prediction task, via a pure Transformer, to encode an image as a sequence of patches.
Outline
- SEgmentation TRansformer (SETR) Input & Encoder
- SETR Decoder Designs
- Ablation Studies
- Experimental Results
1. SEgmentation TRansformer (SETR) Input & Encoder
1.1. Input
- A straightforward way for image sequentialization is to flatten the image pixel values into a 1D vector of size 3HW. However, this would make the Transformer's quadratic complexity intractable at pixel level. Instead:
An image x of size H×W×3 is uniformly divided into a grid of H/16×W/16 patches, and this grid is then flattened into a sequence.
- By further mapping each vectorized patch p into a latent C-dimensional embedding space using a linear projection function f: p→e, a 1D sequence of patch embeddings for an image x is obtained.
- To encode the patch spatial information, a specific embedding pi is learnt for every location i, which is added to ei to form the final sequence input E={e1+p1, e2+p2, …, eL+pL}. This way, spatial information is kept despite the orderless self-attention nature of Transformers (a minimal sketch of this input step is given below).
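Below is a minimal PyTorch sketch of this sequentialization step, not the authors' released code: the 16×16 patchify is implemented as a strided convolution, which is equivalent to flattening each patch and applying the linear projection f: p→e. The class name PatchEmbedding and the default embed_dim=1024 (the "T-Large" width) are assumptions.

```python
# Minimal sketch of SETR's input sequentialization (assumed PyTorch layout, not the
# authors' released code). The 16x16 patchify is done with a strided conv, which is
# equivalent to flattening each patch and applying the linear projection f: p -> e.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=512, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2           # L = HW / 256
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # one learnable position embedding p_i per patch location i
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, 3, H, W)
        e = self.proj(x)                     # (B, C, H/16, W/16)
        e = e.flatten(2).transpose(1, 2)     # (B, L, C): patch embeddings e_i
        return e + self.pos_embed            # E = {e_1 + p_1, ..., e_L + p_L}
```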
1.2. Transformer
- (Please feel free to read Transformer if interested.)
- The Transformer encoder consists of Le layers of multi-head self-attention (MSA) and Multilayer Perceptron (MLP) blocks.
- Layer norm is applied before MSA and MLP blocks.
- {Z1, Z2, …, ZLe} are the features of the Transformer layers.
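For concreteness, here is a hedged sketch of one such pre-norm encoder layer; the head count and MLP ratio are common ViT-style defaults and are assumptions rather than values taken from the paper.

```python
# Sketch of one pre-norm Transformer encoder layer (layer norm before MSA and MLP),
# stacked Le times in SETR. Head count and MLP ratio are assumed ViT-style defaults.
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, dim=1024, num_heads=16, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                            # pre-norm for MSA
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)                            # pre-norm for MLP
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):                                         # z: (B, L, C)
        y = self.norm1(z)
        z = z + self.attn(y, y, y, need_weights=False)[0]         # residual MSA
        z = z + self.mlp(self.norm2(z))                           # residual MLP
        return z
```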
2. SETR Decoder Designs
- 3 decoder designs are proposed. The naïve one serves as the baseline.
2.1. SETR-Naïve: Naïve upsampling (Naïve)
- A simple 2-layer network with architecture: 1×1 conv + sync batch norm (w/ ReLU) + 1×1 conv. After that, the output is simply bilinearly upsampled to the full image resolution, followed by a classification layer with pixel-wise cross-entropy loss.
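A hedged sketch of such a naive head is given below; the intermediate width of 256 channels and the use of plain BatchNorm2d (standing in for sync batch norm in a single-GPU example) are assumptions.

```python
# Hedged sketch of the Naive head: 1x1 conv + batch norm (w/ ReLU) + 1x1 conv, then a
# bilinear upsample to the full image resolution. The 256-channel width is an assumption,
# and BatchNorm2d stands in for sync batch norm in this single-GPU example.
import torch.nn as nn
import torch.nn.functional as F

class NaiveHead(nn.Module):
    def __init__(self, in_dim=1024, mid_dim=256, num_classes=150):
        super().__init__()
        self.conv1 = nn.Conv2d(in_dim, mid_dim, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid_dim)        # sync batch norm in the paper
        self.conv2 = nn.Conv2d(mid_dim, num_classes, kernel_size=1)

    def forward(self, z, out_size):              # z: (B, L, C) from the last encoder layer
        b, l, c = z.shape
        h = w = int(l ** 0.5)
        z = z.transpose(1, 2).reshape(b, c, h, w)          # back to a 2D feature map
        z = self.conv2(F.relu(self.bn(self.conv1(z))))     # 2-layer 1x1 conv network
        return F.interpolate(z, size=out_size,             # upsample to H x W logits
                             mode='bilinear', align_corners=False)
```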
2.2. SETR-PUP: Progressive UPsampling (PUP)
- A progressive upsampling strategy that alternates conv layers and upsampling operations is proposed.
- To maximally mitigate the adversarial effect of one-step upsampling, upsampling is restricted to 2× at each step. Hence, a total of 4 operations are needed to reach the full resolution (a sketch of this head follows below).
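The following is a minimal sketch of such a progressive upsampling head under assumed channel widths; it is not the released configuration, only an illustration of the alternating conv + 2× upsample pattern.

```python
# Minimal sketch of the PUP head: four stages, each a conv followed by a 2x bilinear
# upsample, taking the H/16 x W/16 map to full resolution; channel widths are assumptions.
import torch.nn as nn

def pup_head(in_dim=1024, mid_dim=256, num_classes=150):
    def stage(cin, cout):
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size=3, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        )
    return nn.Sequential(                                  # input: (B, C, H/16, W/16)
        stage(in_dim, mid_dim),                            # -> H/8
        stage(mid_dim, mid_dim),                           # -> H/4
        stage(mid_dim, mid_dim),                           # -> H/2
        stage(mid_dim, mid_dim),                           # -> H
        nn.Conv2d(mid_dim, num_classes, kernel_size=1),    # per-pixel class logits
    )
```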
2.3. SETR-MLA: Multi-Level feature Aggregation (MLA)
- MLA has a similar spirit to the feature pyramid network (FPN). However, the decoder is fundamentally different because the feature representations Zl of every SETR layer share the same resolution, without a pyramid shape.
- The encoder’s feature Zl is first reshaped from a 2D shape of HW/256×C to a 3D feature map H/16×W/16×C.
- A 3-layer (kernel size 1×1, 3×3, and 3×3) network is applied with the feature channels halved at the first and third layers respectively, and the spatial resolution upscaled 4× by bilinear operation after the third layer.
- A top-down aggregation design is applied via element-wise addition after the first layer. An additional 3×3 conv is applied hereafter.
- The features from all the streams are then fused via channel-wise concatenation and bilinearly upsampled 4× to the full resolution (a sketch of an MLA-style head is given below).
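Below is a hedged sketch of an MLA-style head with M = 4 streams; the channel widths, the direction of the top-down addition, and the placement of the final classification conv are assumptions made to keep the example short, not the authors' exact configuration.

```python
# Hedged sketch of an MLA-style head with M = 4 streams. Per stream: a 1x1 conv halving
# the channels, a top-down element-wise addition, an extra 3x3 conv, two more 3x3 convs
# (halving the channels again), and a 4x bilinear upsample; the streams are then
# concatenated channel-wise and upsampled 4x to full resolution. Widths and the
# direction of the top-down addition are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLAHead(nn.Module):
    def __init__(self, in_dim=1024, num_classes=150, num_streams=4):
        super().__init__()
        half, quarter = in_dim // 2, in_dim // 4
        self.reduce = nn.ModuleList(       # layer 1: 1x1 conv, channels halved
            [nn.Conv2d(in_dim, half, 1) for _ in range(num_streams)])
        self.refine = nn.ModuleList(       # additional 3x3 conv after the top-down addition
            [nn.Conv2d(half, half, 3, padding=1) for _ in range(num_streams)])
        self.out = nn.ModuleList(          # layers 2-3: 3x3 convs, channels halved again
            [nn.Sequential(nn.Conv2d(half, half, 3, padding=1),
                           nn.Conv2d(half, quarter, 3, padding=1))
             for _ in range(num_streams)])
        self.cls = nn.Conv2d(quarter * num_streams, num_classes, 1)

    def forward(self, feats, out_size):    # feats: list of M maps, each (B, C, H/16, W/16)
        reduced = [r(f) for r, f in zip(self.reduce, feats)]
        for i in range(len(reduced) - 2, -1, -1):          # top-down aggregation
            reduced[i] = reduced[i] + reduced[i + 1]
        streams = []
        for i, z in enumerate(reduced):
            z = self.out[i](self.refine[i](z))
            z = F.interpolate(z, scale_factor=4, mode='bilinear', align_corners=False)
            streams.append(z)                              # each stream now at H/4 x W/4
        fused = torch.cat(streams, dim=1)                  # channel-wise concatenation
        logits = self.cls(fused)
        return F.interpolate(logits, size=out_size, mode='bilinear', align_corners=False)
```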
3. Ablation Studies
3.1. Model Variants
- Two variants of the encoder are considered: "T-Base" and "T-Large", with 12 and 24 layers respectively. "T-Large" is used as the encoder for SETR-Naïve, SETR-PUP, and SETR-MLA. SETR-Naïve-Base denotes the model using "T-Base" in SETR-Naïve.
- Besides SETR-Naïve, SETR-PUP, and SETR-MLA, a hybrid baseline (Hybrid) is also designed for comparison, which uses a ResNet-50 based FCN encoder and feeds its output feature into SETR. This Hybrid is a combination of ResNet-50 and SETR-Naïve-Base.
- Pre-trained weights provided by ViT or DeiT are used for initialization.
3.2. Different Pre-Training Strategies and Backbones
By progressively upsampling the feature maps, SETR-PUP achieves the best performance among all the variants on Cityscapes.
- The variants using “T-Large” (e.g., SETR-MLA and SETR-Naïve) are superior to their “T-Base” counterparts.
- Randomly initialized SETR-PUP only gives 42.27% mIoU on Cityscapes. Model pre-trained with DeiT on ImageNet-1K gives the best performance on Cityscapes, slightly better than the counterpart pre-trained with ViT on ImageNet-21K.
3.3. Different Pre-Training Strategies
- With ImageNet-21K pre-training, the FCN baseline shows a clear improvement over the variant pre-trained on ImageNet-1K.
However, the SETR method outperforms the FCN counterparts by a large margin, verifying that the advantage of the proposed approach largely comes from the proposed sequence-to-sequence modeling strategy rather than from larger pre-training data.
4. Experimental Results
4.1. ADE20K
- SETR-MLA achieves superior mIoU of 48.64% with single-scale (SS) inference. When multi-scale inference is adopted, 50.28% mIoU is obtained.
The proposed method ranks 1st place in the highly competitive ADE20K test server leaderboard.
4.2. Pascal Context
The proposed SETR significantly outperforms the dilated FCN baseline, achieving mIoU of 54.40% (SETR-PUP) and 54.87% (SETR-MLA).
- SETR-MLA further improves the performance to 55.83% when multi-scale (MS) inference is adopted.
4.3. Cityscapes
SETR-PUP is superior to FCN baselines and to FCN-plus-attention approaches such as Non-local and CCNet [24].
SETR, using a smaller image size, is still superior to Axial-DeepLab when multi-scale inference is adopted on the Cityscapes validation set.
- Using the fine set only, the SETR model (trained with 100k iterations) outperforms Axial-DeepLab-XL by a clear margin on the test set.
Reference
[2021 CVPR] [SETR]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
1.6. Semantic Segmentation / Scene Parsing
2015 … 2021 [PVT, PVTv1] [SETR] 2022 [PVTv2]