Review: Swin Transformer

Using shifted windows to limit self-attention to local windows while maintaining cross-window connections

Sik-Ho Tsang
8 min read · Feb 22, 2022
Swin Transformer vs ViT

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Swin Transformer, by Microsoft Research Asia
2021 ICCV, Over 750 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Object Detection, Vision Transformer, ViT, Transformer

  • A hierarchical Transformer is proposed, whose representation is computed with Shifted windows (Swin).
  • The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
  • This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

Outline

  1. Swin Transformer
  2. Shifted Window Based Self-Attention
  3. Architecture Variants
  4. SOTA Comparison
  5. Ablation Study
  6. Swin-Mixer

1. Swin Transformer

(a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks

1.1. Input

  • Swin Transformer (Swin-T in the figure above, where T stands for Tiny) first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT.
  • Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. A patch size of 4×4 is used and thus the feature dimension of each patch is 4×4×3=48. A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension C.
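A minimal sketch of this patch splitting and embedding step in PyTorch (an assumption here is that the 4×4 split and the linear embedding are fused into a single strided convolution, a common equivalent formulation; module and variable names are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and project each to dim C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # A 4x4 conv with stride 4 is equivalent to flattening each 4x4x3 patch
        # (48 raw values) and applying a shared linear embedding layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C) patch tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]) -- 56x56 tokens of dimension C=96
```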

1.2. Stage 1

  • Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens. The Transformer blocks maintain the number of tokens (H/4×W/4), and together with the linear embedding are referred to as “Stage 1”.

1.3. Stage 2

  • To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2×2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (a 2× downsampling of resolution).
  • The output dimension is set to 2C, and the resolution becomes H/8×W/8. This first block of patch merging and feature transformation is denoted as “Stage 2”.
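A sketch of such a patch merging layer (the class name and the norm-before-linear ordering are assumptions, roughly following common implementations):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring patches (C -> 4C),
    then project to 2C, halving the spatial resolution."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                 # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four patches of every 2x2 neighborhood.
        x0 = x[:, 0::2, 0::2, :]                # (B, H/2, W/2, C)
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], -1)     # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)                # (B, H/2 * W/2, 4C)
        return self.reduction(self.norm(x))     # (B, H/2 * W/2, 2C)

out = PatchMerging(96)(torch.randn(1, 56 * 56, 96), 56, 56)
print(out.shape)  # torch.Size([1, 784, 192]) -- 28x28 tokens of dimension 2C
```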

1.4. Stage 3 & Stage 4

  • The procedure is repeated twice, as “Stage 3” and “Stage 4”, with output resolutions of H/16×W/16 and H/32×W/32, respectively.
  • These stages jointly produce a hierarchical representation with the same feature map resolutions as those of typical convolutional networks, such as VGGNet and ResNet, so the architecture can conveniently replace the backbone networks in existing methods for various vision tasks.

1.5. Swin Transformer Block

  • As illustrated in Figure (b), a Swin Transformer block consists of a shifted window based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between.
  • A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.
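A skeleton of this block wiring (a sketch only: `attn` stands in for the window-based attention module described in Section 2, and the identity function below exists just to make the skeleton runnable):

```python
import torch
import torch.nn as nn

class SwinBlockSkeleton(nn.Module):
    """Pre-norm block: LN -> (S)W-MSA -> residual, then LN -> 2-layer MLP (GELU) -> residual."""
    def __init__(self, dim, attn, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn                        # W-MSA or SW-MSA module (see Section 2)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                       # x: (B, N, dim)
        x = x + self.attn(self.norm1(x))        # residual after the attention module
        x = x + self.mlp(self.norm2(x))         # residual after the MLP
        return x

identity_attn = lambda t: t                     # placeholder so the example runs
y = SwinBlockSkeleton(96, identity_attn)(torch.randn(1, 49, 96))
print(y.shape)  # torch.Size([1, 49, 96])
```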

2. Shifted Window Based Self-Attention

An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture

2.1. Window Based Self-Attention (W-MSA)

Self-attention is computed within local windows.

  • Supposing each window contains M×M patches, the computational complexities of a global MSA module and a window based one on an image of h×w patches are:

    Ω(MSA) = 4hwC² + 2(hw)²C,
    Ω(W-MSA) = 4hwC² + 2M²hwC,

  • where the former is quadratic in the patch number hw, and the latter is linear when M is fixed (set to 7 by default).

Global self-attention computation (in standard ViT) is generally unaffordable for a large hw, while the window based self-attention is scalable.
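As a quick numeric check of the two formulas above (plain Python; the 56×56 token grid and C=96 correspond to Stage 1 of Swin-T on a 224² input):

```python
# Omega(MSA)   = 4 h w C^2 + 2 (h w)^2 C   -- quadratic in the number of tokens hw
# Omega(W-MSA) = 4 h w C^2 + 2 M^2 h w C   -- linear in hw for a fixed window size M
def msa_flops(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M=7):
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56   # Stage-1 token grid for a 224x224 input (224 / 4)
C = 96       # Swin-T embedding dimension
print(f"global MSA : {msa_flops(h, w, C) / 1e9:.2f} G mult-adds")
print(f"W-MSA (M=7): {wmsa_flops(h, w, C) / 1e9:.2f} G mult-adds")
# global MSA : 2.00 G mult-adds
# W-MSA (M=7): 0.15 G mult-adds
```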

2.2. Shifted Window Partitioning in Successive Blocks (SW-MSA)

  • The window-based self-attention module lacks connections across windows, which limits its modeling power.
  • A shifted window partitioning approach is proposed which alternates between two partitioning configurations in consecutive Swin Transformer blocks.

The first module uses a regular window partitioning strategy.

Then, the next module adopts a windowing configuration that is shifted from that of the preceding layer, by displacing the windows by (⌊M/2⌋, ⌊M/2⌋) pixels from the regularly partitioned windows.

  • Thus, consecutive Swin Transformer blocks are computed as:

    ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1),
    z^l = MLP(LN(ẑ^l)) + ẑ^l,
    ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l,
    z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1),

  • where ẑ^l and z^l denote the output features of the (S)W-MSA module and the MLP module of block l, respectively, and z^(l−1) is the output of the previous block.
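A sketch of the regular window partitioning and of how consecutive blocks alternate between the two configurations (the function name and shapes are illustrative of the idea, not the authors' exact API):

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) token map into non-overlapping MxM windows,
    so self-attention is computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)  # (B*num_windows, M*M, C)

M = 7
windows = window_partition(torch.randn(1, 56, 56, 96), M)
print(windows.shape)  # torch.Size([64, 49, 96]) -- 8x8 windows of 7x7 tokens each

# Consecutive blocks alternate the partitioning: even-indexed blocks use the
# regular grid, odd-indexed blocks displace the grid by floor(M/2) tokens.
shift_sizes = [0 if i % 2 == 0 else M // 2 for i in range(6)]  # e.g. Stage 3 of Swin-T
print(shift_sizes)  # [0, 3, 0, 3, 0, 3]
```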

2.3. Efficient Batch Computation for Shifted Configuration (Cyclic Shift)

Illustration of an efficient batch computation approach for self-attention in shifted window partitioning
  • An issue with shifted window partitioning is that it results in more windows, from ⌈h/M⌉×⌈w/M⌉ to (⌈h/M⌉+1)×(⌈w/M⌉+1), and some of them are smaller than M×M.
  • A more efficient batch computation approach is proposed by cyclic-shifting toward the top-left direction, as shown above.
  • With the cyclic-shift, the number of batched windows remains the same as that of regular window partitioning, and thus is also efficient.
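A minimal sketch of the cyclic shift and its reversal (assuming PyTorch's `torch.roll`; the attention masking inside the shifted windows is omitted):

```python
import torch

M, s = 7, 3                        # window size and shift size floor(M/2)
x = torch.randn(1, 56, 56, 96)     # (B, H, W, C) token map

# Cyclically shift toward the top-left, then reuse the regular window partition;
# windows that now mix tokens from opposite borders are handled by masking the
# attention, so the window count stays the same as with regular partitioning.
shifted = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
# ... window_partition(shifted, M), masked self-attention within each window ...
restored = torch.roll(shifted, shifts=(s, s), dims=(1, 2))   # reverse cyclic shift

print(torch.equal(restored, x))    # True -- the shift is exactly invertible
```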

2.4. Relative Position Bias (Rel. Pos.)

  • A relative position bias B is added to each head when computing similarity: Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V.
  • This brings significant improvements over counterparts without the bias term or with absolute position embedding.
  • Further adding absolute position embedding to the input drops performance slightly.
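A sketch of how such a bias can be constructed for a single M×M window: a learnable table with (2M−1)² entries per head, indexed by the pairwise relative (row, col) offsets of the tokens (names and the zero initialization are assumptions):

```python
import torch
import torch.nn as nn

M, num_heads = 7, 3
# Learnable table: one bias per possible relative offset, per head.
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

# Pairwise relative (row, col) offsets between all M*M token positions.
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                              # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]           # (2, M*M, M*M)
rel = rel.permute(1, 2, 0) + (M - 1)                    # shift offsets to be >= 0
index = rel[..., 0] * (2 * M - 1) + rel[..., 1]         # (M*M, M*M) table indices

B = bias_table[index.reshape(-1)].view(M * M, M * M, num_heads)
B = B.permute(2, 0, 1)                                  # (num_heads, M*M, M*M)
print(B.shape)  # torch.Size([3, 49, 49])
# B is added to the attention logits of each head:
# Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V
```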

3. Architecture Variants

  • The base model, called Swin-B, has a model size and computational complexity similar to ViT-B/DeiT-B.
  • Swin-T, Swin-S and Swin-L are versions of about 0.25×, 0.5× and 2× the model size and computational complexity, respectively. The complexity of Swin-T and Swin-S is similar to that of ResNet-50 (DeiT-S) and ResNet-101, respectively.
  • The window size is set to M=7 by default. The query dimension of each head is d=32, and the expansion layer of each MLP is α=4, for all experiments. The architecture hyper-parameters of these model variants are:
  1. Swin-T: C=96, layer numbers = {2, 2, 6, 2}
  2. Swin-S: C=96, layer numbers = {2, 2, 18, 2}
  3. Swin-B: C=128, layer numbers = {2, 2, 18, 2}
  4. Swin-L: C=192, layer numbers = {2, 2, 18, 2}
  • where C is the channel number of the hidden layers in the first stage.
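The same hyper-parameters written out as a small configuration table (a sketch; the stage-wise doubling of the channel width follows the stage descriptions in Section 1):

```python
# C = channel width of Stage 1; depths = number of Swin Transformer blocks per stage.
swin_variants = {
    "Swin-T": dict(C=96,  depths=(2, 2, 6, 2)),
    "Swin-S": dict(C=96,  depths=(2, 2, 18, 2)),
    "Swin-B": dict(C=128, depths=(2, 2, 18, 2)),
    "Swin-L": dict(C=192, depths=(2, 2, 18, 2)),
}
for name, cfg in swin_variants.items():
    widths = [cfg["C"] * 2 ** i for i in range(4)]   # channel width doubles each stage
    print(f'{name}: depths={cfg["depths"]}, stage widths={widths}')
```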

4. SOTA Comparison

4.1. Image Classification on ImageNet

Comparison of different backbones on ImageNet-1K classification

4.1.1. ImageNet-1K Training

Swin Transformers noticeably surpass the counterpart DeiT architectures with similar complexities: +1.5% for Swin-T (81.3%) over DeiT-S (79.8%) using 224² input, and +1.5%/1.4% for Swin-B (83.3%/84.5%) over DeiT-B (81.8%/83.1%) using 224²/384² input, respectively.

  • Compared with the state-of-the-art ConvNets, i.e. RegNet and EfficientNet, the Swin Transformer achieves a slightly better speed-accuracy trade-off.

4.1.2. ImageNet-22K Pretraining

  • For Swin-B, the ImageNet-22K pre-training brings 1.8%~1.9% gains over training on ImageNet-1K from scratch.

Swin Transformer models achieve significantly better speed-accuracy trade-offs: Swin-B obtains 86.4% top-1 accuracy, which is 2.4% higher than that of ViT with similar inference throughput (84.7 vs. 85.9 images/sec) and slightly lower FLOPs (47.0G vs. 55.4G).

  • The larger Swin-L model achieves 87.3% top-1 accuracy, +0.9% better than that of the Swin-B model.

4.2. Object Detection on COCO

Results on COCO object detection and instance segmentation
  • Swin-T architecture brings consistent +3.4~4.2 box AP gains over ResNet-50, with slightly larger model size, FLOPs and latency.

Swin Transformer achieves a high detection accuracy of 51.9 box AP and 45.0 mask AP, significant gains of +3.6 box AP and +3.3 mask AP over ResNeXt101-64×4d, which has similar model size, FLOPs and latency.

  • The results of Swin-T are +2.5 box AP and +2.3 mask AP higher than DeiT-S with similar model size (86M vs. 80M) and significantly higher inference speed (15.3 FPS vs. 10.4 FPS).

The best model achieves 58.7 box AP and 51.1 mask AP on COCO test-dev, surpassing the previous best results by +2.7 box AP (Copy-paste [26] without external data) and +2.6 mask AP (DetectoRS [46]).

4.3. Semantic Segmentation on ADE20K

Results of semantic segmentation on the ADE20K val and test set
  • UPerNet is used as the framework.
  • Swin-S is +5.3 mIoU higher (49.3 vs. 44.0) than DeiT-S with similar computation cost. It is also +4.4 mIoU higher than ResNet-101, and +2.4 mIoU higher than ResNeSt-101.

Swin-L model with ImageNet-22K pre-training achieves 53.5 mIoU on the val set, surpassing the previous best model by +3.2 mIoU (50.3 mIoU by SETR [81] which has a larger model size).

5. Ablation Study

Ablation study on the shifted windows approach and different position embedding methods on three benchmarks, using the Swin-T architecture
  • Swin-T with the shifted window partitioning outperforms the counterpart built on a single window partitioning at each stage by +1.1% top-1 accuracy on ImageNet-1K, +2.8 box AP/+2.2 mask AP on COCO, and +2.8 mIoU on ADE20K.
  • Swin-T with relative position bias yields +1.2%/+0.8% top-1 accuracy on ImageNet-1K, +1.3/+1.5 box AP and +1.1/+1.3 mask AP on COCO, and +2.3/+2.9 mIoU on ADE20K in relation to those without position encoding and with absolute position embedding, respectively.
Real speed of different self-attention computation methods and implementations on a V100 GPU
  • The cyclic implementation is more hardware efficient than naive padding, particularly for deeper stages. Overall, it brings a 13%, 18% and 18% speed-up on Swin-T, Swin-S and Swin-B, respectively.
Accuracy of Swin Transformer using different methods for self-attention computation on three benchmarks
  • Swin Transformer architectures are slightly faster, while achieving +2.3% top-1 accuracy compared to Performer [14] on ImageNet-1K using Swin-T.

6. Swin-Mixer

Performance of Swin MLP-Mixer on ImageNet-1K classification
  • There are many more experiments in the appendix. One of them applies the proposed hierarchical design and the shifted window approach to the MLP-Mixer, referred to as Swin-Mixer.
  • Swin-Mixer achieves a better speed-accuracy trade-off than ResMLP.
