Review: Swin Transformer

Using shifted windows, self-attention is limited to local windows while cross-window connections are maintained.

Swin Transformer vs ViT
  • A hierarchical Transformer is proposed, whose representation is computed with Shifted windows (Swin).
  • The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
  • This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.


  1. Swin Transformer
  2. Shifted Window Based Self-Attention
  3. Architecture Variants
  4. SOTA Comparison
  5. Ablation Study
  6. Swin-Mixer

1. Swin Transformer

(a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks

1.1. Input

  • Swin Transformer (Swin-T for the above image, T for Tiny) first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT.
  • Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. A patch size of 4×4 is used and thus the feature dimension of each patch is 4×4×3=48. A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension C.
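As a rough sketch (not the official implementation), the patch splitting and linear embedding can be reproduced with plain NumPy reshapes; the random projection matrix below stands in for the learned embedding layer:

```python
import numpy as np

# Toy sketch: split an H×W×3 image into non-overlapping 4×4 patches
# and project each 48-dim patch feature to C channels.
H, W, C = 224, 224, 96          # Swin-T uses C = 96
patch = 4

rng = np.random.default_rng(0)
img = rng.standard_normal((H, W, 3))

# Partition: (H/4, 4, W/4, 4, 3) -> (H/4 * W/4, 4*4*3)
patches = img.reshape(H // patch, patch, W // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)
print(patches.shape)            # (3136, 48) = (56*56, 48)

# "Linear embedding": a random projection standing in for the learned
# linear layer that maps the 48-dim raw feature to C dimensions.
W_embed = rng.standard_normal((patch * patch * 3, C))
tokens = patches @ W_embed
print(tokens.shape)             # (3136, 96)
```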

1.2. Stage 1

  • Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens. The Transformer blocks maintain the number of tokens (H/4×W/4), and together with the linear embedding are referred to as “Stage 1”.

1.3. Stage 2

  • To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2×2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (a 2× downsampling of resolution).
  • The output dimension is set to 2C, and the subsequent Swin Transformer blocks keep the resolution at H/8×W/8. This first block of patch merging and feature transformation is denoted as “Stage 2”.
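A minimal sketch of the patch-merging step (assumed shapes, random weights standing in for the learned linear reduction): each 2×2 group of neighboring tokens is concatenated (C → 4C), then a linear layer reduces 4C → 2C, halving the token grid in each dimension:

```python
import numpy as np

# Stage-1 token grid for a 224×224 input: 56×56 tokens of dim C = 96
Hp, Wp, C = 56, 56, 96
rng = np.random.default_rng(0)
x = rng.standard_normal((Hp, Wp, C))

# Group 2×2 neighbors: (Hp/2, 2, Wp/2, 2, C) -> (Hp/2, Wp/2, 4C)
merged = x.reshape(Hp // 2, 2, Wp // 2, 2, C)
merged = merged.transpose(0, 2, 1, 3, 4).reshape(Hp // 2, Wp // 2, 4 * C)

# Linear reduction 4C -> 2C (random weights stand in for the learned layer)
W_reduce = rng.standard_normal((4 * C, 2 * C))
y = merged @ W_reduce
print(y.shape)      # (28, 28, 192): half the resolution, double the channels
```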

1.4. Stage 3 & Stage 4

  • The procedure is repeated twice, as “Stage 3” and “Stage 4”, with output resolutions of H/16×W/16 and H/32×W/32, respectively.
  • These stages jointly produce a hierarchical representation, with the same feature map resolutions as those of typical convolutional networks, such as VGGNet and ResNet, which can conveniently replace the backbone networks in existing methods for various vision tasks.

1.5. Swin Transformer Block

  • As illustrated in Figure (b), a Swin Transformer block consists of a shifted window based MSA module, followed by a 2-layer MLP with GELU nonlinearity in between.
  • A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module.

2. Shifted Window Based Self-Attention

An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture

2.1. Window Based Self-Attention (W-MSA)

  • Supposing each window contains M×M patches, the computational complexities of a global MSA module and a window based one on an image of h×w patches are:

    Ω(MSA) = 4hwC² + 2(hw)²C
    Ω(W-MSA) = 4hwC² + 2M²hwC

  • where the former is quadratic in the patch number hw, and the latter is linear when M is fixed (set to 7 by default).
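Plugging numbers into the two complexity formulas makes the scaling behavior concrete; doubling the token grid in each dimension multiplies the W-MSA cost by exactly 4, while the global-MSA cost grows nearly 16×:

```python
# The two complexity formulas from the paper:
#   Ω(MSA)   = 4·h·w·C² + 2·(h·w)²·C     (quadratic in token count h·w)
#   Ω(W-MSA) = 4·h·w·C² + 2·M²·h·w·C     (linear in h·w for fixed M)
def msa_flops(h, w, C):
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M=7):
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

C = 96
for h in (56, 112):             # 56×56 grid, then doubled to 112×112
    print(h, msa_flops(h, h, C) / 1e9, wmsa_flops(h, h, C) / 1e9)
```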

2.2. Shifted Window Partitioning in Successive Blocks (SW-MSA)

  • The window-based self-attention module lacks connections across windows, which limits its modeling power.
  • A shifted window partitioning approach is proposed which alternates between two partitioning configurations in consecutive Swin Transformer blocks.
  • Thus, consecutive Swin Transformer blocks are computed as:

    ẑˡ = W-MSA(LN(zˡ⁻¹)) + zˡ⁻¹
    zˡ = MLP(LN(ẑˡ)) + ẑˡ
    ẑˡ⁺¹ = SW-MSA(LN(zˡ)) + zˡ
    zˡ⁺¹ = MLP(LN(ẑˡ⁺¹)) + ẑˡ⁺¹

  • where ẑˡ and zˡ denote the output features of the (S)W-MSA module and the MLP module of block l, respectively, and zˡ⁻¹ is the output from the previous block.
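The wiring of two successive blocks (LayerNorm before each module, residual connection after each) can be sketched with placeholder modules; `w_msa`, `sw_msa` and `mlp` below are identity-like stubs, not real attention, so this only illustrates the block structure:

```python
import numpy as np

def layer_norm(x):
    # Simplified LN without learned scale/shift
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def w_msa(x):  return x                   # placeholder: window attention
def sw_msa(x): return x                   # placeholder: shifted-window attention
def mlp(x):    return np.maximum(x, 0)    # placeholder for the 2-layer MLP

def two_blocks(z):
    z_hat = w_msa(layer_norm(z)) + z        # W-MSA block, pre-norm + residual
    z     = mlp(layer_norm(z_hat)) + z_hat
    z_hat = sw_msa(layer_norm(z)) + z       # SW-MSA block, same wiring
    z     = mlp(layer_norm(z_hat)) + z_hat
    return z
```

Token shape is preserved throughout, which is why the blocks can be stacked freely within a stage.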

2.3. Efficient Batch Computation for Shifted Configuration (Cyclic Shift)

Illustration of an efficient batch computation approach for self-attention in shifted window partitioning
  • An issue with shifted window partitioning is that it results in more windows, from ⌈h/M⌉×⌈w/M⌉ to (⌈h/M⌉+1)×(⌈w/M⌉+1), some of which are smaller than M×M.
  • A more efficient batch computation approach is proposed by cyclic-shifting toward the top-left direction, as shown above.
  • With the cyclic-shift, the number of batched windows remains the same as that of regular window partitioning, and thus is also efficient.
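A small demonstration of the idea (assuming an 8×8 token grid with M=4 and a shift of M/2=2): `np.roll` moves the grid toward the top-left, after which the ordinary regular window partition applies, so the window count stays (8/4)² = 4 rather than the 3×3 = 9 windows a naive shifted partition would create:

```python
import numpy as np

M, shift = 4, 2
grid = np.arange(64).reshape(8, 8)        # token indices on an 8×8 grid

# Cyclic shift toward the top-left
shifted = np.roll(grid, (-shift, -shift), axis=(0, 1))

# Regular M×M window partition of the shifted grid
wins = shifted.reshape(8 // M, M, 8 // M, M).transpose(0, 2, 1, 3)
wins = wins.reshape(-1, M, M)
print(wins.shape)   # (4, 4, 4): still 4 windows of 4×4 tokens
```

In the actual method, an attention mask keeps tokens that were not adjacent before the roll from attending to each other; that mask is omitted in this sketch.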

2.4. Relative Position Bias (Rel. Pos.)

  • A relative position bias B is added to each head when computing similarity: Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V.
  • This brings significant improvements over counterparts without the bias term or with absolute position embedding instead.
  • Further adding absolute position embedding to the input drops performance slightly.
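Since relative coordinates along each axis of an M×M window lie in [−(M−1), M−1], the learned bias table has (2M−1)² entries, and B is gathered from it per (query, key) pair. A sketch of that indexing:

```python
import numpy as np

M = 7                                                 # default window size
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
coords = coords.reshape(2, -1)                        # (2, M²) token coords

rel = coords[:, :, None] - coords[:, None, :]         # (2, M², M²) rel. coords
rel += M - 1                                          # shift to start from 0
index = rel[0] * (2 * M - 1) + rel[1]                 # flatten to table index

print(index.shape, index.max() + 1)                   # (49, 49), 169 = (2·7−1)²
```

Each of the 49×49 query-key pairs looks up one of 169 learned bias scalars per head.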

3. Architecture Variants

  • The base model, called Swin-B, has a model size and computational complexity similar to ViT-B/DeiT-B.
  • Swin-T, Swin-S and Swin-L are versions of about 0.25×, 0.5× and 2× the model size and computational complexity, respectively. The complexities of Swin-T and Swin-S are similar to those of ResNet-50 (DeiT-S) and ResNet-101, respectively.
  • The window size is set to M=7 by default. The query dimension of each head is d=32, and the expansion ratio of each MLP is α=4, for all experiments. The architecture hyper-parameters of these model variants are:
  1. Swin-T: C=96, layer numbers = {2, 2, 6, 2}
  2. Swin-S: C=96, layer numbers ={2, 2, 18, 2}
  3. Swin-B: C=128, layer numbers ={2, 2, 18, 2}
  4. Swin-L: C=192, layer numbers ={2, 2, 18, 2}
  • where C is the channel number of the hidden layers in the first stage.
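Since the hidden dimensions follow C, 2C, 4C, 8C across the four stages, the per-stage widths of each variant can be tabulated directly from the hyper-parameters above:

```python
# (C, layer numbers per stage) for each variant, as listed above
variants = {
    "Swin-T": (96,  (2, 2, 6, 2)),
    "Swin-S": (96,  (2, 2, 18, 2)),
    "Swin-B": (128, (2, 2, 18, 2)),
    "Swin-L": (192, (2, 2, 18, 2)),
}

for name, (C, depths) in variants.items():
    dims = [C * 2**i for i in range(4)]   # stage widths: C, 2C, 4C, 8C
    print(name, depths, dims)
# e.g. Swin-T -> stage dims [96, 192, 384, 768]
```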

4. SOTA Comparison

4.1. Image Classification on ImageNet

Comparison of different backbones on ImageNet-1K classification

4.1.1. ImageNet-1K Training

  • Compared with the state-of-the-art ConvNets, i.e. RegNet and EfficientNet, the Swin Transformer achieves a slightly better speed-accuracy trade-off.

4.1.2. ImageNet-22K Pretraining

  • For Swin-B, the ImageNet-22K pre-training brings 1.8%~1.9% gains over training on ImageNet-1K from scratch.
  • The larger Swin-L model achieves 87.3% top-1 accuracy, +0.9% better than that of the Swin-B model.

4.2. Object Detection on COCO

Results on COCO object detection and instance segmentation
  • Swin-T architecture brings consistent +3.4~4.2 box AP gains over ResNet-50, with slightly larger model size, FLOPs and latency.
  • The results of Swin-T are +2.5 box AP and +2.3 mask AP higher than DeiT-S with similar model size (86M vs. 80M) and significantly higher inference speed (15.3 FPS vs. 10.4 FPS).

4.3. Semantic Segmentation on ADE20K

Results of semantic segmentation on the ADE20K val and test set
  • UPerNet is used as the framework.
  • Swin-S is +5.3 mIoU higher (49.3 vs. 44.0) than DeiT-S with similar computation cost. It is also +4.4 mIoU higher than ResNet-101, and +2.4 mIoU higher than ResNeSt-101.

5. Ablation Study

Ablation study on the shifted windows approach and different position embedding methods on three benchmarks, using the Swin-T architecture
  • Swin-T with the shifted window partitioning outperforms the counterpart built on a single window partitioning at each stage by +1.1% top-1 accuracy on ImageNet-1K, +2.8 box AP/+2.2 mask AP on COCO, and +2.8 mIoU on ADE20K.
  • Swin-T with relative position bias yields +1.2%/+0.8% top-1 accuracy on ImageNet-1K, +1.3/+1.5 box AP and +1.1/+1.3 mask AP on COCO, and +2.3/+2.9 mIoU on ADE20K in relation to those without position encoding and with absolute position embedding, respectively.
Real speed of different self-attention computation methods and implementations on a V100 GPU
  • The cyclic implementation is more hardware efficient than naive padding, particularly for deeper stages. Overall, it brings a 13%, 18% and 18% speed-up on Swin-T, Swin-S and Swin-B, respectively.
Accuracy of Swin Transformer using different methods for self-attention computation on three benchmarks
  • Swin Transformer architectures are slightly faster, while achieving +2.3% top-1 accuracy compared to Performer [14] on ImageNet-1K using Swin-T.

6. Swin-Mixer

Performance of Swin MLP-Mixer on ImageNet-1K classification
  • Many more experiments appear in the appendix. One of them applies the proposed hierarchical design and the shifted window approach to MLP-Mixer, referred to as Swin-Mixer.
  • It has a better speed-accuracy trade-off compared to ResMLP.


