Review: Swin Transformer
Using shifted windows, self-attention is limited to local windows while cross-window connections are maintained
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, Swin Transformer, by Microsoft Research Asia
2021 ICCV, Over 750 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Object Detection, Vision Transformer, ViT, Transformer
- A hierarchical Transformer is proposed, whose representation is computed with Shifted windows (Swin).
- The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
- This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.
Outline
- Swin Transformer
- Shifted Window Based Self-Attention
- Architecture Variants
- SOTA Comparison
- Ablation Study
- Swin-Mixer
1. Swin Transformer
1.1. Input
- Swin Transformer (the tiny variant Swin-T is shown in the figure above) first splits an input RGB image into non-overlapping patches with a patch splitting module, like ViT.
- Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. A patch size of 4×4 is used and thus the feature dimension of each patch is 4×4×3=48. A linear embedding layer is applied on this raw-valued feature to project it to an arbitrary dimension C.
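Below is a minimal sketch of this step, assuming PyTorch (the class name PatchEmbed and the defaults are illustrative, not the official implementation): a strided convolution with kernel = stride = 4 is equivalent to flattening each 4×4×3 patch and applying a linear layer.
```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 4x4 patches and project them to dimension C."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        # kernel = stride = patch_size: each patch's 48 raw values are linearly projected to C.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)    # (B, H/4 * W/4, C) patch tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]) -> 56x56 tokens of dimension C=96
```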
1.2. Stage 1
- Several Transformer blocks with modified self-attention computation (Swin Transformer blocks) are applied on these patch tokens. The Transformer blocks maintain the number of tokens (H/4×W/4), and together with the linear embedding are referred to as “Stage 1”.
1.3. Stage 2
- To produce a hierarchical representation, the number of tokens is reduced by patch merging layers as the network gets deeper. The first patch merging layer concatenates the features of each group of 2×2 neighboring patches, and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a multiple of 2×2 = 4 (2× downsampling of resolution).
- The output dimension is set to 2C, and Swin Transformer blocks are applied afterwards for feature transformation, with the resolution kept at H/8×W/8. This first block of patch merging and feature transformation is denoted as “Stage 2” (see the sketch below).
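A minimal sketch of a patch merging layer, assuming PyTorch (class name and interface are illustrative): the features of each 2×2 group of neighboring tokens are concatenated to 4C dimensions and projected to 2C.
```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring tokens (C -> 4C) and project to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four tokens of every 2x2 neighborhood.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)               # (B, H/2 * W/2, 4C)
        return self.reduction(self.norm(x))    # (B, H/2 * W/2, 2C)

x = torch.randn(1, 56 * 56, 96)                # Stage-1 tokens of Swin-T
print(PatchMerging(96)(x, 56, 56).shape)       # torch.Size([1, 784, 192]) -> 28x28 tokens, 2C=192
```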
1.4. Stage 3 & Stage 4
- The procedure is repeated twice, as “Stage 3” and “Stage 4”, with output resolutions of H/16×W/16 and H/32×W/32, respectively.
- These stages jointly produce a hierarchical representation with the same feature map resolutions as those of typical convolutional networks, such as VGGNet and ResNet, so Swin Transformer can conveniently replace the backbone networks in existing methods for various vision tasks.
1.5. Swin Transformer Block
- A Swin Transformer block is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows, with the other layers kept the same: a (shifted) window based MSA module is followed by a 2-layer MLP with GELU non-linearity, a LayerNorm (LN) layer is applied before each module, and a residual connection is applied after each module.
2. Shifted Window Based Self-Attention
2.1. Window Based Self-Attention (W-MSA)
Self-attention is computed within local windows.
- Supposing each window contains M×M patches, the computational complexities of a global MSA module and a window based one on an image of h×w patches are:
Ω(MSA) = 4hwC² + 2(hw)²C
Ω(W-MSA) = 4hwC² + 2M²hwC
- where the former is quadratic to patch number hw, and the latter is linear when M is fixed (set to 7 by default).
Global self-attention computation (in standard ViT) is generally unaffordable for a large hw, while the window based self-attention is scalable.
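A minimal sketch of the window partitioning behind W-MSA, assuming PyTorch (the helper name window_partition is illustrative): tokens are reshaped into non-overlapping M×M windows so that self-attention can be computed per window instead of globally.
```python
import torch

def window_partition(x, M):
    """Split a feature map (B, H, W, C) into non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    # (num_windows*B, M*M, C): attention is then computed independently per window,
    # so the cost grows linearly with h*w instead of quadratically.
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

x = torch.randn(1, 56, 56, 96)          # Stage-1 feature map of Swin-T
windows = window_partition(x, 7)        # M = 7
print(windows.shape)                    # torch.Size([64, 49, 96]) -> 8x8 windows of 49 tokens
```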
2.2. Shifted Window Partitioning in Successive Blocks (SW-MSA)
- The window-based self-attention module lacks connections across windows, which limits its modeling power.
- A shifted window partitioning approach is proposed which alternates between two partitioning configurations in consecutive Swin Transformer blocks.
The first module uses a regular window partitioning strategy.
Then, the next module adopts a windowing configuration that is shifted from that of the preceding layer, by displacing the windows by (⌊M/2⌋, ⌊M/2⌋) pixels from the regularly partitioned windows.
- Thus, consecutive Swin Transformer blocks are computed as:
ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
- where ẑ^l and z^l denote the output features of the (S)W-MSA module and the MLP module for block l, respectively, and z^(l−1) is the output features from the previous layer.
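A minimal sketch of how two consecutive blocks alternate between W-MSA and SW-MSA, assuming PyTorch; the attention modules are passed in as placeholders (identity stubs in the smoke test), so this only illustrates the LN/residual/MLP structure of the equations above, not the full attention computation.
```python
import torch
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Two consecutive blocks: regular-window attention, then shifted-window attention."""
    def __init__(self, dim, attn_regular, attn_shifted, mlp_ratio=4):
        super().__init__()
        self.norm1, self.attn1 = nn.LayerNorm(dim), attn_regular   # W-MSA (placeholder module)
        self.norm2, self.mlp1 = nn.LayerNorm(dim), nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))
        self.norm3, self.attn2 = nn.LayerNorm(dim), attn_shifted   # SW-MSA (placeholder module)
        self.norm4, self.mlp2 = nn.LayerNorm(dim), nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):
        z = z + self.attn1(self.norm1(z))   # ẑ^l     = W-MSA(LN(z^(l-1))) + z^(l-1)
        z = z + self.mlp1(self.norm2(z))    # z^l     = MLP(LN(ẑ^l)) + ẑ^l
        z = z + self.attn2(self.norm3(z))   # ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
        z = z + self.mlp2(self.norm4(z))    # z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
        return z

# Smoke test with identity stubs standing in for the window attention modules.
blk = SwinBlockPair(96, nn.Identity(), nn.Identity())
print(blk(torch.randn(1, 3136, 96)).shape)   # torch.Size([1, 3136, 96])
```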
2.3. Efficient Batch Computation for Shifted Configuration (Cyclic Shift)
- An issue with shifted window partitioning is that it will result in more windows, from ⌈h/M⌉×⌈w/M⌉ to (⌈h/M⌉+1)×(⌈w/M⌉+1), and some of the windows will be smaller than M×M.
- A more efficient batch computation approach is proposed by cyclic-shifting toward the top-left direction, as shown above. In this shifted configuration, a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window.
- With the cyclic shift, the number of batched windows remains the same as with regular window partitioning, and the computation is thus also efficient (see the sketch below).
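A minimal sketch of the cyclic shift, assuming PyTorch: torch.roll displaces the feature map toward the top-left before the regular window partition and rolls it back afterwards; the window partition, masked attention, and window reverse steps are elided.
```python
import torch

M = 7                                     # window size
shift = M // 2                            # shift size

x = torch.randn(1, 56, 56, 96)            # (B, H, W, C) feature map

# Cyclic shift toward the top-left; regular window partitioning is then applied as usual,
# with an attention mask restricting attention to tokens from the same sub-window.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))

# ... window partition, masked attention within windows, window reverse ...

# Reverse the cyclic shift to restore the original spatial layout.
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
assert torch.equal(restored, x)
```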
2.4. Relative Position Bias (Rel. Pos.)
- A relative position bias B is added to each head in computing similarity:
Attention(Q, K, V) = SoftMax(QKᵀ/√d + B)V
- Significant improvements are observed over counterparts without this bias term or with absolute position embedding.
- Further adding absolute position embedding to the input drops performance slightly.
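A minimal sketch of how the bias B can be built, assuming PyTorch (variable names are illustrative): since the relative displacement along each axis lies in [−M+1, M−1], a learnable table with (2M−1)×(2M−1) entries per head is indexed for every pair of positions in an M×M window.
```python
import torch
import torch.nn as nn

M, num_heads = 7, 3
# Learnable bias table: one entry per possible 2-D relative displacement, per head.
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

# Precompute, for every pair of positions in an MxM window, the index into the table.
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                                    # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]                 # (2, M*M, M*M) displacements
rel = rel.permute(1, 2, 0) + (M - 1)                          # shift values to start from 0
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]             # (M*M, M*M)

# B has shape (num_heads, M*M, M*M) and is added to QK^T / sqrt(d) before the softmax.
B = bias_table[index].permute(2, 0, 1)
print(B.shape)   # torch.Size([3, 49, 49])
```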
3. Architecture Variants
- The base model, called Swin-B, has a model size and computational complexity similar to ViT-B/DeiT-B.
- Swin-T, Swin-S and Swin-L are versions of about 0.25×, 0.5× and 2× the model size and computational complexity, respectively. The complexities of Swin-T and Swin-S are similar to those of ResNet-50 (DeiT-S) and ResNet-101, respectively.
- The window size is set to M=7 by default. The query dimension of each head is d=32, and the expansion layer of each MLP is α=4, for all experiments. The architecture hyper-parameters of these model variants are:
- Swin-T: C=96, layer numbers = {2, 2, 6, 2}
- Swin-S: C=96, layer numbers ={2, 2, 18, 2}
- Swin-B: C=128, layer numbers ={2, 2, 18, 2}
- Swin-L: C=192, layer numbers ={2, 2, 18, 2}
- where C is the channel number of the hidden layers in the first stage.
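Restating these hyper-parameters as a small configuration dictionary (hypothetical, for illustration only); the channel count doubles at each patch merging, giving C, 2C, 4C, 8C across the four stages.
```python
# Channel number C of the first stage and number of blocks per stage for each variant.
SWIN_VARIANTS = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}

# Window size M = 7, per-head query dimension d = 32, MLP expansion alpha = 4 for all variants.
for name, cfg in SWIN_VARIANTS.items():
    channels = [cfg["C"] * 2 ** i for i in range(4)]   # C, 2C, 4C, 8C
    print(name, channels, cfg["depths"])
```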
4. SOTA Comparison
4.1. Image Classification on ImageNet
4.1.1. ImageNet-1K Training
Swin Transformers noticeably surpass the counterpart DeiT architectures with similar complexities: +1.5% for Swin-T (81.3%) over DeiT-S (79.8%) using 224² input, and +1.5%/1.4% for Swin-B (83.3%/84.5%) over DeiT-B (81.8%/83.1%) using 224²/384² input, respectively.
- Compared with the state-of-the-art ConvNets, i.e. RegNet and EfficientNet, the Swin Transformer achieves a slightly better speed-accuracy trade-off.
4.1.2. ImageNet-22K Pretraining
- For Swin-B, the ImageNet-22K pre-training brings 1.8%~1.9% gains over training on ImageNet-1K from scratch.
Swin Transformer models achieve significantly better speed-accuracy trade-offs: Swin-B obtains 86.4% top-1 accuracy, which is 2.4% higher than that of ViT with similar inference throughput (84.7 vs. 85.9 images/sec) and slightly lower FLOPs (47.0G vs. 55.4G).
- The larger Swin-L model achieves 87.3% top-1 accuracy, +0.9% better than that of the Swin-B model.
4.2. Object Detection on COCO
- The Swin-T architecture brings consistent +3.4~4.2 box AP gains over ResNet-50, with slightly larger model size, FLOPs and latency.
Compared with ResNeXt, Swin Transformer achieves a high detection accuracy of 51.9 box AP and 45.0 mask AP, significant gains of +3.6 box AP and +3.3 mask AP over ResNeXt101-64×4d, which has similar model size, FLOPs and latency.
- The results of Swin-T are +2.5 box AP and +2.3 mask AP higher than DeiT-S with similar model size (86M vs. 80M) and significantly higher inference speed (15.3 FPS vs. 10.4 FPS).
The best model achieves 58.7 box AP and 51.1 mask AP on COCO test-dev, surpassing the previous best results by +2.7 box AP (Copy-paste [26] without external data) and +2.6 mask AP (DetectoRS [46]).
4.3. Semantic Segmentation on ADE20K
- UPerNet is used as the framework.
- Swin-S is +5.3 mIoU higher (49.3 vs. 44.0) than DeiT-S with similar computation cost. It is also +4.4 mIoU higher than ResNet-101, and +2.4 mIoU higher than ResNeSt-101.
Swin-L model with ImageNet-22K pre-training achieves 53.5 mIoU on the val set, surpassing the previous best model by +3.2 mIoU (50.3 mIoU by SETR [81] which has a larger model size).
5. Ablation Study
- Swin-T with the shifted window partitioning outperforms the counterpart built on a single window partitioning at each stage by +1.1% top-1 accuracy on ImageNet-1K, +2.8 box AP/+2.2 mask AP on COCO, and +2.8 mIoU on ADE20K.
- Swin-T with relative position bias yields +1.2%/+0.8% top-1 accuracy on ImageNet-1K, +1.3/+1.5 box AP and +1.1/+1.3 mask AP on COCO, and +2.3/+2.9 mIoU on ADE20K in relation to those without position encoding and with absolute position embedding, respectively.
- The cyclic implementation is more hardware efficient than naive padding, particularly for deeper stages. Overall, it brings a 13%, 18% and 18% speed-up on Swin-T, Swin-S and Swin-B, respectively.
- Swin Transformer architectures are slightly faster, while achieving +2.3% top-1 accuracy compared to Performer [14] on ImageNet-1K using Swin-T.
Reference
[2021 ICCV] [Swin Transformer]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Image Classification
1989–2019 … 2020: [Random Erasing (RE)] [SAOL] [AdderNet] [FixEfficientNet] [BiT] [RandAugment] [ImageNet-ReaL]
2021: [Learned Resizer] [Vision Transformer, ViT] [ResNet Strikes Back] [DeiT] [EfficientNetV2] [MLP-Mixer] [T2T-ViT] [Swin Transformer]