Review — CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

CSWin Transformer, Cross-Shaped Window for Self Attention

  • A Cross-Shaped Window (CSWin) self-attention mechanism is proposed, which computes self-attention in horizontal and vertical stripes in parallel; together, the two stripes form a cross-shaped window.
  • Locally-enhanced Positional Encoding (LePE) is proposed, which handles the local positional information better than existing encoding schemes.


  1. CSWin Transformer
  2. Locally-enhanced Positional Encoding (LePE)
  3. Model Variants
  4. Experimental Results
  5. Ablation Study

1. CSWin Transformer

1.1. Framework

Left: the overall architecture of our proposed CSWin Transformer, Right: the illustration of CSWin Transformer block.
  • For an input image of size H×W×3, the authors follow CvT and leverage overlapped convolutional token embedding (a 7×7 convolution layer with stride 4) to obtain H/4×W/4 patch tokens, where the dimension of each token is C.
  • To produce a hierarchical representation, the whole network consists of four stages. A convolution layer (3 × 3, stride 2) is used between two adjacent stages to reduce the number of tokens and double the channel dimension.
  • Each stage consists of Ni sequential CSWin Transformer Blocks and maintains the number of tokens. The CSWin Transformer Block has a similar overall topology to the vanilla multi-head self-attention Transformer block, with two differences:
  1. It replaces the self-attention mechanism with the proposed Cross-Shaped Window Self-Attention.
  2. In order to introduce the local inductive bias, LePE is added as a parallel module to the self-attention branch.
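The stage-wise token bookkeeping above can be traced with a few lines of arithmetic. The sketch below assumes a 224×224 input and a base channel dimension C=64 (the CSWin-T setting); other variants only change C:

```python
# Sketch of the stage-wise token shapes in CSWin Transformer
# (assumes a 224x224 input and base channel dim C=64, as in CSWin-T).

def stage_shapes(h=224, w=224, c=64, num_stages=4):
    """Overlapped conv token embedding (7x7, stride 4) gives H/4 x W/4 tokens;
    a 3x3 stride-2 conv between stages halves each side and doubles channels."""
    h, w = h // 4, w // 4
    shapes = []
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes

print(stage_shapes())
# [(56, 56, 64), (28, 28, 128), (14, 14, 256), (7, 7, 512)]
```

The last stage ends at a 7×7 token map, which is why a stripe width of 7 there amounts to full attention.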

1.2. Cross-Shaped Window (CSWin) Self-Attention

Illustration of different self-attention mechanisms
  • Full global self-attention has computation cost quadratic in the number of tokens. To alleviate this issue, existing works perform self-attention within a local attention window and apply halo or shifted windows to enlarge the receptive field. However, the attention area of each block is still limited, which requires stacking more blocks to achieve a global receptive field.
  • Figure Left: The input feature X is first linearly projected to K heads, and then each head will perform local self-attention within either the horizontal or vertical stripes.
  • For horizontal stripes self-attention, X is evenly partitioned into M non-overlapping horizontal stripes [X^1, …, X^M] of equal width sw, and each stripe contains sw×W tokens. The k-th head applies attention within each stripe and concatenates the results:

    Y_k^i = Attention(X^i W_k^Q, X^i W_k^K, X^i W_k^V), i = 1, …, M
    H-Attention_k(X) = [Y_k^1, Y_k^2, …, Y_k^M]

  • where M = H/sw, and W_k^Q, W_k^K, W_k^V ∈ R^{C×d_k} are the projections of the k-th head, with head dimension d_k = C/K.
  • The vertical stripes self-attention can be similarly derived, and its output for k-th head is denoted as V-Attentionk(X).
  • Assuming natural images do not have directional bias, the K heads are equally split into two parallel groups (each has K/2 heads).
  • The first group of heads perform horizontal stripes self-attention while the second group of heads perform vertical stripes self-attention.
  • Finally, the outputs of these two parallel groups are concatenated back together and projected:

    CSWin-Attention(X) = Concat(head_1, …, head_K) W^O
    head_k = H-Attention_k(X) for k = 1, …, K/2; head_k = V-Attention_k(X) for k = K/2+1, …, K
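As a concrete illustration of the stripe partition above, here is a minimal NumPy sketch of CSWin-style attention. It uses identity Q/K/V projections and omits LePE and the output projection for brevity, so it is a toy model of the mechanism, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stripe_attention(q, k, v):
    """Scaled dot-product attention within one stripe; q, k, v: (tokens, d_k)."""
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def cswin_attention(x, sw, num_heads):
    """Toy CSWin self-attention: the first K/2 heads attend within
    horizontal stripes of width sw, the rest within vertical stripes.
    x: (H, W, C); identity projections for brevity."""
    H, W, C = x.shape
    d_k = C // num_heads
    out = np.empty_like(x)
    for h in range(num_heads):
        xh = x[..., h * d_k:(h + 1) * d_k]            # this head's channels
        yh = np.empty_like(xh)
        if h < num_heads // 2:                         # horizontal stripes
            for i in range(0, H, sw):
                s = xh[i:i + sw].reshape(-1, d_k)      # sw * W tokens
                yh[i:i + sw] = stripe_attention(s, s, s).reshape(sw, W, d_k)
        else:                                          # vertical stripes
            for j in range(0, W, sw):
                s = xh[:, j:j + sw].reshape(-1, d_k)   # H * sw tokens
                yh[:, j:j + sw] = stripe_attention(s, s, s).reshape(H, sw, d_k)
        out[..., h * d_k:(h + 1) * d_k] = yh
    return out

x = np.random.rand(8, 8, 4)
y = cswin_attention(x, sw=2, num_heads=2)
print(y.shape)  # (8, 8, 4)
```

Because the two head groups run in parallel on disjoint channel slices, one block already mixes information along both axes.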

1.3. Computational Complexity

  • The computation complexity of CSWin self-attention is:

    Ω(CSWin) = HWC × (4C + sw·H + sw·W)

  compared with Ω = HWC × (4C + 2HW) for full global self-attention.
  • For high-resolution inputs, H and W are larger than C in the early stages and smaller than C in the later stages. Therefore, a small sw is chosen for early stages and a larger sw for later stages, which enlarges the attention area at an affordable cost.
  • By default, sw is set to 1, 2, 7, 7 for the four stages.
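Plugging the stage-1 feature map of a 224×224 input into the two complexity formulas above makes the saving concrete (helper names here are illustrative, and the formulas count the dominant matrix-multiply terms only):

```python
def flops_full_attention(h, w, c):
    """Global self-attention: HWC * (4C + 2HW)."""
    return h * w * c * (4 * c + 2 * h * w)

def flops_cswin_attention(h, w, c, sw):
    """CSWin self-attention: HWC * (4C + sw*H + sw*W)."""
    return h * w * c * (4 * c + sw * h + sw * w)

# Stage 1 of a 224x224 input: 56x56 tokens, C=64, sw=1.
print(flops_full_attention(56, 56, 64) / 1e9)      # ~1.31 GFLOPs
print(flops_cswin_attention(56, 56, 64, 1) / 1e9)  # ~0.074 GFLOPs
```

The quadratic 2HW term dominates at high resolution, which is exactly where the linear sw·(H+W) term of CSWin pays off.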

2. Locally-enhanced Positional Encoding (LePE)

Comparison among different positional encoding mechanisms
  • The self-attention computation could be formulated as:

    Attention(Q, K, V) = SoftMax(QK^T/√d) V

  • where the input sequence x = (x1, …, xn) of n elements is projected to queries Q, keys K, and values V, and the output of the attention z = (z1, …, zn) has the same length.
  • The proposed Locally-Enhanced Positional Encoding then performs as a learnable per-element bias applied directly to the values:

    Attention(Q, K, V) = SoftMax(QK^T/√d) V + LePE(V)

  • where LePE is implemented efficiently as a depth-wise convolution on V.
  • To make LePE suitable for varying input sizes, a distance threshold τ is applied: the bias is set to 0 whenever the Chebyshev distance between tokens i and j exceeds τ (τ=3 in the default setting).
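The Chebyshev-distance threshold defines which token pairs may carry a non-zero LePE bias. A small NumPy sketch of that neighborhood mask (the function name is illustrative):

```python
import numpy as np

def chebyshev_mask(h, w, tau=3):
    """Boolean mask over token pairs of an h x w token map: True where the
    Chebyshev (L-infinity) distance between the 2-D positions of tokens
    i and j is <= tau, i.e. where the LePE bias may be non-zero."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1)       # (h*w, 2)
    d = np.abs(pos[:, None, :] - pos[None, :, :]).max(-1)  # pairwise Chebyshev
    return d <= tau

m = chebyshev_mask(7, 7, tau=3)
print(m.shape, int(m[0].sum()))  # (49, 49) 16
```

For the corner token at (0, 0), only the 4×4 block of positions with both coordinates in 0..3 survives the threshold, hence 16 admissible neighbors (itself included).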

3. Model Variants

  • Finally, the CSWin Transformer Block becomes:

    X̂^l = CSWin-Attention(LN(X^{l−1})) + X^{l−1}
    X^l = MLP(LN(X̂^l)) + X̂^l
  • 4 variants are defined:
Detailed configurations of different variants of CSWin Transformer.
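The pre-norm residual update of the block can be sketched in a few lines. Identity functions stand in for the attention and MLP sub-modules here, so this shows only the wiring, not the learned components:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token LayerNorm over the channel axis (no affine parameters)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cswin_block(x, attn, mlp):
    """Pre-norm residual block:
       x_hat = attn(LN(x)) + x
       out   = mlp(LN(x_hat)) + x_hat"""
    x_hat = attn(layer_norm(x)) + x
    return mlp(layer_norm(x_hat)) + x_hat

x = np.random.rand(49, 64)  # 7x7 tokens flattened, C=64
out = cswin_block(x, attn=lambda t: t, mlp=lambda t: t)
print(out.shape)  # (49, 64)
```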

4. Experimental Results

4.1. ImageNet

Comparison of different models on ImageNet-1K.
  • For example, CSWin-T achieves 82.7% Top-1 accuracy with only 4.3G FLOPs, surpassing CvT-13, Swin-T and DeiT-S by 1.1%, 1.4% and 2.9% respectively.
  • For the small and base model settings, CSWin-S and CSWin-B also achieve the best performance.
  • When finetuned on the 384×384 input, a similar trend is observed.
ImageNet-1K fine-tuning results by pre-training on ImageNet-21K datasets.
  • When pre-training CSWin Transformer on ImageNet-21K dataset, for CSWin-B, the large-scale data of ImageNet-21K brings a 1.6%∼1.7% gain. CSWin-B and CSWin-L achieve 87.0% and 87.5% top-1 accuracy, surpassing previous methods.

4.2. COCO

Object detection and instance segmentation performance on the COCO val2017 with the Mask R-CNN framework.
  • CSWin-T outperforms Swin-T by +4.5 box AP, +3.1 mask AP with the 1× schedule and +3.0 box AP, +2.0 mask AP with the 3× schedule respectively.
Object detection and instance segmentation performance on the COCO val2017 with Cascade Mask R-CNN.
  • When using Cascade Mask R-CNN, CSWin Transformers still surpass the counterparts by promising margins under different model configurations.

4.3. ADE20K

Performance comparison of different backbones on the ADE20K segmentation task. (+: ImageNet-21K)
  • CSWin-T, CSWin-S, and CSWin-B achieve +6.7, +4.0, +3.9 higher mIoU than the Swin counterparts with the Semantic FPN framework, and +4.8, +2.8, +3.0 higher mIoU with the UPerNet framework.
  • When using the ImageNet-21K pretrained model, CSWin-L further achieves 55.7 mIoU and surpasses the previous best model by +2.2 mIoU, while using less computation complexity.

4.4. Inference Speed

FPS comparison with Swin on downstream tasks.
  • In most cases, the proposed model is only slightly slower than Swin (less than 10% lower FPS), while outperforming Swin by large margins in accuracy.

5. Ablation Study

Stripes-Based attention mechanism comparison.
Ablation on dynamic window size.
Comparison of different self-attention mechanisms.
Comparison of different positional encoding mechanisms.


[2022 CVPR] [CSWin Transformer]
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
