Review — Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Twins-PCPVT & Twins-SVT, On Par With or Outperforms Swin Transformer

  • Two Vision Transformer architectures, Twins-PCPVT and Twins-SVT, are proposed.
  • Twins-PCPVT: The global sub-sampled attention of PVT is kept, and applicable positional encodings (the conditional position encodings of CPVT) are used.
  • Twins-SVT: Spatially Separable Self-Attention (SSSA) is proposed, which is composed of two types of attention operations — (i) Locally-grouped Self-Attention (LSA), and (ii) Global Sub-sampled Attention (GSA), where LSA captures the fine-grained and short-distance information and GSA deals with the long-distance and global information.
  • (For a quick read, please read Sections 1, 2, and 3.1.)

Outline

  1. Twins-PCPVT
  2. Twins-SVT
  3. SOTA Comparisons
  4. Ablation Study

1. Twins-PCPVT

1.1. Problems in PVT

  • The weaker performance of PVT is found to be mainly due to the absolute positional encodings it employs.
  • Absolute positional encodings have difficulty handling inputs of varying sizes. Moreover, they break translation invariance.
  • In contrast, Swin Transformer uses relative positional encodings, which is the main reason why Swin outperforms PVT.

1.2. Twins-PCPVT

Architecture of Twins-PCPVT-S, “PEG” is the positional encoding generator from CPVT
  • The conditional position encoding (CPE) proposed in CPVT is used to replace the absolute PE in PVT.
  • The position encoding generator (PEG) in CPVT, which generates the CPE, is placed after the first encoder block of each stage.
  • The simplest form of PEG is used, i.e., a 2D depth-wise convolution without batch normalization.
  • For image-level classification, following CPVT, the class token is removed and global average pooling (GAP) is used at the end of the stage.
  • (Please feel free to read PVT and CPVT if interested.)
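
To make the PEG idea concrete, here is a minimal pure-Python sketch of a zero-padded 3×3 depth-wise convolution applied to a feature map stored as nested lists [channel][row][column]. The function names, the 3×3 kernel size, and the residual addition in peg are assumptions made for illustration; they are not taken from the Twins code.

```python
def depthwise_conv3x3(feature_map, kernels):
    """Zero-padded 3x3 depth-wise convolution (cross-correlation form).

    Channel c is filtered only with kernels[c]; channels are never mixed.
    """
    out = []
    for channel, kernel in zip(feature_map, kernels):
        height, width = len(channel), len(channel[0])
        result = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                total = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        # Positions outside the map count as zero (zero padding).
                        if 0 <= yy < height and 0 <= xx < width:
                            total += kernel[dy + 1][dx + 1] * channel[yy][xx]
                result[y][x] = total
        out.append(result)
    return out


def peg(feature_map, kernels):
    """Assumed PEG form: input tokens plus their depth-wise-convolved neighbourhood."""
    encoded = depthwise_conv3x3(feature_map, kernels)
    return [
        [[a + b for a, b in zip(row_in, row_enc)]
         for row_in, row_enc in zip(ch_in, ch_enc)]
        for ch_in, ch_enc in zip(feature_map, encoded)
    ]


ones_kernel = [[1.0] * 3 for _ in range(3)]
constant_map = [[[1.0] * 3 for _ in range(3)]]  # one channel, 3x3, all ones
print(depthwise_conv3x3(constant_map, [ones_kernel])[0])
```

On a constant input, the corner, edge, and centre tokens receive different values purely because of the zero padding, which gives an intuition for how a convolution can produce position-dependent encodings without a fixed-size positional table.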

2. Twins-SVT

2.1. Problems in ViT

  • Vision Transformers suffer severely from the heavy computational complexity in dense prediction tasks due to high-resolution inputs.
  • Given an input of H×W resolution, the complexity of self-attention with dimension d is O(H²W²d),
  • where H=W=224 is popular in classification.
(a) Twins-SVT interleaves locally-grouped attention (LSA) and global sub-sampled attention (GSA). (b) Schematic view of the locally-grouped attention (LSA) and global sub-sampled attention (GSA).
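
The quadratic growth of global self-attention is easy to check numerically. The sketch below only evaluates the leading-order term (HW)²·d; the 56×56 and 112×112 token grids and d=64 are illustrative values, not settings taken from the paper.

```python
def global_attention_cost(height, width, dim):
    """Leading-order cost of full self-attention over H*W tokens: (H*W)^2 * d."""
    tokens = height * width
    return tokens * tokens * dim


base = global_attention_cost(56, 56, 64)
doubled = global_attention_cost(112, 112, 64)
# Doubling both H and W gives 4x the tokens and therefore 16x the cost.
print(doubled // base)  # 16
```

This is why dense prediction tasks, which feed much larger inputs than classification, are the pressure point for plain global attention.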

2.2. Locally-grouped Self-Attention (LSA)

  • The 2D feature maps are first equally divided into sub-windows, making self-attention communications only happen within each sub-window.
  • To be specific, the feature maps are divided into m×n sub-windows.
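  • Each group contains HW/(mn) elements, and thus the computation cost of the self-attention in one window is O((H²W²/(m²n²))·d).
  • Summed over all m×n windows, the total cost is O((H²W²/(mn))·d).
  • If k1=H/m and k2=W/n, then the total cost becomes O(k1·k2·HWd), which grows linearly with HW when k1 and k2 are fixed.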
  • Yet, a mechanism to communicate between different sub-windows is still needed. Otherwise, the information would be limited to be processed locally.
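
A small sketch of the bookkeeping behind LSA follows: tokens are grouped by the window they fall into, and the cost is counted per window. It assumes H and W are divisible by k1 and k2; the helper names and the 56×56 grid with 7×7 windows and d=64 are illustrative choices.

```python
def partition_windows(height, width, k1, k2):
    """Map each window index (i, j) to the token coordinates it contains."""
    windows = {}
    for y in range(height):
        for x in range(width):
            windows.setdefault((y // k1, x // k2), []).append((y, x))
    return windows


def lsa_cost(height, width, dim, k1, k2):
    """Total LSA cost: m*n windows, each costing (k1*k2)^2 * d."""
    num_windows = (height // k1) * (width // k2)
    per_window = (k1 * k2) ** 2 * dim
    return num_windows * per_window


windows = partition_windows(56, 56, 7, 7)
print(len(windows), len(windows[(0, 0)]))  # 64 windows of 49 tokens each

global_cost = (56 * 56) ** 2 * 64
print(global_cost // lsa_cost(56, 56, 64, 7, 7))  # 64, i.e. HW / (k1*k2)
```

Tokens that land in different dictionary entries never attend to each other, which is exactly the locality limitation that motivates GSA.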

2.3. Global Sub-sampled Attention (GSA)

  • A single representative is used for each window, summarizing the important information of each of the m×n sub-windows.
  • This representative is used to communicate with other sub-windows (serving as the key in self-attention), which can dramatically reduce the cost to O(mnHWd) = O(H²W²d/(k1·k2)).
  • This is essentially equivalent to using the sub-sampled feature maps as the key in attention operations.
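
As a rough illustration of the sub-sampling step, the sketch below summarizes each k1×k2 window of a single-channel map by its mean and counts the resulting attention cost. Average pooling is used here only because it is the simplest option to write down; the function names are hypothetical, and H and W are assumed divisible by k1 and k2.

```python
def summarize_windows(feature_map, k1, k2):
    """Average each k1 x k2 window of a 2D map into one representative value."""
    height, width = len(feature_map), len(feature_map[0])
    summary = []
    for top in range(0, height, k1):
        row = []
        for left in range(0, width, k2):
            window = [feature_map[y][x]
                      for y in range(top, top + k1)
                      for x in range(left, left + k2)]
            row.append(sum(window) / len(window))
        summary.append(row)
    return summary


def gsa_cost(height, width, dim, k1, k2):
    """Cost when all H*W queries attend to the m*n window representatives."""
    m, n = height // k1, width // k2
    return m * n * height * width * dim


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(summarize_windows(grid, 2, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```

With a 56×56 grid and 7×7 windows, every query sees 64 keys instead of 3136, so the cost drops by a factor of k1·k2 = 49 relative to global attention.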

2.4. Spatially Separable Self-Attention (SSSA): Combine LSA and GSA

  • If we alternately use the aforementioned LSA and GSA like separable convolutions (depth-wise + point-wise), the total cost is O(H²W²d/(k1·k2) + k1·k2·HWd),
  • with H²W²d/(k1·k2) + k1·k2·HWd ≥ 2HWd·√(HW).
  • The minimum is obtained when k1·k2 = √(HW).
  • k1=k2=15 is close to the global minimum for H=W=224.
  • Stage 1 has feature maps of 56×56, for which the minimum is obtained when k1=k2=√56≈7.
  • As for stages with lower resolutions, the summarizing window size of GSA is controlled to avoid generating too few keys. Specifically, sizes of 4, 2 and 1 are used for the last three stages, respectively.
  • As for the sub-sampling function, several options are investigated, including average pooling, depthwise strided convolutions, and regular strided convolutions. Empirical results show that regular strided convolutions perform best here.
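  • The overall SSSA, which combines LSA and GSA, is as below:
      ẑ_ij^l = LSA(LayerNorm(z_ij^(l−1))) + z_ij^(l−1)
      z_ij^l = FFN(LayerNorm(ẑ_ij^l)) + ẑ_ij^l
      ẑ^(l+1) = GSA(LayerNorm(z^l)) + z^l
      z^(l+1) = FFN(LayerNorm(ẑ^(l+1))) + ẑ^(l+1)
  • where FFN is a feed-forward network, and i ∈ {1, …, m}, j ∈ {1, …, n} index the sub-windows.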
  • Both LSA and GSA have multiple heads as in the standard self-attention.
  • The PEG of CPVT is also used, and it is inserted after the first block in each stage.
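
The window-size figures quoted in this section can be verified with a few lines of arithmetic. The sketch below assumes square windows (k1=k2=k), so the optimum condition k1·k2=√(HW) becomes k=(HW)^(1/4); the function names and d=64 are illustrative.

```python
import math


def sssa_cost(height, width, dim, k1, k2):
    """GSA term plus LSA term for one LSA + GSA pair."""
    tokens = height * width
    return tokens * tokens * dim / (k1 * k2) + k1 * k2 * tokens * dim


def optimal_window(height, width):
    """Side length k = k1 = k2 satisfying k1 * k2 = sqrt(H * W)."""
    return math.sqrt(math.sqrt(height * width))


def cost_lower_bound(height, width, dim):
    """AM-GM lower bound on the combined cost: 2 * H*W * d * sqrt(H*W)."""
    tokens = height * width
    return 2 * tokens * dim * math.sqrt(tokens)


print(round(optimal_window(224, 224)))  # 15
print(round(optimal_window(56, 56)))    # 7
```

For H=W=224, the cost at k=15 is within about 0.001% of the lower bound, while k=14 and k=16 are both roughly 0.9% above it, which matches the statement that 15 is close to the global minimum.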

2.5. Twins-SVT Variants & Comparison with PVT & Swin

Twins-SVT Variants
  • There are S, B and L variants.
  • For example, when Twins-SVT-S is converted from PyTorch to TensorRT, its throughput is boosted by 1.7×.

3. SOTA Comparisons

3.1. Classification on ImageNet-1K

Comparisons with state-of-the-art methods for ImageNet-1K classification
  • Twins-PCPVT-S outperforms PVT-Small by 1.4% and obtains results similar to Swin-T with 18% fewer FLOPs.
  • Twins-SVT-S is better than Swin-T with about 35% fewer FLOPs.
  • Other models demonstrate similar advantages.
  • Twins-PCPVT performs on par with the recent state-of-the-art Swin, which is based on much more sophisticated designs as mentioned above.
  • Moreover, Twins-SVT also achieves similar or better results, compared to Swin, indicating that the spatial separable-like design is an effective and promising paradigm.

3.2. Semantic Segmentation on ADE20K

Performance comparisons with different backbones on ADE20K validation dataset
  • Semantic FPN framework is used.
  • With comparable FLOPs, Twins-PCPVT-S outperforms PVT-Small by a large margin (+4.5% mIoU) and surpasses ResNet-50 by 7.6% mIoU. It also outperforms Swin-T by a clear margin.
  • Besides, Twins-PCPVT-B achieves 3.3% higher mIoU than PVT-Medium, and Twins-PCPVT-L surpasses PVT-Large by 4.3% mIoU.
  • Twins-SVT-S achieves better performance (+1.7%) than Swin-T. Twins-SVT-B obtains comparable performance to Swin-S, and Twins-SVT-L outperforms Swin-B by 0.7% mIoU.
  • With the UPerNet framework, which Swin uses, Twins-SVT-S outperforms Swin-T by 1.3% mIoU. Moreover, Twins-SVT-L achieves a new state-of-the-art result of 50.2% mIoU under comparable FLOPs and outperforms Swin-B by 0.5% mIoU.
  • Twins-PCPVT also achieves comparable performance to Swin.

3.3. Object Detection and Segmentation on COCO

Object detection performance on the COCO val2017 split using the RetinaNet framework
  • For 1× schedule object detection with RetinaNet, Twins-PCPVT-S surpasses PVT-Small by 2.6% mAP and Twins-PCPVT-B exceeds PVT-Medium by 2.4% mAP on the COCO val2017 split.
  • Twins-SVT-S outperforms Swin-T by 1.5% mAP while using 12% fewer FLOPs.
  • Twins also outperforms the others by similar margins in the 3× experiments.
Object detection and instance segmentation performance on the COCO val2017 dataset using the Mask R-CNN framework
  • For 1× object segmentation with the Mask R-CNN framework, Twins-PCPVT-S brings similar improvements (+2.5% mAP) over PVT-Small.
  • Compared with PVT-Medium, Twins-PCPVT-B obtains 2.6% higher mAP, which is also on par with that of Swin.
  • Both Twins-SVT-S and Twins-SVT-B achieve better or slightly better performance compared to the counterparts of Swin.

4. Ablation Study

4.1. Configurations of LSA and GSA blocks

Classification performance for different combinations of LSA (L) and GSA (G) blocks based on the small model
  • The models with only LSA fail to obtain good performance (76.9%).
  • An extra global attention layer in the last stage can improve the classification performance by 3.6%.
  • Local-Local-Global (abbr. LLG) also achieves good performance (81.5%).

4.2. Sub-Sampling Functions

Different forms of sub-sampled functions for the global sub-sampled attention (GSA)
  • The first option, regular strided convolution, performs best and is therefore chosen as the default implementation.

4.3. Positional Encodings

Object detection performance on the COCO using different positional encoding strategies
  • The CPVT-based Swin does not achieve improved performance with both frameworks, which indicates that the performance improvements should be attributed to the paradigm of Twins-SVT rather than to the positional encodings.
