Review — Twins: Revisiting the Design of Spatial Attention in Vision Transformers
Twins-PCPVT & Twins-SVT, On Par With or Outperforms Swin Transformer
Twins: Revisiting the Design of Spatial Attention in Vision Transformers
Twins, by Meituan Inc., and The University of Adelaide
2021 NeurIPS, Over 160 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT, Transformer, Swin
- Two Vision Transformer architectures, Twins-PCPVT and Twins-SVT, are proposed.
- Twins-PCPVT: The global sub-sampled attention of PVT is kept, combined with more suitable (conditional) positional encodings.
- Twins-SVT: Spatially Separable Self-Attention (SSSA) is proposed, which is composed of two types of attention operations — (i) Locally-grouped Self-Attention (LSA), and (ii) Global Sub-sampled Attention (GSA), where LSA captures the fine-grained and short-distance information and GSA deals with the long-distance and global information.
- (For quick read, please read 1, 2, and 3.1.)
- SOTA Comparisons
- Ablation Study
1.1. Problems in PVT
- It is found that the less favorable performance of PVT is mainly due to the absolute positional encodings it employs.
- Absolute positional encodings encounter difficulties when processing inputs of varying sizes, and they also break translation invariance.
- In contrast, Swin Transformer makes use of relative positional encodings, which is argued to be the main reason Swin outperforms PVT.
- The conditional position encoding (CPE) proposed in CPVT is used to replace the absolute PE in PVT.
- The position encoding generator (PEG) in CPVT, which generates the CPE, is placed after the first encoder block of each stage.
- The simplest form of PEG is used, i.e., a 2D depth-wise convolution without batch normalization.
- For image-level classification, following CPVT, the class token is removed and global average pooling (GAP) is used at the end of the stage.
- (Please feel free to read PVT and CPVT if interested.)
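To make the PEG idea concrete, here is a minimal NumPy sketch (not the authors' implementation; all names and shapes are illustrative): a 3×3 depth-wise convolution over the token grid whose output is added back as a residual, so the positional encoding is conditioned on the input itself.

```python
import numpy as np

def peg(tokens, H, W, weight):
    """Sketch of CPVT's position encoding generator (PEG): a 3x3
    depth-wise convolution over the token grid, added back as a
    residual, so the encoding is conditioned on the input itself.
    tokens: (N, C) with N = H*W; weight: (C, 3, 3) per-channel kernels.
    (Illustrative only -- the real PEG is a learned layer.)"""
    C = tokens.shape[1]
    grid = tokens.reshape(H, W, C)
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))   # zero-pad the border
    out = np.zeros_like(grid)
    for dy in range(3):
        for dx in range(3):
            # depth-wise: each channel is convolved with its own kernel
            out += padded[dy:dy + H, dx:dx + W] * weight[:, dy, dx]
    return tokens + out.reshape(H * W, C)             # residual connection
```

With an all-zero kernel this reduces to the identity; a learned kernel injects position-dependent offsets while keeping translation equivariance, and it handles any H×W without retraining.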
2.1. Problems in ViT
- Vision Transformers suffer severely from the heavy computational complexity in dense prediction tasks due to high-resolution inputs.
- Given an input of H×W resolution, the complexity of standard self-attention with dimension d is O(H²W²d),
- where H=W=224 is popular in classification.
Here, Twins propose the spatially separable self-attention (SSSA) to alleviate this challenge. SSSA is composed of locally-grouped self-attention (LSA) and global sub-sampled attention (GSA).
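A quick back-of-envelope script (plain Python, illustrative constants, counting only query-key/value interactions and ignoring constant factors and projections) shows how heavy global attention is at feature-map resolution, compared with the windowed and sub-sampled costs derived in the following subsections:

```python
# Back-of-envelope attention-cost comparison (up to constant factors,
# counting only the query-key/value interactions, not the projections).
def global_attn_cost(H, W, d):
    # standard self-attention: every token attends to every token
    return (H * W) ** 2 * d                            # O(H^2 W^2 d)

def lsa_cost(H, W, d, k1, k2):
    # (H*W)/(k1*k2) windows, each costing (k1*k2)^2 * d
    return (H * W // (k1 * k2)) * (k1 * k2) ** 2 * d   # O(k1 k2 H W d)

def gsa_cost(H, W, d, k1, k2):
    # every token attends to the (H*W)/(k1*k2) window representatives
    return H * W * (H * W // (k1 * k2)) * d            # O(H^2 W^2 d / (k1 k2))

# Stage-1 feature map of a 224x224 image: H = W = 56, with k1 = k2 = 7
H = W = 56
d, k1, k2 = 64, 7, 7
print(global_attn_cost(H, W, d))                              # 629407744
print(lsa_cost(H, W, d, k1, k2) + gsa_cost(H, W, d, k1, k2))  # 22679552
```

At this resolution the separable scheme is roughly 28× cheaper than global attention, before any constant-factor effects.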
2.2. Locally-grouped Self-Attention (LSA)
- The 2D feature maps are first equally divided into sub-windows, making self-attention communications only happen within each sub-window.
- To be specific, the feature maps are divided into m×n sub-windows.
- Each group contains HW/(mn) elements, and thus the computation cost of self-attention within one window is O(H²W²d/(m²n²)).
- The total cost over all mn windows is O(H²W²d/(mn)).
- If k1=H/m and k2=W/n, the total cost becomes O(k1k2HWd).
- This is significantly more efficient when k1≪H and k2≪W, and it grows linearly with HW if k1 and k2 are fixed.
- Yet, a mechanism to communicate between different sub-windows is still needed; otherwise, information would only be processed locally.
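As a sketch, LSA can be written in a few lines of NumPy (single head, no learned projections, purely to show the window partition; the real blocks use standard multi-head attention):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lsa(x, H, W, k1, k2):
    """Locally-grouped self-attention sketch: the (H, W, C) token grid is
    split into (H/k1) x (W/k2) sub-windows and attention is computed
    independently inside each window."""
    C = x.shape[-1]
    m, n = H // k1, W // k2
    # (H, W, C) -> (m*n, k1*k2, C): group tokens by sub-window
    g = x.reshape(m, k1, n, k2, C).transpose(0, 2, 1, 3, 4).reshape(m * n, k1 * k2, C)
    attn = softmax(g @ g.transpose(0, 2, 1) / np.sqrt(C))   # per-window scores
    out = attn @ g                                          # per-window mixing
    # invert the window partition back to (H, W, C)
    return out.reshape(m, n, k1, k2, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)
```

For an 8×8 grid with 4×4 windows, `lsa(x, 8, 8, 4, 4)` preserves the (H, W, C) shape while only ever mixing tokens that share a window.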
2.3. Global Sub-sampled Attention (GSA)
- A single representative is used for each of the m×n sub-windows, summarizing the important information of that window.
- These representatives are used to communicate with the other sub-windows (serving as the keys in self-attention), which dramatically reduces the cost to O(mnHWd) = O(H²W²d/(k1k2)).
- This is essentially equivalent to using the sub-sampled feature maps as the key in attention operations.
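A matching NumPy sketch of GSA (single head, no learned projections; average pooling is used here as the sub-sampling function for simplicity, although the paper finds regular strided convolutions work best):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gsa(x, H, W, k1, k2):
    """Global sub-sampled attention sketch: each k1 x k2 sub-window is
    summarized into one representative token (average pooling here),
    and every token attends to the m*n representatives as keys/values."""
    C = x.shape[-1]
    m, n = H // k1, W // k2
    # summarize each window into one key/value token: (m*n, C)
    kv = x.reshape(m, k1, n, k2, C).mean(axis=(1, 3)).reshape(m * n, C)
    q = x.reshape(H * W, C)                  # every token is a query
    attn = softmax(q @ kv.T / np.sqrt(C))    # (H*W, m*n) attention scores
    return (attn @ kv).reshape(H, W, C)      # each token mixes global info
```

This makes the "sub-sampled feature maps as keys" equivalence explicit: the key/value set has only m×n entries, so the score matrix is (HW)×(mn) rather than (HW)×(HW).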
2.4. Spatially Separable Self-Attention (SSSA): Combine LSA and GSA
- If the aforementioned LSA and GSA are used alternately, like separable convolutions (depth-wise + point-wise), the total cost is O(H²W²d/(k1k2) + k1k2HWd).
- The minimum is obtained when k1k2 = √(HW).
- k1=k2=15 is close to the global minimum for H=W=224.
- For stage 1, which has 56×56 feature maps, k1=k2=7 is close to the minimum.
- For stages with lower resolutions, the summarizing window size of GSA is controlled to avoid generating too few keys. Specifically, sizes of 4, 2 and 1 are used for the last three stages, respectively.
- As for the sub-sampling function, several options are investigated, including average pooling, depthwise strided convolutions, and regular strided convolutions. Empirical results show that regular strided convolutions perform best here.
- The overall SSSA, which combines LSA and GSA, is:
ẑ^l_ij = LSA(LayerNorm(z^(l−1)_ij)) + z^(l−1)_ij,
z^l_ij = FFN(LayerNorm(ẑ^l_ij)) + ẑ^l_ij,
ẑ^(l+1) = GSA(LayerNorm(z^l)) + z^l,
z^(l+1) = FFN(LayerNorm(ẑ^(l+1))) + ẑ^(l+1),
- where FFN is a feed-forward network, l indexes the block, and i, j index the sub-windows used by LSA.
- Both LSA and GSA have multiple heads as in the standard self-attention.
- The PEG of CPVT is also used; it is inserted after the first block in each stage.
- There are S, B and L variants.
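Putting the pieces together, one SSSA unit alternates an LSA sub-block and a GSA sub-block, each in the usual pre-norm residual form. A schematic NumPy sketch (placeholder FFN and un-learned LayerNorm; the attention operations are passed in as callables):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize over the channel dimension (no learned scale/shift here)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x):
    # placeholder standing in for the two-layer feed-forward network
    return np.maximum(x, 0)

def sssa_block(z, lsa_fn, gsa_fn):
    """One SSSA unit: an LSA sub-block then a GSA sub-block, each with
    pre-norm and a residual connection, plus an FFN after each."""
    z = z + lsa_fn(layer_norm(z))   # local, within-window mixing
    z = z + ffn(layer_norm(z))
    z = z + gsa_fn(layer_norm(z))   # global, sub-sampled mixing
    z = z + ffn(layer_norm(z))
    return z
```

The depth-wise/point-wise analogy is visible in the structure: the local pass mixes within windows, the global pass mixes across their summaries, and stacking the two covers the whole grid.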
Compared with the Swin Transformer, whose cyclic shift operation is memory-unfriendly and rarely supported by popular inference frameworks, Twins models do not require such an operation and only involve matrix multiplications.
- For example, when Twins-SVT-S is converted from PyTorch to TensorRT, its throughput is boosted by 1.7×.
3. SOTA Comparisons
3.1. Classification on ImageNet-1K
- Twins-PCPVT-S outperforms PVT-small by 1.4% and obtains similar result as Swin-T with 18% fewer FLOPs.
- Twins-SVT-S is better than Swin-T with about 35% fewer FLOPs.
- Other models demonstrate similar advantages.
- Twins-PCPVT performs on par with the recent state-of-the-art Swin, which is based on much more sophisticated designs as mentioned above.
- Moreover, Twins-SVT also achieves similar or better results, compared to Swin, indicating that the spatial separable-like design is an effective and promising paradigm.
3.2. Semantic Segmentation on ADE20K
- Semantic FPN framework is used.
- With comparable FLOPs, Twins-PCPVT-S outperforms PVT-Small by a large margin (+4.5% mIoU) and surpasses ResNet-50 by 7.6% mIoU. It also outperforms Swin-T by a clear margin.
- Besides, Twins-PCPVT-B also achieves 3.3% higher mIoU than PVT-Medium, and Twins-PCPVT-L surpasses PVT-Large with 4.3% higher mIoU.
- Twins-SVT-S achieves better performance (+1.7%) than Swin-T. Twins-SVT-B obtains comparable performance with Swin-S and Twins-SVT-L outperforms Swin-B by 0.7% mIoU where Swin uses UPerNet framework.
- Twins-SVT-S outperforms Swin-T by 1.3% mIoU. Moreover, Twins-SVT-L achieves a new state-of-the-art result of 50.2% mIoU under comparable FLOPs, outperforming Swin-B by 0.5% mIoU.
- Twins-PCPVT also achieves comparable performance to Swin.
3.3. Object Detection and Segmentation on COCO
- For 1× schedule object detection with RetinaNet, Twins-PCPVT-S surpasses PVT-Small by 2.6% mAP and Twins-PCPVT-B exceeds PVT-Medium by 2.4% mAP on the COCO val2017 split.
- Twins-SVT-S outperforms Swin-T by 1.5% mAP while using 12% fewer FLOPs.
- Twins also outperforms the others with similar advantages in the 3× experiments.
- For 1× object segmentation with the Mask R-CNN framework, Twins-PCPVT-S brings similar improvements (+2.5% mAP) over PVT-Small.
- Compared with PVT-Medium, Twins-PCPVT-B obtains 2.6% higher mAP, which is also on par with that of Swin.
- Both Twins-SVT-S and Twins-SVT-B achieve better or slightly better performance compared to the counterparts of Swin.
4. Ablation Study
4.1. Configurations of LSA and GSA blocks
- The models with only LSA fail to obtain good performance (76.9%).
- An extra global attention layer in the last stage can improve the classification performance by 3.6%.
- Local-Local-Global (abbr. LLG) also achieves good performance (81.5%).
4.2. Sub-Sampling Functions
- Regular strided convolutions perform best and are therefore chosen as the default implementation.
4.3. Positional Encodings