Review — GC ViT: Global Context Vision Transformer
- Global context self-attention modules are used jointly with standard local self-attention to model both long- and short-range spatial interactions effectively and efficiently.
1. GC ViT
1.1. Overall Architecture
- The overall architecture is as above.
- GC ViT has 4 stages, with a downsampler between consecutive stages.
- Within each stage, the local MSA and the global MSA are treated as one module, which is repeated multiple times.
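The stage structure described above can be sketched as a flat layer layout. This is only an illustration: the function name `build_layout` and the repeat counts `[1, 2, 3, 1]` are hypothetical and are not taken from the paper.

```python
def build_layout(pairs_per_stage):
    """List the blocks of a staged model in execution order.

    Each stage repeats a (local MSA, global MSA) pair; a downsampler
    sits between consecutive stages, as described in the review.
    """
    layout = []
    last_stage = len(pairs_per_stage) - 1
    for stage, n_pairs in enumerate(pairs_per_stage):
        for _ in range(n_pairs):
            layout.append(f"stage{stage + 1}/local_msa")
            layout.append(f"stage{stage + 1}/global_msa")
        if stage < last_stage:
            layout.append("downsample")  # only between stages
    return layout


# Hypothetical repeat counts, chosen only for illustration.
layout = build_layout([1, 2, 3, 1])
for name in layout:
    print(name)
```

With 4 stages there are 3 downsamplers, and local and global attention always alternate within a stage.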
1.2. Downsampler
- For downsampling between stages, a modified Fused-MBConv block followed by a max pooling layer with a kernel size of 3 and a stride of 2 is used. In this block, SE, GELU and DW-Conv3×3 denote the Squeeze-and-Excitation block from SENet, the Gaussian Error Linear Unit, and a 3×3 depth-wise convolution, respectively.
The block provides desirable properties such as inductive bias and modeling of inter-channel dependencies.
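To make "modeling of inter-channel dependencies" concrete, here is a minimal sketch of a Squeeze-and-Excitation step in plain Python. It follows the standard SENet recipe (global average pool, two fully connected layers, sigmoid gate) and is an assumption about the general mechanism, not the exact block used in GC ViT. The weights `w1` and `w2` and the ReLU in the hidden layer are illustrative choices.

```python
import math


def squeeze_excite(channels, w1, w2):
    """Standard SE sketch. channels: list of C feature maps (lists of rows).

    w1: hidden x C weight matrix, w2: C x hidden weight matrix (no biases).
    Returns the rescaled feature maps and the per-channel gates.
    """
    # Squeeze: one scalar per channel via global average pooling.
    z = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0])) for fm in channels]
    # Excitation: FC -> ReLU -> FC -> sigmoid, mixing information across channels.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: every value in a channel is multiplied by that channel's gate.
    scaled = [[[v * g for v in row] for row in fm] for fm, g in zip(channels, gates)]
    return scaled, gates


# Two 2x2 channels with hypothetical weights.
channels = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[1.0, 1.0]]      # 2 channels -> 1 hidden unit
w2 = [[0.0], [1.0]]    # 1 hidden unit -> 2 gates
scaled, gates = squeeze_excite(channels, w1, w2)
print(gates)
```

The gate of each channel depends on the pooled statistics of all channels, which is the inter-channel dependency the block models.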
1.3. Local Attention & Global Attention Concepts
- Local attention (Left): computed only on feature patches within a local window.
- Global attention (Right): global features are extracted from the entire input feature map and then repeated to form global query tokens.
The global query interacts with local key and value tokens, which allows long-range information to be captured through cross-region interaction.
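The difference between the two schemes can be sketched with token coordinates only. The helper `window_partition`, the 4×4 map, the window size of 2 and the token names `g0` to `g3` are all hypothetical.

```python
def window_partition(height, width, window):
    """Group the (row, col) token coordinates of a feature map into
    non-overlapping window x window regions, keyed by window index."""
    assert height % window == 0 and width % window == 0
    windows = {}
    for r in range(height):
        for c in range(width):
            windows.setdefault((r // window, c // window), []).append((r, c))
    return windows


wins = window_partition(4, 4, 2)

# Local attention: queries, keys and values all come from the same window.
print(wins[(0, 0)])  # [(0, 0), (0, 1), (1, 0), (1, 1)]

# Global attention: one set of query tokens, derived from the whole map,
# is repeated for every window and paired with that window's keys/values.
global_q = ["g0", "g1", "g2", "g3"]
queries_per_window = {w: global_q for w in wins}
print(queries_per_window[(1, 1)])
```

Every window sees the same global queries, while its keys and values stay local.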
1.4. Global Query Generator
- The global query generator consists of a Fused-MBConv block followed by a max pooling layer. It produces the final global query q_{g,i} at each stage i (i from 1 to 4) of GC ViT.
These query tokens are computed once per stage and shared across all global attention blocks of that stage, which reduces the number of parameters and FLOPs and improves generalizability.
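A rough sketch of the pooling part of such a generator follows. Several things here are assumptions rather than details from the source: the convolutional Fused-MBConv block is omitted, the padding of 1 is a guess, and the loop stops once the map has shrunk to a hypothetical window size. The names `max_pool2d` and `global_query_tokens` are illustrative.

```python
def max_pool2d(grid, kernel=3, stride=2, padding=1):
    """Max pooling over a 2D list; out-of-range positions are ignored."""
    h, w = len(grid), len(grid[0])
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r0, c0 = i * stride - padding, j * stride - padding
            vals = [grid[r][c]
                    for r in range(max(r0, 0), min(r0 + kernel, h))
                    for c in range(max(c0, 0), min(c0 + kernel, w))]
            row.append(max(vals))
        out.append(row)
    return out


def global_query_tokens(feature_map, window):
    """Pool the whole feature map down to window x window tokens.
    Assumption: the conv block before each pooling step is left out."""
    x = feature_map
    while len(x) > window:
        x = max_pool2d(x)
    return x


# Hypothetical 8x8 single-channel map and a window size of 2.
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]
tokens = global_query_tokens(feature_map, window=2)  # computed once per stage

# The same tokens are then reused by every global attention block in the stage.
for block in range(3):
    print(f"global block {block} uses tokens {tokens}")
```

Computing `tokens` outside the block loop mirrors the sharing described above: the generator runs once, and no block needs its own query parameters.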
1.5. Global Self-Attention
- Local Self-Attention (Left): can only query patches within a local window, as mentioned.
- Global Self-Attention (Right): can query different image regions while still operating within the window. The key and value are computed within each local window using a linear layer, while the query comes from the shared global query tokens.
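The interaction between a global query and local keys and values can be sketched with plain scaled dot-product attention. This is a simplified, single-head illustration under stated assumptions: the linear projections, multiple heads and any positional bias are left out, and all names and numbers are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def global_window_attention(global_q, windows_k, windows_v):
    """The same global query tokens attend to each window's own keys/values."""
    return [attention(global_q, k, v) for k, v in zip(windows_k, windows_v)]


# Two windows, two tokens each; keys are 2-d, values are 1-d for brevity.
global_q = [[0.0, 0.0], [1.0, 0.0]]
windows_k = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
windows_v = [[[2.0], [4.0]], [[10.0], [20.0]]]
out = global_window_attention(global_q, windows_k, windows_v)
print(out)
```

Local self-attention would instead call `attention` with queries taken from the same window as the keys and values. Only the source of the query changes.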
1.6. GC ViT Variants
- GC ViT-XT, GC ViT-T, GC ViT-S and GC ViT-B denote XTiny, Tiny, Small and Base variants, respectively.
2. Results
2.1. Image Classification
- In ImageNet-1K top-1 accuracy, the proposed GC ViT surpasses similar-sized counterpart models by +0.5% for GC ViT-XT (82.0%) compared to T2T-ViT-14 (81.5%), +0.7% for GC ViT-T (83.4%) over CSWin-T (82.7%), +0.3% for GC ViT-S (83.9%) over CSWin-S (83.6%), +0.2% for GC ViT-B (84.4%) compared to CSWin-B (84.2%) and +0.1% for GC ViT-L (84.6%) over CoAtNet-3 (84.5%).
GC ViT models have better or comparable computational efficiency, in terms of the number of FLOPs, than the competing counterpart models.
2.2. Object Detection & Instance Segmentation
- Using a Mask R-CNN head, the model with pre-trained GC ViT-T (47.9/43.2) backbone outperforms counterparts with pre-trained ConvNeXt-T (46.2/41.7) by +1.7 and +1.5 and Swin-T (46.0/41.6) by +1.9 and +1.6 in terms of box AP and mask AP, respectively.
- Using a Cascade R-CNN head, the models with pre-trained GC ViT-T (51.6/44.6) and GC ViT-S (52.4/45.4) backbones outperform ConvNeXt-T (50.4/43.7) by +1.2 and +0.9 and ConvNeXt-S (51.9/45.0) by +0.5 and +0.4 in terms of box AP and mask AP, respectively.
- Furthermore, the model with GC ViT-B (52.9/45.8) backbone outperforms the counterpart with ConvNeXt-B (52.7/45.6) by +0.2 and +0.2 in terms of box AP and mask AP, respectively.
2.3. Semantic Segmentation
GC ViT backbones significantly outperform counterparts with Swin Transformer backbones, hence demonstrating the effectiveness of the global self-attention.
2.4. Ablation Studies
Each component is essential to form GC ViT.
The modified Fused-MBConv block combined with strided convolution shows the best result.
- Instead of a global query, two other cases are also tried: (1) global key and value features interacting with a local query, and (2) global value features interacting with a local query and key.
The proposed global query interacting with local key and value is the best.
Both short and long-range spatial dependencies are captured effectively.
[2022 arXiv v3] [GC ViT]
Global Context Vision Transformer