Review — GC ViT: Global Context Vision Transformer

GC ViT, Global Query Interacts with Local Key & Value

Sik-Ho Tsang
5 min read · Jan 27, 2023
Top-1 accuracy vs. model FLOPs/parameter size on ImageNet-1K dataset.

Global Context Vision Transformer,
GC ViT, by NVIDIA,
2022 arXiv v3, Over 5 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT

  • Global context self-attention modules, used jointly with standard local self-attention, effectively yet efficiently model both long- and short-range spatial interactions.

Outline

  1. GC ViT
  2. Results

1. GC ViT

Architecture of the proposed GC ViT.

1.1. Overall Architecture

  • The overall architecture is as above.
  • GC ViT has 4 stages, with a downsampler between consecutive stages.
  • Within each stage, the local MSA and global MSA are treated as one module and repeated multiple times (a sketch of one stage follows this list).
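Below is a minimal PyTorch-style sketch of how one stage could be organized, assuming hypothetical local_block, global_block, query_gen and downsampler modules; the names and layout are illustrative, not the official implementation.

```python
import torch.nn as nn

class GCViTStage(nn.Module):
    # Minimal sketch of one GC ViT stage (assumed structure, not the official code).
    # Local and global attention blocks alternate, the global query is computed once
    # per stage, and a downsampler reduces resolution at the end of the stage.
    def __init__(self, local_block, global_block, query_gen, downsampler, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            [local_block() if i % 2 == 0 else global_block() for i in range(depth)])
        self.query_gen = query_gen        # shared global query generator (Section 1.4)
        self.downsampler = downsampler    # dimension-reduction block (Section 1.2)

    def forward(self, x):
        q_global = self.query_gen(x)      # computed once, reused by every global block
        for i, blk in enumerate(self.blocks):
            # even blocks: local window MSA; odd blocks: global MSA with the shared query
            x = blk(x) if i % 2 == 0 else blk(x, q_global)
        return self.downsampler(x)
```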

1.2. Downsampler

Downsampling block for dimension reduction.
  • A modified Fused-MBConv block, followed by a max pooling layer with a kernel size of 3 and stride of 2, is used, where SE, GELU and DW-Conv3×3 denote Squeeze and Excitation block in SENet, Gaussian Error Linear Unit (GELU), and 3×3 depth-wise convolution, respectively.

It provides desirable properties such as inductive bias and modeling of inter-channel dependencies.
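A minimal PyTorch sketch of the described downsampling block follows; the simplified Fused-MBConv stand-in and the 1×1 convolution used here for the channel increase are assumptions, only the DW-Conv3×3 / GELU / SE / max-pooling structure comes from the description above.

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    # SENet-style channel attention (squeeze-and-excitation).
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1), nn.GELU(),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class Downsampler(nn.Module):
    # Modified Fused-MBConv block (DW-Conv3x3 -> GELU -> SE -> 1x1 conv, residual)
    # followed by 3x3 max pooling with stride 2 for spatial reduction.
    def __init__(self, dim, dim_out):
        super().__init__()
        self.fused_mbconv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # depth-wise 3x3
            nn.GELU(),
            SqueezeExcite(dim),
            nn.Conv2d(dim, dim, 1))                          # point-wise projection
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.expand = nn.Conv2d(dim, dim_out, 1)             # channel increase (assumed 1x1 conv)

    def forward(self, x):                 # x: (B, C, H, W)
        x = x + self.fused_mbconv(x)      # residual Fused-MBConv
        return self.expand(self.pool(x))  # halve resolution, change channels
```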

1.3. Local Attention & Global Attention Concepts

Local Attention & Global Attention
  • Local attention (Left): is computed on feature patches within a local window only.
  • Global attention (Right): global features are extracted from the entire input feature map and then repeated to form global query tokens.

The global query interacts with the local key and value tokens, allowing long-range information to be captured via cross-region interaction.
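This cross-region interaction can be illustrated with a short, hedged sketch: a shared global query attends to the keys and values of every local window. The shapes and the helper name global_query_attention are assumptions for illustration.

```python
import torch

def global_query_attention(q_global, k_local, v_local):
    """Sketch of the cross-region interaction described above (shapes are
    illustrative assumptions, not the official implementation).

    q_global: (B, heads, N, d)            global query tokens, shared per image
    k_local, v_local: (B*W, heads, N, d)  keys/values from each of W local windows
    """
    num_windows = k_local.shape[0] // q_global.shape[0]
    # Repeat the global query so that every local window sees the same query.
    q = q_global.repeat_interleave(num_windows, dim=0)
    attn = (q * q.shape[-1] ** -0.5) @ k_local.transpose(-2, -1)  # cross-region scores
    return attn.softmax(dim=-1) @ v_local                         # aggregate local values

# Toy usage: 4 windows of 49 tokens, 3 heads of dim 32, batch of 2.
q_g = torch.randn(2, 3, 49, 32)
k = v = torch.randn(2 * 4, 3, 49, 32)
out = global_query_attention(q_g, k, v)   # -> (8, 3, 49, 32)
```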

1.4. Global Query Generator

Global query generator schematic diagram.
  • The global query generator consists of a Fused-MBConv block followed by a max pooling layer, applied repeatedly until the features are downsampled to the local window resolution. The final global query qg,i at stage i (i from 1 to 4) of GC ViT is computed from the stage input (a sketch follows below).

These query tokens are computed once at every stage of the model and shared across all global attention blocks, hence decreasing the number of parameters and FLOPs and improving generalizability.
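A hedged sketch of such a generator follows, with a simplified Fused-MBConv stand-in and an assumed num_reductions parameter controlling how many times the spatial resolution is halved:

```python
import torch.nn as nn

class GlobalQueryGenerator(nn.Module):
    # Sketch of the global query generator: repeated (Fused-MBConv + max-pool)
    # blocks downsample the stage input to the local window resolution, and the
    # result is flattened into global query tokens. The simplified Fused-MBConv
    # stand-in and `num_reductions` are assumptions for illustration.
    def __init__(self, dim, num_reductions):
        super().__init__()
        def fused_mbconv(c):
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1, groups=c), nn.GELU(),
                nn.Conv2d(c, c, 1))
        self.reduce = nn.Sequential(*[
            nn.Sequential(fused_mbconv(dim),
                          nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
            for _ in range(num_reductions)])

    def forward(self, x):                      # x: (B, C, H, W) stage input
        q = self.reduce(x)                     # downsample to window resolution
        return q.flatten(2).transpose(1, 2)    # (B, tokens, C) global query tokens
```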

1.5. Global Self-Attention

Local and global attention blocks. Global attention block does not compute query vector and reuses global query computed via Global Token Generation.
  • Local Self-Attention (Left): can only query patches within a local window, as mentioned.
  • Global Self-Attention (Right): can query different image regions while still operating within each local window. The value and key are computed within each local window using a linear layer, and are then combined with the shared global query (a sketch of the block follows below).
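A minimal sketch of a global attention block under these assumptions: it learns no query projection, derives keys and values from each window with one linear layer, and consumes the shared global query (class and argument names are illustrative).

```python
import torch.nn as nn

class GlobalAttention(nn.Module):
    # Sketch of the global attention block: no query projection is learned; the
    # pre-computed global query is reused, while keys and values come from each
    # local window via a single linear layer. Shapes and names are assumptions.
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.kv = nn.Linear(dim, dim * 2)      # only K and V are computed locally
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_windows, q_global):
        # x_windows: (B*W, N, C) window tokens; q_global: (B, N, C) shared query
        Bw, N, C = x_windows.shape
        kv = self.kv(x_windows).reshape(Bw, N, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                    # (Bw, heads, N, d) each
        q = q_global.repeat_interleave(Bw // q_global.shape[0], dim=0)
        q = q.reshape(Bw, N, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        attn = ((q * self.head_dim ** -0.5) @ k.transpose(-2, -1)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bw, N, C)  # merge heads
        return self.proj(out)
```

Dropping the query projection is what lets the same global query, computed once per stage, be reused across every global attention block of that stage.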

1.6. GC ViT Variants

Architecture configurations for GC ViT. LG-SA and Conv denote local/global self-attention and the 3×3 convolutional layer, respectively.
  • GC ViT-XT, GC ViT-T, GC ViT-S and GC ViT-B denote XTiny, Tiny, Small and Base variants, respectively.

2. Results

2.1. ImageNet

Image classification benchmarks on ImageNet-1K dataset.
  • The proposed GC ViT surpasses similar-sized counterpart models by +0.5% for GC ViT-XT (82.0%) compared to T2T-ViT-14 (81.5%), +0.7% for GC ViT-T (83.4%) over CSWin-T (82.7%), +0.3% for GC ViT-S (83.9%) over CSWin-S (83.6%), +0.2% for GC ViT-B (84.4%) compared to CSWin-B (84.2%) and +0.1% for GC ViT-L (84.6%) over CoAtNet-3 (84.5%), respectively.

GC ViT models have better or comparable computational efficiency in terms of the number of FLOPs compared to the competing counterpart models.

2.2. Object Detection & Instance Segmentation

Object detection and instance segmentation benchmarks using Mask R-CNN and Cascade R-CNN on MS COCO dataset.
  • Using a Mask R-CNN head, the model with pre-trained GC ViT-T (47.9/43.2) backbone outperforms counterparts with pre-trained ConvNeXt-T (46.2/41.7) by +1.7 and +1.5 and Swin-T (46.0/41.6) by +1.9 and +1.6 in terms of box AP and mask AP, respectively.
  • Using a Cascade R-CNN head, the models with pre-trained GC ViT-T (51.6/44.6) and GC ViT-S (52.4/45.4) backbones outperform ConvNeXt-T (50.4/43.7) by +1.2 and +0.9 and ConvNeXt-S (51.9/45.0) by +0.5 and +0.4 in terms of box AP and mask AP, respectively.
  • Furthermore, the model with GC ViT-B (52.9/45.8) backbone outperforms the counterpart with ConvNeXt-B (52.7/45.6) by +0.2 and +0.2 in terms of box AP and mask AP, respectively.

2.3. Semantic Segmentation

Semantic segmentation benchmarks ADE20K validation set with UPerNet and pretrained ImageNet-1K backbone.

GC ViT backbones significantly outperform counterparts with Swin Transformer backbones, hence demonstrating the effectiveness of the global self-attention.

2.4. Ablation Studies

Ablation study on the effectiveness of various components in GC ViT on classification, detection and segmentation performance.

Each component is essential to form GC ViT.

Ablation study on the effectiveness of downsampler in GC ViT architecture on ImageNet Top-1 accuracy.

The combination of the modified Fused-MBConv block and strided convolution shows the best result.

Ablation study on the effectiveness of the proposed global query for classification, detection and segmentation.
  • Instead of the global query, two other designs are also tried: (1) global key and value features interacting with the local query; (2) global value features interacting with the local query and key.

The proposed global query interacting with local key and value is the best.

Visualization of: (a) input images, (b) global self-attention maps from the GC ViT-T model, and (c) corresponding Grad-CAM attention maps.

Both short and long-range spatial dependencies are captured effectively.

