Review — Rethinking Spatial Dimensions of Vision Transformers

PiT, Introduces New Pooling Layer for Vision Transformer (ViT)

Sik-Ho Tsang
Apr 27, 2023

Rethinking Spatial Dimensions of Vision Transformers,
PiT, by NAVER AI Lab, and Sogang University,
2021 ICCV, Over 300 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [MetaFormer, PoolFormer] [Swin Transformer V2] [hMLP] [DeiT III] 2023 [Vision Permutator (ViP)]

  • The role of spatial dimension conversion in ViT is investigated, and Pooling-based Vision Transformer (PiT) is proposed, which introduces a new pooling layer into ViT.

Outline

  1. Pooling-based Vision Transformer (PiT)
  2. Results

1. Pooling-based Vision Transformer (PiT)

1.1. Analysis of ResNet50 and ViT

Pooling Layers are Introduced in PiT
  • ResNet reduces the spatial resolution along the architecture.
  • ViT does not reduce the spatial resolution along the architecture.

PiT is proposed, by investigating the role of spatial resolution reduction.

Analysis of ResNet50 and ViT

ResNet-style dimensions (spatial size shrinking while channels widen with depth) increase the capability of the architecture.
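To make the contrast concrete, here is a small shape trace for a 224×224 input. It is only an illustrative sketch: the numbers are the standard ResNet50 and ViT-B/16 configurations as I understand them, not values taken from the figure above, and the helper names `resnet50_stages` and `vit_stages` are made up for this example.

```python
# Illustrative shape trace; helper names are hypothetical, sizes assume 224x224 input.

def resnet50_stages(image_size=224):
    """Return (spatial_size, channels) at the output of each ResNet50 stage."""
    size = image_size // 4      # stride-2 stem conv followed by stride-2 max-pool
    channels = 256
    stages = []
    for _ in range(4):
        stages.append((size, channels))
        size //= 2              # each later stage halves the resolution...
        channels *= 2           # ...and doubles the channel width
    return stages


def vit_stages(image_size=224, patch_size=16, embed_dim=768, depth=12):
    """Return (spatial_size, channels) after each ViT block: nothing changes."""
    size = image_size // patch_size
    return [(size, embed_dim) for _ in range(depth)]


print("ResNet50:", resnet50_stages())
print("ViT-B/16:", list(dict.fromkeys(vit_stages())), "for all 12 blocks")
```

ResNet50 moves through four different (size, channels) settings, while ViT keeps a single one from the first block to the last.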

1.2. New Pooling Layer in PiT

Proposed Pooling Layer of PiT

To introduce pooling into ViT, the pooling layer separates the spatial tokens and reshapes them into a 3D tensor with spatial structure.

After reshaping, a depth-wise convolution reduces the spatial size and increases the number of channels.

The responses are then reshaped back into a 2D matrix for computation in the Transformer blocks.
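The three steps above (reshape to 3D, strided depth-wise convolution, reshape back to 2D) can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the 3×3 kernel, stride 2, channel multiplier of 2, and the function names `depthwise_conv_pool` and `pit_pooling` are my own assumptions, and the handling of any non-spatial token (for example a class token) is left out entirely.

```python
import numpy as np


def depthwise_conv_pool(x, weight, stride=2):
    """Strided depth-wise convolution on an (H, W, C) feature map.

    weight has shape (k, k, C, m): each input channel has its own k x k
    filters and yields m output channels, so the output has C * m channels.
    """
    k, _, c, m = weight.shape
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    out = np.zeros((out_h, out_w, c, m))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # Sum over the spatial window only; input channels are never mixed.
            out[i, j] = np.einsum("abc,abcm->cm", patch, weight)
    return out.reshape(out_h, out_w, c * m)


def pit_pooling(tokens, height, width, weight, stride=2):
    """tokens: (H*W, C) spatial tokens only -> (H'*W', C*m) tokens."""
    n, c = tokens.shape
    assert n == height * width, "token count must match the spatial grid"
    fmap = tokens.reshape(height, width, c)              # 2D matrix -> 3D tensor
    pooled = depthwise_conv_pool(fmap, weight, stride)   # shrink space, widen channels
    return pooled.reshape(-1, pooled.shape[-1])          # 3D tensor -> 2D matrix


rng = np.random.default_rng(0)
tokens = rng.standard_normal((8 * 8, 16))                # 64 tokens, 16 channels
weight = rng.standard_normal((3, 3, 16, 2)) * 0.1        # assumed kernel 3, multiplier 2
print(pit_pooling(tokens, 8, 8, weight).shape)
```

In a deep learning framework the same operation would normally be a grouped 2D convolution with the number of groups equal to the number of input channels; the explicit loop here is only for readability.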

Analysis of ViT and PiT
  • With ViT, validation accuracy does not improve even when training accuracy increases.

PiT alleviates this.

1.3. Attention Analysis

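  • The attention entropy is defined as:

        \mathrm{Entropy} = -\frac{1}{N} \sum_{i}^{N} \sum_{j} \alpha_{i,j} \log \alpha_{i,j}

  • where α_{i,j} is the (i, j) component of the attention matrix A, and the sum is averaged over the N tokens.
  • The entropy shows the spread and concentration degree of an attention interaction. A small entropy indicates a concentrated interaction, and a large entropy indicates a spread interaction.
  • An attention distance is also measured:

        \mathrm{Distance} = \frac{1}{N} \sum_{i}^{N} \sum_{j} \alpha_{i,j} \, \lVert p_i - p_j \rVert_1

  • where p_i represents the relative spatial location of the i-th token in the feature map F.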
  • The attention distance shows a relative ratio compared to the overall feature size, which enables comparison between the different sizes of features.
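Both measurements are easy to try out on toy attention matrices. The sketch below assumes that the entropy is averaged over query tokens, that the distance uses the L1 norm between relative locations (x / W, y / H), and that only spatial tokens are involved; these details, like the function names, are my assumptions wherever the text above does not spell them out.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean entropy over query rows; attn is (N, N) and each row sums to 1."""
    return float(-(attn * np.log(attn + eps)).sum(axis=1).mean())


def attention_distance(attn, height, width):
    """Attention-weighted mean L1 distance between relative token locations."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Relative locations (x / W, y / H), so feature maps of different sizes are comparable.
    pos = np.stack([xs.ravel() / width, ys.ravel() / height], axis=1)   # (N, 2)
    dist = np.abs(pos[:, None, :] - pos[None, :, :]).sum(axis=-1)       # (N, N)
    return float((attn * dist).sum(axis=1).mean())


h = w = 4
n = h * w
uniform = np.full((n, n), 1.0 / n)   # fully spread interaction
identity = np.eye(n)                 # fully concentrated interaction

print("entropy :", attention_entropy(uniform), attention_entropy(identity))
print("distance:", attention_distance(uniform, h, w), attention_distance(identity, h, w))
```

The uniform matrix gives the largest entropy (log N) and a non-zero distance, while the identity matrix, where every token attends only to itself, gives an entropy of about zero and a distance of exactly zero. That matches the "spread" versus "concentrated" reading described above.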
Attention Analysis
  • The entropy and distance pattern of ViT is similar to the pattern of Transformer in the language domain [35].
  • PiT changes these patterns through its spatial dimension setting. In the shallow layers (layers 1–2), the large spatial size increases the entropy and distance. In the deep layers (layers 9–11), by contrast, the small spatial size decreases them.

In short, the pooling layer of PiT spreads the interaction in the shallow layers and concentrates the interaction in the deep layers.

1.4. Model Variants

PiT Model Variants

Four scales of PiT (tiny, extra small, small, and base) are designed.

2. Results

2.1. ImageNet

Comparison with ViT on ImageNet

PiT has fewer FLOPs and runs faster than ViT, yet it still shows higher performance.

ImageNet Performance

Table 3: At the PiT-B scale, the Transformer-based architectures (ViT-B, PiT-B) outperform the convolutional architectures.

  • Even at the PiT-S scale, PiT-S outperforms the convolutional architecture ResNet50 in performance, or outperforms EfficientNet-b3 in throughput.
  • However, PiT is weaker at the smaller scales.

Table 4: PiT models show performance comparable to ViT models under the long training scheme. In the large-resolution setting, PiT performs comparably to ViT but has lower throughput.

Table 6: PiT performs better than ViT on all robustness benchmarks, even though the two show comparable performance on the standard ImageNet benchmark (80.8 vs. 79.8).

2.2. MS COCO

Object Detection

Although the PiT detector cannot beat the ResNet50 detector in performance, it has better latency, and its improvement over ViT-S is significant.
