Review — Rethinking Spatial Dimensions of Vision Transformers
PiT Introduces a New Pooling Layer into the Vision Transformer (ViT)
Rethinking Spatial Dimensions of Vision Transformers (PiT), by NAVER AI Lab and Sogang University
2021 ICCV, Over 300 Citations (Sik-Ho Tsang @ Medium)
Image Classification
Outline
- Pooling-based Vision Transformer (PiT)
- Results
1. Pooling-based Vision Transformer (PiT)
1.1. Analysis of ResNet50 and ViT
- ResNet reduces the spatial resolution along the architecture.
- ViT does not reduce the spatial resolution along the architecture.
PiT is proposed by investigating the role of spatial resolution reduction in Transformers.
It is found that ResNet-style spatial dimension reduction increases the capability of the architecture.
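The contrast above can be sketched numerically. A minimal sketch, assuming a 224×224 input, a ResNet50-style stride-4 stem followed by four stages, and a ViT with 16×16 patches (the stage count shown for ViT is purely illustrative):

```python
def resnet_style_sizes(inp=224, stages=4):
    # Stem (stride-2 conv + stride-2 maxpool) gives inp/4, then each
    # stage halves the spatial resolution, ResNet50-style.
    s = inp // 4
    return [s // 2 ** i for i in range(stages)]

def vit_style_sizes(inp=224, patch=16, stages=4):
    # ViT keeps the token grid fixed after patch embedding.
    return [inp // patch] * stages

print(resnet_style_sizes())  # [56, 28, 14, 7]
print(vit_style_sizes())     # [14, 14, 14, 14]
```

PiT follows the ResNet-style schedule by inserting pooling layers between Transformer stages, so its token grid shrinks with depth.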
1.2. New Pooling Layer in PiT
To introduce pooling into ViT, the pooling layer separates the spatial tokens and reshapes them into a 3D tensor with spatial structure.
After reshaping, spatial size reduction and channel increase are performed by a depth-wise convolution.
The responses are then reshaped back into a 2D matrix for the computation of the subsequent Transformer blocks.
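A minimal NumPy sketch of these three steps. The function names and the two-kernels-per-channel layout (channel multiplier 2) are my assumptions about the depth-wise convolution; the paper additionally resizes the non-spatial class token with a separate fully-connected layer, omitted here:

```python
import numpy as np

def depthwise_conv(x, kernels, stride=2):
    # x: (c, h, w) feature map; kernels: (c, 2, k, k), i.e. two kernels
    # per input channel, so channels double (c -> 2c) while the stride
    # halves the spatial size -- an assumed layout for PiT's pooling conv.
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh = (h + 2 * pad - k) // stride + 1
    ow = (w + 2 * pad - k) // stride + 1
    out = np.zeros((2 * c, oh, ow))
    for ci in range(c):
        for m in range(2):                      # channel multiplier 2
            for i in range(oh):
                for j in range(ow):
                    patch = xp[ci,
                               i * stride:i * stride + k,
                               j * stride:j * stride + k]
                    out[2 * ci + m, i, j] = (patch * kernels[ci, m]).sum()
    return out

def pit_pool(tokens, h, w, kernels):
    # 1) separate spatial tokens (h*w, c), reshape to a 3D tensor (c, h, w)
    c = tokens.shape[1]
    x = tokens.T.reshape(c, h, w)
    # 2) depth-wise conv: spatial size /2, channels *2
    y = depthwise_conv(x, kernels)
    # 3) reshape back to a 2D matrix for the next Transformer stage
    return y.reshape(y.shape[0], -1).T

tokens = np.random.randn(16, 4)                     # 4x4 grid, 4 channels
pooled = pit_pool(tokens, 4, 4, np.random.randn(4, 2, 3, 3))
print(pooled.shape)                                 # (4, 8): 2x2 grid, 8 channels
```

The explicit loops keep the sketch dependency-free; in practice this is a single strided grouped convolution.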
- ViT's validation accuracy does not improve even as its training accuracy increases, i.e., it generalizes poorly; PiT alleviates this.
1.3. Attention Analysis
- The attention entropy is defined as: H(A)_i = −Σ_j α_(i,j) log α_(i,j), where α_(i,j) is the (i, j) component of the attention matrix A.
- The entropy shows the spread and concentration degree of an attention interaction. A small entropy indicates a concentrated interaction, and a large entropy indicates a spread interaction.
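As a hedged sketch, the row-wise entropy of a softmax-normalized attention matrix can be computed as follows (the epsilon for numerical stability is my addition):

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    # A: attention matrix whose rows sum to 1 (one row per query token).
    # Small entropy = concentrated interaction; large = spread interaction.
    return -np.sum(A * np.log(A + eps), axis=-1)

uniform = np.full((1, 4), 0.25)       # maximally spread attention
onehot = np.eye(4)[:1]                # maximally concentrated attention
print(attention_entropy(uniform))     # ~[1.386] = [log 4], the maximum
print(attention_entropy(onehot))      # ~[0.], the minimum
```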
- An attention distance is also measured: D(A)_i = Σ_j α_(i,j) ‖p_i − p_j‖, where p_i represents the relative spatial location of the i-th token in the feature map F.
- The attention distance shows a relative ratio compared to the overall feature size, which enables comparison between the different sizes of features.
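A sketch of such a relative attention distance: the attention-weighted distance between token grid positions, divided by the feature diagonal so different feature sizes are comparable (the choice of the diagonal as normalizer is my assumption):

```python
import numpy as np

def attention_distance(A, h, w):
    # A: (h*w, h*w) attention matrix over an h x w token grid.
    # p_i is the 2D grid location of token i.
    ys, xs = np.divmod(np.arange(h * w), w)
    pos = np.stack([ys, xs], axis=1).astype(float)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    # Normalize by the feature diagonal -> a relative, size-comparable ratio.
    return (A * dist).sum(axis=-1) / np.hypot(h, w)

identity = np.eye(4)                        # each token attends only to itself
print(attention_distance(identity, 2, 2))   # [0. 0. 0. 0.]
```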
- The entropy and distance pattern of ViT is similar to the pattern of Transformer in the language domain [35].
- PiT changes these patterns through its spatial dimension schedule. At shallow layers (layers 1–2), the large spatial size increases the entropy and distance. Conversely, at deep layers (layers 9–11), the entropy and distance decrease due to the small spatial size.
In short, the pooling layer of PiT spreads the interaction in the shallow layers and concentrates the interaction in the deep layers.
1.4. Model Variants
Four scales of PiT (tiny, extra small, small, and base) are designed.
2. Results
2.1. ImageNet
PiT has fewer FLOPs and runs faster than ViT, yet shows higher performance.
Table 3: At the PiT-B scale, the Transformer-based architectures (ViT-B, PiT-B) outperform the convolutional architectures.
- Even at the PiT-S scale, PiT-S outperforms the convolutional ResNet50 in accuracy and EfficientNet-b3 in throughput.
- However, the Transformer-based architectures are weaker at small scales.
Table 4: With the long training scheme, PiT models show performance comparable to ViT models. In the large-resolution setting, PiT matches ViT in accuracy but is worse in throughput.
Table 6: PiT outperforms ViT on all robustness benchmarks, even though the two show comparable performance on the standard ImageNet benchmark (80.8 vs. 79.8).
2.2. MS COCO
- Deformable DETR is used as the detection framework.
Although the PiT detector does not beat the ResNet50 detector, it has better latency, and its improvement over ViT-S is significant.