Brief Review — Fast Vision Transformer via Additive Attention
Applying Fastformer to the Vision Transformer
Fast Vision Transformer via Additive Attention,
Fast Vision Transformer (FViT), by Shenzhen University and Xidian University
2024 CAI (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2023 [Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] [OpenCLIP] 2024 [FasterViT] [CAS-ViT] [TinySaver]
==== My Other Paper Readings Are Also Over Here ====
- Last week, Fastformer was reviewed.
- This time, the additive attention module from Fastformer is applied to the Vision Transformer (ViT), yielding the Fast Vision Transformer (FViT).
Outline
- Fast Vision Transformer (FViT)
- Results
1. Fast Vision Transformer (FViT)
1.1. Vision Transformer (ViT)
- The multi-head self-attention module in ViT compares every token with every other token via scaled dot-product attention, as given by Eqs. (1) and (2) of the paper.
Eqs. (1) and (2) lead to quadratic complexity in the number of tokens.
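The equations themselves appear only as an image in the original post. For reference, the standard scaled dot-product self-attention used in ViT, which Eqs. (1) and (2) should correspond to, is:

```latex
% Standard ViT self-attention (a reconstruction, assumed to match
% Eqs. (1) and (2) referenced above).
\[
  A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right),
  \qquad
  \operatorname{Attention}(Q, K, V) = A V .
\]
% With N tokens, A is an N x N matrix, hence O(N^2) time and memory.
```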
1.2. Fastformer
- (Please skip this part if you know Fastformer well.)
- Instead of full self-attention over the Q, K, V matrices of the Vision Transformer, an additive attention module is first applied to condense the query matrix into a global query vector q, using an attention weight α computed for each query vector.
- Then an element-wise product between the global query vector and each key vector integrates them into a global context-aware key matrix.
- Similarly, for computational efficiency, an additive attention weight is computed for each context-aware key vector, and these vectors are pooled into a global key vector.
- An element-wise product between the global key vector and each value vector is then computed.
- A linear transformation layer is applied to each key-value interaction vector to learn its hidden representation, which is then added to the query matrix to form the final output of the module.
Linear complexity with respect to the number of tokens is achieved. A minimal sketch of the whole pipeline is given below.
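The sketch below implements the steps above in PyTorch, assuming a single attention head; the exact projection, multi-head, and normalization details of FViT may differ, so treat it as an illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    """Single-head additive attention in the spirit of Fastformer (a sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.w_q = nn.Linear(dim, 1, bias=False)  # scores each query vector
        self.w_k = nn.Linear(dim, 1, bias=False)  # scores each context-aware key
        self.proj = nn.Linear(dim, dim)           # final linear transformation
        self.scale = dim ** -0.5

    def forward(self, x):                                     # x: (B, N, D)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # 1) pool the query matrix into a global query vector with weights alpha
        alpha = F.softmax(self.w_q(q) * self.scale, dim=1)    # (B, N, 1)
        q_global = (alpha * q).sum(dim=1, keepdim=True)       # (B, 1, D)

        # 2) element-wise product with every key -> context-aware key matrix
        p = q_global * k                                      # (B, N, D)

        # 3) pool the context-aware keys into a global key vector with weights beta
        beta = F.softmax(self.w_k(p) * self.scale, dim=1)     # (B, N, 1)
        k_global = (beta * p).sum(dim=1, keepdim=True)        # (B, 1, D)

        # 4) element-wise product with every value vector
        u = k_global * v                                      # (B, N, D)

        # 5) linear transform of each key-value interaction, residual with queries
        return self.proj(u) + q


tokens = torch.randn(2, 196, 768)      # e.g. 14x14 patch tokens of a B/16 model
out = AdditiveAttention(768)(tokens)
print(out.shape)                       # torch.Size([2, 196, 768])
```

Every pooling over the sequence and every element-wise product costs O(N·D), so the module is linear in the number of tokens N, in contrast to the O(N²) attention matrix of ViT.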
2. Results
- Fastformer (B/16 and B/32 variants) is compared against ViT-B/16 and ViT-B/32. Both B/16 and B/32 have a hidden dimension of 768 and an MLP dimension of 3072.
- The number of heads is 12 and the depth is set to 12 (summarized in the sketch below).
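For concreteness, the shared backbone configuration can be written as a small hypothetical config; the field names and the 224×224 input assumption are mine, not the paper's.

```python
# Hypothetical configuration of the B/16 and B/32 backbones compared above.
base_config = dict(
    hidden_dim=768,   # token embedding size
    mlp_dim=3072,     # hidden size of the MLP block
    num_heads=12,     # attention heads per layer
    depth=12,         # number of Transformer layers
)
# Only the patch size differs; token counts assume a 224x224 input (an assumption).
b16 = dict(base_config, patch_size=16)   # (224 / 16)**2 = 196 tokens
b32 = dict(base_config, patch_size=32)   # (224 / 32)**2 = 49 tokens
```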
In the B/16 variant, ViT-B/16 achieves 77% Top-1 accuracy, which is better than Fastformer-B/16 at 63%. However, Fastformer-B/16 has fewer parameters (79M) than ViT-B/16 (86M).
- Fastformer-B/16 also has a lower computational cost, at 45.2 GFLOPs versus 49.3 GFLOPs for ViT-B/16.
In the B/32 variant, ViT-B/32 achieves 73% Top-1 accuracy, while Fastformer-B/32 reaches 65%. However, Fastformer-B/32 has fewer parameters (81M) than ViT-B/32 (88M).
- The computational cost of Fastformer-B/32, at 11.6 GFLOPs, is also lower than that of ViT-B/32 at 12.6 GFLOPs.
- (Though the authors describe the performance as comparable to ViT at fewer FLOPs, the accuracy drop appears substantial; the rough trade-off implied by the numbers above is worked out below.)
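As a quick sanity check on that remark, the relative savings versus the accuracy gap, computed from the figures above, are roughly:

```latex
% Relative savings of Fastformer-B over ViT-B, from the numbers quoted above.
\[
  \text{B/16: } \frac{49.3 - 45.2}{49.3} \approx 8.3\% \text{ fewer FLOPs},\quad
  \frac{86 - 79}{86} \approx 8.1\% \text{ fewer parameters},\quad
  77\% - 63\% = 14 \text{ points lower Top-1}.
\]
\[
  \text{B/32: } \frac{12.6 - 11.6}{12.6} \approx 7.9\% \text{ fewer FLOPs},\quad
  \frac{88 - 81}{88} \approx 8.0\% \text{ fewer parameters},\quad
  73\% - 65\% = 8 \text{ points lower Top-1}.
\]
```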