Brief Review — Fast Vision Transformer via Additive Attention

Applying Fastformer to the Vision Transformer

Sik-Ho Tsang
3 min read · Oct 6, 2024

Fast Vision Transformer via Additive Attention
Fast Vision Transformer (FViT), by Shenzhen University and Xidian University
2024 CAI (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] [OpenCLIP] 2024 [FasterViT] [CAS-ViT] [TinySaver]
==== My Other Paper Readings Are Also Over Here ====

Outline

  1. Fast Vision Transformer (FViT)
  2. Results

1. Fast Vision Transformer (FViT)

1.1. Vision Transformer (ViT)

  • The multi-head self-attention module in ViT:

Eqs. (1) and (2) lead to quadratic complexity with respect to the number of tokens.
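For reference, here is a minimal sketch (not the paper's code) of standard scaled dot-product self-attention as used in ViT, showing where the N×N score matrix, and hence the quadratic cost, comes from:

```python
# Minimal sketch of standard self-attention; variable names are illustrative only.
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    # Q, K, V: (batch, N, d) with N = number of patch tokens.
    d = Q.size(-1)
    # The N x N score matrix is the source of O(N^2) time and memory.
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (batch, N, N)
    attn = F.softmax(scores, dim=-1)
    return attn @ V                               # (batch, N, d)

x = torch.randn(2, 196, 768)    # e.g., 14x14 = 196 patches, hidden size 768
out = self_attention(x, x, x)   # Q = K = V = x here, just for illustration
print(out.shape)                # torch.Size([2, 196, 768])
```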

1.2. Fastformer

  • (Please skip this part if you know Fastformer well.)
Additive Attention Module and Fastformer
  • Instead of full self-attention over the Q, K, and V matrices of the Vision Transformer, an additive attention module is first applied to summarize the query matrix into a global query vector q using attention weights α:
  • Then an element-wise product between the global query vector and each key vector integrates them into a global context-aware key matrix.
  • Similarly, for computational efficiency, the i-th key vector’s additive attention weight is calculated and the weighted vectors are summed into a global key vector:
  • An element-wise product between the global key vector and each value vector is then calculated.
  • A linear transformation layer is then applied to each key-value interaction vector to learn its hidden representation, which is further added to the query matrix to form the final output of the model (see the sketch below).

Linear complexity with respect to the number of tokens is achieved.
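To make the steps above concrete, below is a minimal, hedged single-head sketch of Fastformer-style additive attention in PyTorch. The module and variable names (AdditiveAttention, w_q, w_k, proj) are my own and are not taken from the released code:

```python
# Hedged sketch of Fastformer-style additive attention (single head),
# following the description above; every operation is linear in N.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.w_q = nn.Linear(dim, 1)     # additive attention weights over queries
        self.w_k = nn.Linear(dim, 1)     # additive attention weights over keys
        self.proj = nn.Linear(dim, dim)  # transform of key-value interactions
        self.scale = dim ** -0.5

    def forward(self, x):                # x: (batch, N, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # 1) Summarize the query matrix into one global query vector.
        alpha = F.softmax(self.w_q(q) * self.scale, dim=1)   # (batch, N, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)      # (batch, 1, dim)

        # 2) Element-wise product injects the global query into every key.
        p = global_q * k                                     # (batch, N, dim)

        # 3) Summarize the context-aware keys into one global key vector.
        beta = F.softmax(self.w_k(p) * self.scale, dim=1)    # (batch, N, 1)
        global_k = (beta * p).sum(dim=1, keepdim=True)       # (batch, 1, dim)

        # 4) Element-wise product with each value vector, a linear transform,
        #    then addition with the query matrix to form the output.
        u = global_k * v                                     # (batch, N, dim)
        return self.proj(u) + q                              # (batch, N, dim)

x = torch.randn(2, 196, 768)
print(AdditiveAttention(768)(x).shape)   # torch.Size([2, 196, 768])
```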

2. Results

  • Fastformer (B/16 and B/32 variants) is compared with the ViT B/16 and B/32 variants. Both B/16 and B/32 have a hidden dimension of 768 and an MLP dimension of 3072.
  • The number of heads is 12 and the depth is set to 12.
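To keep the compared variants straight, here is a small configuration sketch assembled from the bullets above; the dataclass and field names are my own assumptions, not from the authors' code:

```python
# Hedged summary of the compared model configurations (names are illustrative).
from dataclasses import dataclass

@dataclass
class VariantConfig:
    patch_size: int       # 16 for B/16, 32 for B/32
    hidden_dim: int = 768
    mlp_dim: int = 3072
    num_heads: int = 12
    depth: int = 12

b16 = VariantConfig(patch_size=16)  # used for both ViT-B/16 and Fastformer-B/16
b32 = VariantConfig(patch_size=32)  # used for both ViT-B/32 and Fastformer-B/32
```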
ImageNet Results

In the B/16 variants, ViT-B/16 achieves 77% Top-1 accuracy, which is better than Fastformer-B/16’s 63%. However, Fastformer-B/16 has fewer parameters (79M) than ViT-B/16 (86M).

  • Fastformer-B/16 also has a lower computational cost, at 45.2 GFLOPs versus 49.3 GFLOPs for ViT-B/16.

In the B/32 variants, ViT-B/32 achieves 73% Top-1 accuracy, while Fastformer-B/32 reaches 65%. However, Fastformer-B/32 has fewer parameters (81M) than ViT-B/32 (88M).

  • Fastformer-B/32’s computational cost of 11.6 GFLOPs is also lower than ViT-B/32’s 12.6 GFLOPs.
  • (Although the authors state that the proposed method obtains comparable performance to ViT with fewer FLOPs, the accuracy drop appears to be substantial.)

