Brief Review — Fast Vision Transformer via Additive Attention

Applying Fastformer to the Vision Transformer

Sik-Ho Tsang
3 min read · Oct 6, 2024

Fast Vision Transformer via Additive Attention
Fast Vision Transformer (FViT), by Shenzhen University and Xidian University
2024 CAI (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] [OpenCLIP] 2024 [FasterViT] [CAS-ViT] [TinySaver]
==== My Other Paper Readings Are Also Over Here ====

Outline

  1. Fast Vision Transformer (FViT)
  2. Results

1. Fast Vision Transformer (FViT)

1.1. Vision Transformer (ViT)

  • The multi-head self-attention module in ViT:

Eqs. (1) and (2) lead to quadratic complexity with respect to the number of tokens.
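For reference, the standard scaled dot-product self-attention used in ViT can be written as below; this is the generic formulation, and the paper's exact Eq. (1)/(2) notation may differ slightly.

```latex
% Generic scaled dot-product self-attention used in ViT
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
```

The QK^T term is an N×N matrix over N tokens, which is where the O(N² d) time and memory cost comes from.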

1.2. Fastformer

  • (Please skip this part if you know Fastformer well.)
Additive Attention Module and Fastformer
  • Instead of full self-attention over the Q, K, V matrices of the Vision Transformer, an additive attention module is first applied to summarize the query matrix into a global query vector q, using attention weights α.
  • An element-wise product between the global query vector and each key vector then integrates them into a global context-aware key matrix.
  • Similarly, for computational efficiency, additive attention weights are computed for each vector of this context-aware key matrix, which is then summarized into a global key vector.
  • An element-wise product between the global key vector and each value vector is then calculated.
  • A linear transformation layer is applied to each key-value interaction vector to learn its hidden representation, which is then added to the query matrix to form the final output of the model, as sketched in the code below.

Linear complexity is achieved.
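To make the flow above concrete, here is a minimal single-head PyTorch sketch of Fastformer-style additive attention. The module layout, layer names, and shapes are my own assumptions for illustration, not the authors' implementation.

```python
# Minimal single-head sketch of Fastformer-style additive attention
# (layer names and shapes are illustrative assumptions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.w_q = nn.Linear(dim, 1)   # scores queries -> alpha
        self.w_k = nn.Linear(dim, 1)   # scores keys    -> beta
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** 0.5

    def forward(self, x):              # x: (batch B, N tokens, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # 1) Additive attention over queries -> one global query vector.
        alpha = F.softmax(self.w_q(q) / self.scale, dim=1)   # (B, N, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)      # (B, 1, dim)

        # 2) Element-wise product injects the global query into every key.
        p = global_q * k                                     # (B, N, dim)

        # 3) Additive attention over the context-aware keys -> global key vector.
        beta = F.softmax(self.w_k(p) / self.scale, dim=1)    # (B, N, 1)
        global_k = (beta * p).sum(dim=1, keepdim=True)       # (B, 1, dim)

        # 4) Element-wise product of the global key with every value,
        #    a linear layer, then a residual addition with the queries.
        u = global_k * v                                     # (B, N, dim)
        return self.proj(u) + q
```

Every step is either a per-token operation or a global weighted sum, so the cost grows linearly with the number of tokens N rather than quadratically.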

2. Results

  • Fastformer (B/16 and B/32 variants) is compared to ViT-B/16 and ViT-B/32. Both B/16 and B/32 use a hidden dimension of 768 and an MLP dimension of 3072 (collected in the config sketch below).
  • The number of heads is 12 and the depth is 12.
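For orientation, the shared backbone hyperparameters quoted above can be gathered into a small config sketch; the dictionary layout and key names are my own, not the authors' configuration format.

```python
# Shared backbone hyperparameters for the B/16 and B/32 comparisons
# (values from the review text; structure is illustrative only).
backbone_config = {
    "hidden_dim": 768,    # token embedding size
    "mlp_dim": 3072,      # feed-forward hidden size
    "num_heads": 12,
    "depth": 12,          # number of transformer blocks
    "patch_size": {"B/16": 16, "B/32": 32},
}
```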
ImageNet Results

In the B/16 variant, ViT-B/16 achieves 77% Top-1 accuracy, which is better than Fastformer-B/16 at 63%. However, Fastformer-B/16, with 79M parameters, has fewer parameters than ViT-B/16 with 86M.

  • Fastformer-B/16, at 45.2 GFLOPs, has lower computational cost than ViT-B/16 at 49.3 GFLOPs.

In the B/32 variant, ViT-B/32 achieves 73% Top-1 accuracy, while Fastformer-B/32 achieves 65%. However, Fastformer-B/32, with 81M parameters, has fewer parameters than ViT-B/32 with 88M.

  • Similarly, Fastformer-B/32, at 11.6 GFLOPs, has lower computational cost than ViT-B/32 at 12.6 GFLOPs.
  • (Though the authors state that the proposed method obtains comparable performance to ViT with fewer FLOPs, the accuracy drop appears to be substantial.)
