Brief Review — Fast Vision Transformer via Additive Attention

Applying Fastformer to the Vision Transformer

Sik-Ho Tsang
3 min read · Oct 6, 2024

Fast Vision Transformer via Additive Attention
Fast Vision Transformer (FViT), by Shenzhen University and Xidian University
2024 CAI (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] [OpenCLIP] 2024 [FasterViT] [CAS-ViT] [TinySaver]
==== My Other Paper Readings Are Also Over Here ====

Outline

  1. Fast Vision Transformer (FViT)
  2. Results

1. Fast Vision Transformer (FViT)

1.1. Vision Transformer (ViT)

  • The multi-head self-attention module in ViT:

Eqs. (1) and (2) lead to quadratic complexity with respect to the number of tokens.
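For reference, here is a minimal sketch (not the paper's code) of standard scaled dot-product self-attention as used in ViT, showing where the N×N score matrix, and hence the quadratic cost, comes from:

```python
# Minimal sketch of standard self-attention; variable names are illustrative only.
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    # Q, K, V: (batch, N, d) with N = number of patch tokens.
    d = Q.size(-1)
    # The N x N score matrix is the source of O(N^2) time and memory.
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (batch, N, N)
    attn = F.softmax(scores, dim=-1)
    return attn @ V                               # (batch, N, d)

x = torch.randn(2, 196, 768)    # e.g., 14x14 = 196 patches, hidden size 768
out = self_attention(x, x, x)   # Q = K = V = x here, just for illustration
print(out.shape)                # torch.Size([2, 196, 768])
```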

1.2. Fastformer

  • (Please skip this part if you know Fastformer well.)
Additive Attention Module and Fastformer
  • Instead of full self-attention over the Q, K, and V matrices of the Vision Transformer, an additive attention module is first applied to summarize the query matrix into a global query vector q using attention weights α:
  • Then an element-wise product between the global query vector and each key vector integrates them into a global context-aware key matrix.
  • Similarly, for computational efficiency, the i-th key vector’s additive attention weight is calculated and the weighted vectors are summed into a global key vector:
  • An element-wise product between the global key vector and each value vector is then calculated.
  • A linear transformation layer is then applied to each key-value interaction vector to learn its hidden representation, which is further added to the query matrix to form the final output of the model (see the sketch below).

Linear complexity with respect to the number of tokens is achieved.
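To make the steps above concrete, below is a minimal, hedged single-head sketch of Fastformer-style additive attention in PyTorch. The module and variable names (AdditiveAttention, w_q, w_k, proj) are my own and are not taken from the released code:

```python
# Hedged sketch of Fastformer-style additive attention (single head),
# following the description above; every operation is linear in N.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.w_q = nn.Linear(dim, 1)     # additive attention weights over queries
        self.w_k = nn.Linear(dim, 1)     # additive attention weights over keys
        self.proj = nn.Linear(dim, dim)  # transform of key-value interactions
        self.scale = dim ** -0.5

    def forward(self, x):                # x: (batch, N, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # 1) Summarize the query matrix into one global query vector.
        alpha = F.softmax(self.w_q(q) * self.scale, dim=1)   # (batch, N, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)      # (batch, 1, dim)

        # 2) Element-wise product injects the global query into every key.
        p = global_q * k                                     # (batch, N, dim)

        # 3) Summarize the context-aware keys into one global key vector.
        beta = F.softmax(self.w_k(p) * self.scale, dim=1)    # (batch, N, 1)
        global_k = (beta * p).sum(dim=1, keepdim=True)       # (batch, 1, dim)

        # 4) Element-wise product with each value vector, a linear transform,
        #    then addition with the query matrix to form the output.
        u = global_k * v                                     # (batch, N, dim)
        return self.proj(u) + q                              # (batch, N, dim)

x = torch.randn(2, 196, 768)
print(AdditiveAttention(768)(x).shape)   # torch.Size([2, 196, 768])
```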

2. Results

  • Fastformer (B/16 and B/32 variants) is compared with the ViT B/16 and B/32 variants. Both B/16 and B/32 have a hidden dimension of 768 and an MLP dimension of 3072.
  • The number of heads is 12 and the depth is set to 12.
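To keep the compared variants straight, here is a small configuration sketch assembled from the bullets above; the dataclass and field names are my own assumptions, not from the authors' code:

```python
# Hedged summary of the compared model configurations (names are illustrative).
from dataclasses import dataclass

@dataclass
class VariantConfig:
    patch_size: int       # 16 for B/16, 32 for B/32
    hidden_dim: int = 768
    mlp_dim: int = 3072
    num_heads: int = 12
    depth: int = 12

b16 = VariantConfig(patch_size=16)  # used for both ViT-B/16 and Fastformer-B/16
b32 = VariantConfig(patch_size=32)  # used for both ViT-B/32 and Fastformer-B/32
```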
ImageNet Results

In the B/16 variants, ViT-B/16 achieves 77% Top-1 accuracy, which is better than Fastformer-B/16’s 63%. However, Fastformer-B/16 has fewer parameters (79M) than ViT-B/16 (86M).

  • Fastformer-B/16 also has a lower computational cost, at 45.2 GFLOPs versus 49.3 GFLOPs for ViT-B/16.

In the B/32 variants, ViT-B/32 achieves 73% Top-1 accuracy, while Fastformer-B/32 reaches 65%. However, Fastformer-B/32 has fewer parameters (81M) than ViT-B/32 (88M).

  • Fastformer-B/32’s computational cost of 11.6 GFLOPs is also lower than ViT-B/32’s 12.6 GFLOPs.
  • (Although the authors state that the proposed method obtains comparable performance to ViT with fewer FLOPs, the accuracy drop appears to be substantial.)

