Brief Review — FasterViT: Fast Vision Transformers with Hierarchical Attention
FasterViT: Fast Vision Transformers with Hierarchical Attention
FasterViT, by NVIDIA
2024 ICLR, Over 20 Citations (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2023 [Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2]
==== My Other Paper Readings Are Also Over Here ====
- A Hierarchical Attention (HAT) approach is proposed, which decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs.
- Efficient window-based self-attention is used in which each window has access to dedicated carrier tokens that participate in local and global representation learning.
- At a high level, global self-attention enables efficient cross-window communication at lower costs.
Outline
- FasterViT
- Results
1. FasterViT
1.1. Overall Architecture
- FasterViT exploits convolutional layers in the earlier stages, which operate on higher-resolution features, while the second half of the model relies on the novel hierarchical attention layers.
1.2. Stem & Downsampler Blocks & Conv Blocks
- Stem: An input image x is converted into overlapping patches by two consecutive 3×3 convolutional layers, each with a stride of 2.
- Downsampler Blocks: The spatial resolution is reduced by 2 between stages by a downsampling block, which applies 2D layer norm followed by a 3×3 convolutional layer with a stride of 2.
- Conv Blocks: Stages 1 and 2 consist of residual convolutional blocks, with the use of BN and GELU; a sketch of the stem, downsampler, and conv block is given below.
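A minimal PyTorch-style sketch of these three components, assuming standard choices where the text is silent; the channel widths, channel doubling in the downsampler, and the activations inside the stem are illustrative assumptions, not the official FasterViT implementation:

```python
import torch
import torch.nn as nn


class LayerNorm2d(nn.Module):
    """Layer norm over the channel dimension of an NCHW feature map."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # NCHW -> NHWC, normalize channels, then back to NCHW
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)


class Stem(nn.Module):
    """Two consecutive 3x3 convs with stride 2: H x W x 3 -> H/4 x W/4 x dim.
    The BN/ReLU between the convs is an assumption to keep the sketch runnable."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.proj(x)


class Downsampler(nn.Module):
    """2D layer norm followed by a 3x3 conv with stride 2 (channel doubling assumed)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = LayerNorm2d(dim)
        self.reduce = nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1)

    def forward(self, x):
        return self.reduce(self.norm(x))


class ConvBlock(nn.Module):
    """Residual conv block for stages 1-2: Conv-BN-GELU, Conv-BN, plus a skip connection
    (a common residual form consistent with the BN/GELU description above)."""
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(dim)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(dim)

    def forward(self, x):
        y = self.act(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return x + y


if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)
    feat = Stem(dim=64)(x)          # (1, 64, 56, 56)
    feat = ConvBlock(64)(feat)      # (1, 64, 56, 56)
    feat = Downsampler(64)(feat)    # (1, 128, 28, 28)
```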
1.3. Hierarchical Attention
- An input feature map x is first partitioned into n×n local windows, with n = H²/k², where k is the window size:
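In the paper's notation, this partition can be written roughly as

\hat{x}_l = \mathrm{Split}_{k \times k}(x)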
The key idea of HAT is the formulation of carrier tokens (CTs), which give each local window an attention footprint much larger than the window itself at low cost.
- At first, CTs are initialized by pooling to L = 2^c tokens per window:
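Assuming the paper's form, this initialization is a 3×3 convolution (the positional-encoding conv mentioned in the next bullet) followed by average pooling down to n²L carrier tokens in total, roughly

\hat{x}_c = \mathrm{Conv}_{3 \times 3}(x), \qquad \hat{x}_{ct} = \mathrm{AvgPool}_{H^2 \to n^2 L}(\hat{x}_c)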
- This conv is an efficient positional encoding, also used in Twins.
- c = 1.
- These pooled tokens represent a summary of their respective local windows.
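A compact Python sketch of the window partition and carrier-token pooling described above, assuming a square feature map; the helper names and the parameter t (carrier tokens per window side) are illustrative, and the positional-encoding conv is omitted:

```python
import torch
import torch.nn.functional as F


def window_partition(x, k):
    """Split an NCHW feature map into non-overlapping k x k windows.
    Returns (B * num_windows, k*k, C) token sequences."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // k, k, W // k, k)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, k * k, C)


def init_carrier_tokens(x, k, t=1):
    """Summarize each k x k window into t*t carrier tokens (the paper's L) by pooling."""
    B, C, H, W = x.shape
    n = H // k                                    # windows per spatial side
    ct = F.adaptive_avg_pool2d(x, (n * t, n * t))
    # Group carrier tokens by their window: (B * n*n, t*t, C)
    return ct.view(B, C, n, t, n, t).permute(0, 2, 4, 3, 5, 1).reshape(-1, t * t, C)


# Example: a 64-channel 56x56 feature map with window size 7 -> 8x8 = 64 windows per image
feat = torch.randn(2, 64, 56, 56)
local_tokens = window_partition(feat, k=7)            # (128, 49, 64)
carrier_tokens = init_carrier_tokens(feat, k=7, t=1)  # (128, 1, 64)
```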
- In every HAT block, CTs undergo the attention procedure:
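Per the description in the next bullet, this is a standard pre-norm transformer update (any extra per-branch scaling in the paper is omitted here), roughly

\hat{x}_{ct} \leftarrow \hat{x}_{ct} + \mathrm{MHSA}(\mathrm{LN}(\hat{x}_{ct})), \qquad \hat{x}_{ct} \leftarrow \hat{x}_{ct} + \mathrm{MLP}(\mathrm{LN}(\hat{x}_{ct}))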
- where LN is layer norm, MHSA is multi-head self-attention, and MLP is a 2-layer MLP with GELU activation.
The interaction between the local and carrier tokens, ˆx_l and ˆx_ct,l respectively, is then computed.
- At first, local features and CTs are concatenated. Each local window only has access to its corresponding CTs:
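In the same notation, this per-window concatenation is roughly

\hat{x}_w = \mathrm{Concat}(\hat{x}_l, \hat{x}_{ct,l})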
- These concatenated tokens undergo another attention procedure:
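Again in plain pre-norm form:

\hat{x}_w \leftarrow \hat{x}_w + \mathrm{MHSA}(\mathrm{LN}(\hat{x}_w)), \qquad \hat{x}_w \leftarrow \hat{x}_w + \mathrm{MLP}(\mathrm{LN}(\hat{x}_w))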
- Finally, tokens are further split back and used in the subsequent hierarchical attention layers:
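That is, roughly

\hat{x}_l, \hat{x}_{ct,l} = \mathrm{Split}(\hat{x}_w)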
- The above procedures are iteratively applied for a number of layers in the stage.
- To further facilitate long-short-range interaction, global information propagation is performed at the end of the stage:
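A hedged sketch of this step, assuming the paper's form: the carrier tokens are upsampled back to the full spatial resolution and added to the re-assembled local tokens, roughly

x = \mathrm{Upsample}_{n^2 L \to H^2}(\hat{x}_{ct}) + \mathrm{Merge}_{k^2 n^2 \to H^2}(\hat{x}_l)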
1.4. Attention Map Comparison
2. Results
2.1. Image Classification
- Compared to Conv-based architectures, FasterViT achieves higher accuracy at the same throughput; for example, FasterViT outperforms ConvNeXt-T by 2.2%.
- Considering the accuracy and throughput trade-off, FasterViT models are significantly faster than Transformer-based models such as the family of Swin Transformers.
- Furthermore, compared to hybrid models, such as the recent EfficientFormer and MaxViT (Tu et al., 2022) models, FasterViT on average has a higher throughput while achieving a better ImageNet top-1 performance.
FasterViT-4 has a better accuracy-throughput trade-off than its counterparts.
2.2. Downstream Tasks
FasterViT models have a better accuracy-throughput trade-off than their counterparts.
Similar to previous tasks, FasterViT models benefit from a better performance-throughput trade-off.