Review — DeepViT: Towards Deeper Vision Transformer

Re-Attention, Increase Attention Map Diversity

Sik-Ho Tsang
4 min read · Apr 24, 2023
Top-1 accuracy of ViTs and DeepViTs on ImageNet with different network depths {12, 16, 24, 32}.

DeepViT: Towards Deeper Vision Transformer,
DeepViT, by National University of Singapore, and ByteDance US AI Lab,
2021 arXiv v4, Over 290 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [MetaFormer, PoolFormer] [Swin Transformer V2] [hMLP] [DeiT III] 2023 [Vision Permutator (ViP)]

  • The performance of ViTs saturates quickly as the network gets deeper, as shown above.
  • Scaling difficulty is caused by the attention collapse issue: as the Transformer goes deeper, the attention maps gradually become similar.
  • Re-attention is proposed to re-generate the attention maps and increase their diversity across layers, at negligible computation and memory cost. ViT with Re-attention forms DeepViT.

Outline

  1. ViT Attention Map Analysis
  2. DeepViT
  3. Results

1. ViT Attention Map Analysis

  • For ViT, the classification accuracy improves slowly and saturates quickly as the model goes deeper. To find out why, the attention maps are studied.
  • Cross-layer similarity between the attention maps from different layers is estimated, for head h and token t, as $M^{p,q}_{h,t} = \frac{(A^{p}_{h,:,t})^{\top} A^{q}_{h,:,t}}{\lVert A^{p}_{h,:,t} \rVert \, \lVert A^{q}_{h,:,t} \rVert}$,
  • where $M^{p,q}$ is the cosine similarity matrix between the attention maps of layers p and q.
  • $A_{h,:,t}$ is a T-dimensional vector representing how much the input token t contributes to each of the T output tokens. $M^{p,q}_{h,t}$ thus measures how the contribution of one token varies from layer p to layer q.

When $M^{p,q}_{h,t}$ equals one, token t plays exactly the same role for self-attention in layers p and q.
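
The similarity measure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the attention maps are stored as arrays of shape (H, T, T) with axes (head, output token, input token), and the function name `cross_layer_similarity` and the random test data are hypothetical, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_similarity(attn_p, attn_q, eps=1e-12):
    """Cosine similarity between the vectors A[h, :, t] of two layers.

    attn_p, attn_q: attention maps of layers p and q, shape (H, T, T),
    with axes (head, output token, input token) -- an assumed layout.
    Returns an (H, T) array with one similarity per head and input token.
    """
    dot = np.sum(attn_p * attn_q, axis=1)        # inner product over output tokens
    norm_p = np.linalg.norm(attn_p, axis=1)      # length of each A^p[h, :, t]
    norm_q = np.linalg.norm(attn_q, axis=1)      # length of each A^q[h, :, t]
    return dot / (norm_p * norm_q + eps)

# Random stand-ins for the attention maps of two layers (illustrative only).
rng = np.random.default_rng(0)
H, T = 4, 16
attn_a = softmax(rng.normal(size=(H, T, T)))
attn_b = softmax(rng.normal(size=(H, T, T)))

same = cross_layer_similarity(attn_a, attn_a)    # identical maps
diff = cross_layer_similarity(attn_a, attn_b)    # unrelated maps
print(same.mean(), diff.mean())
```

Comparing a map with itself gives a similarity of one for every head and token, which is the "attention collapse" extreme; unrelated maps score lower.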

(a) The similarity ratio of the generated self-attention maps across different layers. (b) The ratio of similar blocks to the total number of blocks. (c) Similarity of attention maps from different heads within the same block.
  • (a): The similarity ratio is larger than 90% for blocks after the 17th one.
  • (b): The ratio of similar blocks to the total number of blocks increases when the depth of the ViT model increases.
  • (c): While the similarity between attention maps across different Transformer blocks is high, especially for deep layers, the similarity between different heads within the same block is below 30% in all cases, so the heads remain sufficiently diverse. This diversity is exploited later by the proposed re-attention.
Left: Cross layer similarity of attention map and features for ViTs. Right: Feature map cross layer cosine similarity for both the original ViT model and the proposed one.
  • The similarity is quite high and the learned features stop evolving after the 20th block. There is a close correlation between the increase of attention similarity and feature similarity.

This observation indicates that attention collapse is responsible for the poor scalability of ViTs with depth.

Impacts of embedding dimension

One possible solution is to increase the embedding dimension, but this will increase the model size greatly as shown above.
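
A rough back-of-the-envelope count shows why widening is expensive. The sketch below assumes a standard Transformer block (four d×d attention projections plus a two-layer MLP with expansion ratio 4) and counts weight matrices only, ignoring biases, LayerNorm, patch embedding, and the classifier head. The function names and the list of embedding dimensions are illustrative assumptions, not figures from the paper.

```python
def approx_block_params(embed_dim, mlp_ratio=4):
    """Approximate weight count of one Transformer block (weights only)."""
    attn = 4 * embed_dim * embed_dim              # Q, K, V and output projections
    mlp = 2 * mlp_ratio * embed_dim * embed_dim   # two linear layers of the MLP
    return attn + mlp

def approx_encoder_params(depth, embed_dim, mlp_ratio=4):
    """Approximate weight count of a stack of `depth` identical blocks."""
    return depth * approx_block_params(embed_dim, mlp_ratio)

# Illustrative embedding dimensions for a 12-block model (assumed values).
for dim in (256, 384, 512, 768):
    millions = approx_encoder_params(12, dim) / 1e6
    print(f"embed_dim={dim}: ~{millions:.1f}M encoder weights")
```

Because the count grows with the square of the embedding dimension, doubling the width roughly quadruples the encoder size, whereas adding blocks grows it only linearly.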

2. DeepViT

Comparison between the (a) original ViT with N Transformer blocks and (b) the proposed DeepViT model

2.1. Attention in the original Transformer
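
For reference, the standard scaled dot-product self-attention of the original Transformer is commonly written as below, where Q, K and V are the query, key and value matrices and d is the per-head dimension. The symbol names follow common usage and are an assumption about the notation used here.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
```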

2.2. Proposed Re-Attention in DeepViT

Left: The original self-attention mechanism. Right: The proposed re-attention mechanism.
  • Cross-head communication is proposed to re-generate the attention maps:
  • As shown above, the original attention maps are mixed across heads via a learnable matrix before being multiplied with the values.

Re-attention exploits the interactions among different attention heads to collect their complementary information and thereby improve the diversity of the attention maps.
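
The head-mixing step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the names `theta`, `attention_maps` and `re_attention` are hypothetical, the mixing matrix is random rather than learned, and the normalization that the paper applies after mixing is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_maps(q, k):
    """Per-head attention maps of shape (H, T, T) from q, k of shape (H, T, d)."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1)

def self_attention(q, k, v):
    """Original self-attention: each head uses only its own attention map."""
    return attention_maps(q, k) @ v

def re_attention(q, k, v, theta):
    """Re-attention sketch: mix the H attention maps with an (H, H) matrix.

    theta stands in for the learnable mixing matrix; new head g is
    sum_h theta[h, g] * attn[h]. The paper's normalization is omitted.
    """
    attn = attention_maps(q, k)                       # (H, T, T)
    mixed = np.einsum("hg,htu->gtu", theta, attn)     # cross-head mixing
    return mixed @ v                                  # (H, T, d)

rng = np.random.default_rng(0)
H, T, d = 4, 8, 16
q, k, v = (rng.normal(size=(H, T, d)) for _ in range(3))

out_plain = self_attention(q, k, v)
out_identity = re_attention(q, k, v, np.eye(H))       # identity mixing
out_mixed = re_attention(q, k, v, rng.normal(size=(H, H)))
```

With an identity mixing matrix the sketch reduces to ordinary self-attention, which makes clear that the only extra cost is one small H×H matrix per block.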

3. Results

3.1. Visualizations

Attention map visualization of the selected blocks of the baseline ViT model with 32 Transformer blocks.

After adding Re-attention, the originally similar attention maps become diverse, as shown in the second row. (Only in the last block is a nearly uniform attention map learned.)

3.2. Comparison with ViT

ImageNet Top-1 accuracy of DeepViT models with Re-attention and different numbers of Transformer blocks.
  • The vanilla ViT architecture suffers performance saturation when adding more Transformer blocks.

When self-attention is replaced with the proposed Re-attention, the number of similar blocks drops to zero in every case, and the performance rises consistently as the model depth increases.

The performance gain is especially significant for DeepViT with 32 blocks.

3.3. SOTA Comparison

SOTA Comparisons

DeepViT achieves higher accuracy with fewer parameters than recent CNN-based and ViT-based models. Notably, without any complicated architectural changes such as those made by T2T-ViT or DeiT, DeepViT-L outperforms them by 0.4 points with an even smaller model size (55M vs. 64M and 86M).
