Review — DeepViT: Towards Deeper Vision Transformer
Re-Attention, Increase Attention Map Diversity
DeepViT: Towards Deeper Vision Transformer
DeepViT, by National University of Singapore, and ByteDance US AI Lab
2021 arXiv v4, Over 290 Citations (Sik-Ho Tsang @ Medium)
- The performance of ViTs saturates fast as the model goes deeper, as shown above.
- Scaling difficulty is caused by the attention collapse issue: as the Transformer goes deeper, the attention maps gradually become similar.
- Re-attention is proposed to form DeepViT: it regenerates the attention maps at different layers to increase their diversity, with negligible computation and memory cost.
Outline
- ViT Attention Map Analysis
- DeepViT
- Results
1. ViT Attention Map Analysis
- Since the classification accuracy of ViT improves slowly and saturates fast as the model goes deeper, the attention maps are studied.
- Cross-layer similarity between the attention maps from different layers is estimated with the cosine similarity (a minimal sketch of this computation is given after this list):

$$M^{p,q}_{h,t} = \frac{(A^{p}_{h,:,t})^{\top} A^{q}_{h,:,t}}{\lVert A^{p}_{h,:,t}\rVert \, \lVert A^{q}_{h,:,t}\rVert}$$

- where $M^{p,q}$ is the cosine similarity matrix between the attention maps of layers $p$ and $q$.
- $A_{h,:,t}$ is a $T$-dimensional vector representing how much the input token $t$ contributes to each of the $T$ output tokens. $M^{p,q}_{h,t}$ thus measures how the contribution of token $t$ varies from layer $p$ to layer $q$.
When $M^{p,q}_{h,t}$ equals one, token $t$ plays exactly the same role for self-attention in layers $p$ and $q$.
- (a): The similarity ratio is larger than 90% for blocks after the 17th one.
- (b): The ratio of similar blocks to the total number of blocks increases when the depth of the ViT model increases.
- (c): While the similarity between attention maps across different Transformer blocks is high, especially for the deep layers, as in the figure, the similarities between different heads within the same block are all lower than 30%, i.e., the heads present sufficient diversity. This diversity is exploited by the proposed Re-attention later.
- The feature similarity is also quite high and the learned features stop evolving after the 20th block; there is a close correlation between the increase in attention-map similarity and the increase in feature similarity.
This observation indicates that attention collapse is responsible for the scaling difficulty of ViTs.
One possible solution is to increase the embedding dimension, but this will increase the model size greatly as shown above.
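Below is a minimal PyTorch sketch (not the authors' code) of how this cross-layer similarity could be computed; the tensor shapes, the random toy attention maps, and the 90% threshold in the example are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cross_layer_similarity(attn_p: torch.Tensor, attn_q: torch.Tensor) -> torch.Tensor:
    """Cosine similarity M^{p,q} between the attention maps of layers p and q.

    attn_p, attn_q: attention maps of shape (H, T, T), where H is the number of
    heads and T the number of tokens; column A[h, :, t] describes how much input
    token t contributes to each of the T output tokens.
    """
    a_p = attn_p.permute(0, 2, 1)  # a_p[h, t] = A^p_{h, :, t}
    a_q = attn_q.permute(0, 2, 1)  # a_q[h, t] = A^q_{h, :, t}
    return F.cosine_similarity(a_p, a_q, dim=-1)  # shape (H, T)

# Toy example: fraction of (head, token) entries whose similarity exceeds 90%.
H, T = 12, 197
attn_a, attn_b = torch.softmax(torch.randn(2, H, T, T), dim=-1)
M = cross_layer_similarity(attn_a, attn_b)
print((M > 0.9).float().mean().item())
```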
2. DeepViT
2.1. Attention in the original Transformer
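- In the original Transformer block, each self-attention head computes

$$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where $\text{Softmax}(QK^{\top}/\sqrt{d})$ is the attention map $A$ analyzed above and $d$ is the per-head embedding dimension.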
2.2. Proposed Re-Attention in DeepViT
- Cross-head communication is proposed to re-generate the attention maps:
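In the paper, Re-Attention is defined as

$$\text{Re-Attention}(Q, K, V) = \text{Norm}\!\left(\Theta^{\top}\,\text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)\right)V,$$

where $\Theta \in \mathbb{R}^{H \times H}$ is a learnable transformation matrix applied along the head dimension and Norm(·) is a normalization used to reduce the layer-wise variance.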
- As shown above, the original attention map is mixed via a learnable matrix before being multiplied with the values.
Re-attention exploits the interactions among different attention heads to collect their complementary information, which improves the diversity of the attention maps.
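A minimal PyTorch sketch of this idea, following the description above; the class name ReAttention, the near-identity initialization of the mixing matrix, and the use of BatchNorm2d as the Norm function are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Multi-head self-attention whose attention maps are mixed across heads
    by a learnable H x H matrix (Theta) before being applied to the values."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Theta: learnable cross-head mixing matrix (near-identity init is an assumption).
        self.theta = nn.Parameter(torch.eye(num_heads) + 0.01 * torch.randn(num_heads, num_heads))
        # Norm over the mixed attention maps (BatchNorm2d here is an assumption).
        self.norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                   # each (B, H, T, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, T, T)
        attn = attn.softmax(dim=-1)
        # Re-attention: mix the H attention maps across heads with Theta, then normalize.
        attn = torch.einsum('gh,bhij->bgij', self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)  # back to (B, T, C)
        return self.proj(out)

# Usage: a batch of 2 sequences with 197 tokens and embedding dimension 384.
x = torch.randn(2, 197, 384)
print(ReAttention(dim=384, num_heads=8)(x).shape)          # torch.Size([2, 197, 384])
```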
3. Results
3.1. Visualizations
After adding Re-attention, the originally similar attention maps become diverse, as shown in the second row. (Only at the last block is a nearly uniform attention map learned.)
3.2. Comparison with ViT
- The vanilla ViT architecture suffers from performance saturation when adding more Transformer blocks.
When the self-attention is replaced with the proposed Re-attention, the number of similar blocks is reduced to zero and the performance rises consistently as the model depth increases.
The performance gain is especially significant for DeepViT with 32 blocks.
3.3. SOTA Comparison
The DeepViT model achieves higher accuracy with fewer parameters than recent CNN- and ViT-based models, notably without any complicated architecture changes as made by T2T-ViT or DeiT. DeepViT-L outperforms them by 0.4 points with an even smaller model size (55M vs. 64M & 86M).