Review — Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
Simply Replacing Self-Attention Blocks by Feed Forward Layers
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, by Oxford University,
2021 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Transformer, Vision Transformer, ViT
- While “Attention is All You Need” suggests that attention is essential, the author instead asks: “Do You Even Need Attention?”.
- This is a short arXiv report showing the strong results obtained by simply replacing self-attention blocks with feed-forward layers.
Outline
- Replacing Self-Attention Blocks by Feed Forward Layers
- Results
1. Replacing Self-Attention Blocks by Feed Forward Layers
- The architecture is identical to that of ViT, with each attention layer replaced by a feed-forward layer.
These feed-forward layers are applied alternately to the patch and feature dimensions of the image tokens, as sketched below.
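Below is a minimal PyTorch sketch (not the authors’ released code) of such a block: the self-attention sub-layer is replaced by a feed-forward layer applied across the patch (token) dimension, followed by the usual feed-forward layer across the feature dimension. The class names, hidden sizes, and normalization placement are illustrative assumptions rather than the exact configuration from the report.

```python
# Minimal sketch of a ViT-style block with attention replaced by a feed-forward
# layer over the patch dimension (assumed hyperparameters, for illustration only).
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class LinearBlock(nn.Module):
    """One block: feed-forward over patches (in place of attention), then over features."""
    def __init__(self, dim, num_patches, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_ff = FeedForward(num_patches, num_patches * mlp_ratio)  # mixes patches
        self.norm2 = nn.LayerNorm(dim)
        self.channel_ff = FeedForward(dim, dim * mlp_ratio)                # mixes features

    def forward(self, x):                       # x: (batch, num_patches, dim)
        # Feed-forward across the patch dimension replaces self-attention.
        y = self.norm1(x).transpose(1, 2)       # (batch, dim, num_patches)
        x = x + self.token_ff(y).transpose(1, 2)
        # Standard feed-forward across the feature dimension, as in ViT.
        x = x + self.channel_ff(self.norm2(x))
        return x

# Usage: a ViT-base-like setting with 14x14 = 196 patches of dimension 768.
block = LinearBlock(dim=768, num_patches=196)
tokens = torch.randn(2, 196, 768)
print(block(tokens).shape)  # torch.Size([2, 196, 768])
```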
2. Results
- Notably, the proposed ViT-base-sized model achieves 74.9% top-1 accuracy on ImageNet without any hyperparameter tuning (i.e. using the same hyperparameters as its ViT counterpart).
These results suggest that the strong performance of vision transformers may be attributable less to the attention mechanism and more to other factors, such as the inductive bias produced by the patch embedding and the carefully-curated set of training augmentations.
- The primary purpose of this report is to explore the limits of simple architectures, not to break the ImageNet benchmarks.
Reference
[2021 arXiv v1] [Do You Even Need Attention?]
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
1.1. Image Classification
1989 …2021 [Do You Even Need Attention?] … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer]