Review — Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Simply Replacing Self-Attention Blocks by Feed Forward Layers

Sik-Ho Tsang
2 min read · Nov 25, 2022

Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, by Oxford University,
2021 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Transformer, Vision Transformer, ViT

  • While it has been suggested that “Attention Is All You Need”, the author asks instead: “Do You Even Need Attention?”.
  • This is a short arXiv report showing the strong results obtained by simply replacing the self-attention blocks with feed-forward layers.

Outline

  1. Replacing Self-Attention Blocks by Feed Forward Layers
  2. Results

1. Replacing Self-Attention Blocks by Feed Forward Layers

The architecture explored in this report is extremely simple, consisting of a patch embedding followed by a series of feed-forward layers.
  • The architecture is identical to that of ViT with the attention layer replaced by a feed-forward layer.

These feed-forward layers are alternately applied to the patch and feature dimensions of the image tokens.
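
To make the replacement concrete, below is a minimal PyTorch sketch of one such block: the self-attention sub-layer of a ViT block is swapped for a feed-forward layer applied across the patch dimension, followed by the usual feed-forward layer across the feature dimension. The class names, hidden sizes, and the handling of normalization, residuals, and the class token are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Standard MLP: Linear -> GELU -> Linear."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class FeedForwardOnlyBlock(nn.Module):
    """ViT-style block with the self-attention sub-layer replaced by a
    feed-forward layer that mixes information across the patch dimension.
    Hidden sizes here are assumptions for illustration."""
    def __init__(self, dim, num_patches, token_hidden=256, channel_hidden=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_ff = FeedForward(num_patches, token_hidden)   # replaces attention
        self.norm2 = nn.LayerNorm(dim)
        self.channel_ff = FeedForward(dim, channel_hidden)        # usual ViT MLP

    def forward(self, x):                       # x: (batch, num_patches, dim)
        # Mix across patches: transpose so the Linear acts on the patch axis.
        y = self.norm1(x).transpose(1, 2)       # (batch, dim, num_patches)
        x = x + self.token_ff(y).transpose(1, 2)
        # Mix across features, exactly as in a standard ViT block.
        x = x + self.channel_ff(self.norm2(x))
        return x


# A stack of such blocks after a ViT-style patch embedding:
# 16x16 patches of a 224x224 image -> 196 tokens of dimension 768 (ViT-Base size).
blocks = nn.Sequential(*[FeedForwardOnlyBlock(dim=768, num_patches=196) for _ in range(12)])
tokens = torch.randn(2, 196, 768)               # (batch, patches, embedding dim)
print(blocks(tokens).shape)                     # torch.Size([2, 196, 768])
```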

2. Results

Comparison of ImageNet top-1 accuracies for different model sizes:
  • Notably, the proposed ViT-base-sized model gives 74.9% top-1 accuracy without any hyperparameter tuning (i.e. using the same hyperparameters as its ViT counterpart).

These results suggest that the strong performance of vision transformers may be attributable less to the attention mechanism and more to other factors, such as the inductive bias produced by the patch embedding and the carefully curated set of training augmentations.

  • The primary purpose of this report is to explore the limits of simple architectures, not to break the ImageNet benchmarks.

Reference

[2021 arXiv v1] [Do You Even Need Attention?]
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

1.1. Image Classification

1989 … 2021 [Do You Even Need Attention?] … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer]

My Other Previous Paper Readings

