Review — Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
Simply Replacing Self-Attention Blocks by Feed Forward Layers
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet, by Oxford University,
2021 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Transformer, Vision Transformer, ViT
- While “Attention is All You Need” suggests that attention is essential, the author instead asks: “Do You Even Need Attention?”.
- This is a short arXiv report showing the strong results obtained by simply replacing self-attention blocks with feed-forward layers.
Outline
- Replacing Self-Attention Blocks by Feed Forward Layers
- Results
1. Replacing Self-Attention Blocks by Feed Forward Layers
- The architecture is identical to that of ViT, with each attention layer replaced by a feed-forward layer.
These feed-forward layers are applied alternately to the patch and feature dimensions of the image tokens, as sketched below.
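Below is a minimal PyTorch sketch (not the authors’ released code) of such a block: the self-attention sub-layer is replaced by a feed-forward layer applied across the patch (token) dimension, followed by the usual feed-forward layer across the feature dimension. The class names, hidden sizes, and normalization placement are illustrative assumptions rather than the exact configuration from the report.

```python
# Minimal sketch of a ViT-style block with attention replaced by a feed-forward
# layer over the patch dimension (assumed hyperparameters, for illustration only).
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class LinearBlock(nn.Module):
    """One block: feed-forward over patches (in place of attention), then over features."""
    def __init__(self, dim, num_patches, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_ff = FeedForward(num_patches, num_patches * mlp_ratio)  # mixes patches
        self.norm2 = nn.LayerNorm(dim)
        self.channel_ff = FeedForward(dim, dim * mlp_ratio)                # mixes features

    def forward(self, x):                       # x: (batch, num_patches, dim)
        # Feed-forward across the patch dimension replaces self-attention.
        y = self.norm1(x).transpose(1, 2)       # (batch, dim, num_patches)
        x = x + self.token_ff(y).transpose(1, 2)
        # Standard feed-forward across the feature dimension, as in ViT.
        x = x + self.channel_ff(self.norm2(x))
        return x

# Usage: a ViT-base-like setting with 14x14 = 196 patches of dimension 768.
block = LinearBlock(dim=768, num_patches=196)
tokens = torch.randn(2, 196, 768)
print(block(tokens).shape)  # torch.Size([2, 196, 768])
```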
2. Results
- Notably, the proposed ViT-base-sized model achieves 74.9% top-1 accuracy on ImageNet without any hyperparameter tuning (i.e. using the same hyperparameters as its ViT counterpart).
These results suggest that the strong performance of vision transformers may be attributable less to the attention mechanism and more to other factors, such as the inductive bias produced by the patch embedding and the carefully-curated set of training augmentations.
- The primary purpose of this report is to explore the limits of simple architectures, not to break the ImageNet benchmarks.
Reference
[2021 arXiv v1] [Do You Even Need Attention?]
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
1.1. Image Classification
1989 …2021 [Do You Even Need Attention?] … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer]