Brief Review — BotNet: Bottleneck Transformers for Visual Recognition

BotNet, By Only Replacing 3 Bottleneck Blocks With Self-Attention, Improves Performance Significantly

Sik-Ho Tsang
Apr 10, 2023
A taxonomy of deep learning architectures using self-attention for visual recognition. The proposed architecture BoTNet is a hybrid model that uses both convolutions and self-attention.

Bottleneck Transformers for Visual Recognition,
BoTNet, by UC Berkeley and Google Research,
2021 CVPR, Over 580 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]

  • BoTNet is obtained by replacing the spatial convolutions with global self-attention in only the final three bottleneck blocks of a ResNet, which improves performance.


  1. BotNet
  2. Results

1. BotNet

1.1. Bottleneck Transformer

Left: A ResNet Bottleneck Block, Right: A Bottleneck Transformer (BoT) block.

The only difference is the replacement of the spatial 3×3 convolution layer with Multi-Head Self-Attention (MHSA). 4 heads are used.

1.2. Positional Embedding

Multi-Head Self-Attention (MHSA) layer used in the BoT block.

To make the attention operation position-aware, global (All2All) attention is performed on a 2D feature map with split relative position encodings Rh and Rw for height and width, respectively.

  • The highlighted blue boxes (position encodings and the value projection) and the multiple heads are the only three elements that are not present in the Non-Local Neural Network.
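
To make the attention computation concrete, here is a minimal loop-based sketch in plain Python, not the authors' implementation. It assumes the attention logit between two positions is the content-content term plus a content-position term, where the position vector is the sum of a height encoding and a width encoding looked up by relative offset. The helper names (project, attention_head, mhsa, random_head), the tiny sizes, and the random initialisation are all hypothetical, and any scaling of the logits is omitted for brevity.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(weight, x):
    # A 1x1 convolution acts on each position independently: weight @ x.
    return [dot(row, x) for row in weight]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(fmap, Wq, Wk, Wv, Rh, Rw):
    """One head of all-to-all attention over an H x W map of feature vectors.

    Rh and Rw are dicts mapping a relative row/column offset to a vector.
    """
    H, W = len(fmap), len(fmap[0])
    positions = [(i, j) for i in range(H) for j in range(W)]
    q = {p: project(Wq, fmap[p[0]][p[1]]) for p in positions}
    k = {p: project(Wk, fmap[p[0]][p[1]]) for p in positions}
    v = {p: project(Wv, fmap[p[0]][p[1]]) for p in positions}
    out = [[None] * W for _ in range(H)]
    for (i, j) in positions:
        logits = []
        for (a, b) in positions:
            # Split relative encoding: height part plus width part.
            r = [rh + rw for rh, rw in zip(Rh[a - i], Rw[b - j])]
            # Content-content term (q.k) plus content-position term (q.r).
            logits.append(dot(q[(i, j)], k[(a, b)]) + dot(q[(i, j)], r))
        weights = softmax(logits)
        out[i][j] = [sum(w * v[p][c] for w, p in zip(weights, positions))
                     for c in range(len(Wv))]
    return out

def mhsa(fmap, heads):
    """Run every head and concatenate the outputs along the channel axis."""
    per_head = [attention_head(fmap, *params) for params in heads]
    H, W = len(fmap), len(fmap[0])
    return [[sum((o[i][j] for o in per_head), []) for j in range(W)]
            for i in range(H)]

def random_head(rng, d_in, d_head, H, W):
    # Hypothetical initialiser, for illustration only.
    def mat(rows, cols):
        return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
    Rh = {off: [rng.gauss(0, 0.1) for _ in range(d_head)] for off in range(-(H - 1), H)}
    Rw = {off: [rng.gauss(0, 0.1) for _ in range(d_head)] for off in range(-(W - 1), W)}
    return (mat(d_head, d_in), mat(d_head, d_in), mat(d_head, d_in), Rh, Rw)

rng = random.Random(0)
H, W, d_in, d_head, num_heads = 3, 4, 8, 2, 4  # 4 heads, as in the BoT block
fmap = [[[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(W)] for _ in range(H)]
heads = [random_head(rng, d_in, d_head, H, W) for _ in range(num_heads)]
out = mhsa(fmap, heads)
print(len(out), len(out[0]), len(out[0][0]))  # 3 4 8
```

Every position attends to every other position, so the inner loop makes the quadratic cost in the number of positions visible. Practical implementations would express the same computation with batched tensor operations.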

1.3. Model Architecture

Architecture of BoTNet-50 (BoT50).
  • The only difference between BoT50 and ResNet-50 (R50) is the use of MHSA layers in c5. For an input resolution of 1024×1024, the MHSA layer in the first block of c5 operates on 64×64 while the remaining two operate on 32×32.
  • BoT50 has only 1.2× more multiply-adds than R50. The overhead in training throughput is 1.3×. BoT50 also has 1.2× fewer parameters than R50.
  • For dense prediction, R50 uses a convolution with a stride of 2 in the first block of c5. In BoT50, the first BoT block instead uses a 2×2 average pooling with a stride of 2.
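
The downsampling step in the first BoT block can be illustrated with a small sketch of 2×2 average pooling with stride 2. This is plain Python on a single-channel map, assuming even height and width; the function name avg_pool_2x2 is hypothetical. The second part assumes the usual ResNet output stride of 16 at the input to c5, which gives the 64×64 map for a 1024×1024 input mentioned above.

```python
def avg_pool_2x2(fmap):
    """2x2 average pooling with stride 2 on a single-channel H x W map.

    Assumes H and W are even.
    """
    H, W = len(fmap), len(fmap[0])
    return [[(fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4.0
             for j in range(0, W, 2)]
            for i in range(0, H, 2)]

small = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(avg_pool_2x2(small))  # [[3.5, 5.5], [11.5, 13.5]]

# Assumption: the input to c5 has an output stride of 16 (standard ResNet).
side = 1024 // 16
c5_input = [[0.0] * side for _ in range(side)]
pooled = avg_pool_2x2(c5_input)
print(len(c5_input), len(pooled))  # 64 32
```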

1.4. BoTNet-S1 for Image Classification

  • For image classification, the BoTNet design in the c5 block group can be changed to uniformly use a stride of 1 in all the final MHSA layers.
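
As a rough back-of-the-envelope sketch of what stride 1 implies, the snippet below counts the pairwise attention logits per head in the three MHSA layers. It assumes a 224×224 input (a common ImageNet resolution, not stated above), a standard ResNet output stride of 16 at the input to c5, and that attention runs before any downsampling within a block. The function names are hypothetical.

```python
def mhsa_resolutions(input_size, block_strides, c4_stride=16):
    """Side length of the feature map seen by the MHSA layer in each c5 block."""
    side = input_size // c4_stride
    sides = []
    for stride in block_strides:
        sides.append(side)  # attention runs before the block's downsampling
        side //= stride
    return sides

def pairwise_logits(sides):
    """Total number of query-key pairs per head, summed over the blocks."""
    return sum((s * s) ** 2 for s in sides)

strided = mhsa_resolutions(224, [2, 1, 1])  # [14, 7, 7]
s1 = mhsa_resolutions(224, [1, 1, 1])       # [14, 14, 14]
ratio = pairwise_logits(s1) / pairwise_logits(strided)
print(strided, s1, round(ratio, 2))  # [14, 7, 7] [14, 14, 14] 2.67
```

Under these assumptions the stride-1 variant keeps all three attention layers at the higher resolution, at the price of more attention computation.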

2. Results

2.1. Ablation Study on MS COCO

Ablation for Relative Position Encoding.
  • Individual gains are obtained from content-content interaction (qkᵀ) and content-position interaction (qrᵀ), where q, k and r represent the query, key and relative position encodings respectively.
  • When the two are combined (qkᵀ+qrᵀ), the gains on both APbb (box AP) and APmk (mask AP) are additive.
  • Using conventional absolute position encodings (qr_absᵀ) does not provide as much gain as relative position encodings.

2.2. Comparison with ResNet on MS COCO

Comparing R50, R101, R152, BoT50, BoT101 and BoT152.

For example, BoT50 is better than R101 (+0.3% APbb, +0.5% APmk) and competitive with R152 on APmk, which suggests that long-range dependencies are better captured through attention than by stacking convolution layers.

2.3. Larger Resolution on MS COCO

All the models are trained for 72 epochs with a multi-scale jitter of [0.1, 2.0].

BoTNet benefits from training on larger images for all of R50, R101 and R152, and surpasses the previous best published single-model, single-scale instance segmentation result from ResNeSt.

2.4. Image Classification on ImageNet

ImageNet results in an improved training setting.
  • With the standard training setting, the accuracy gain is small. An improved setting is therefore used: 200 epochs, batch size 4096, weight decay 8e-5, RandAugment (2 layers, magnitude 10), and label smoothing of 0.1.
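
For reference, the improved setting can be written down as a small configuration, followed by a sketch of label smoothing. The dictionary keys are hypothetical names, not taken from any particular training framework. The smoothing function uses the common formulation that mixes the one-hot target with a uniform distribution, which is an assumption about the exact variant used.

```python
# Hyperparameters as listed above; the key names are illustrative only.
train_config = {
    "epochs": 200,
    "batch_size": 4096,
    "weight_decay": 8e-5,
    "randaugment_layers": 2,
    "randaugment_magnitude": 10,
    "label_smoothing": 0.1,
}

def smooth_labels(target, num_classes, eps):
    """Mix the one-hot target with a uniform distribution over all classes."""
    off_value = eps / num_classes
    return [off_value + (1.0 - eps if c == target else 0.0)
            for c in range(num_classes)]

probs = smooth_labels(target=2, num_classes=5, eps=train_config["label_smoothing"])
print(probs)  # the target class gets about 0.92, every other class about 0.02
```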

The gains are much more significant in this setting for both BoT-50 (+0.6%) and BoT-S1-50 (+1.4%).

2.5. Scaling BotNet

All backbones along with ViT and DeiT summarized in the form of scatter-plot and Pareto curves.
  • Scaling rules similar to EfficientNet’s compound scaling, which mainly increase model depth and increase image resolution much more slowly, are used to design a family of BoTNets.

ResNets and SENets are strong baselines up to 83% top-1 accuracy. BoTNets scale better beyond 83% top-1 accuracy.


