Brief Review: BoTNet, Bottleneck Transformers for Visual Recognition
BoTNet, By Replacing the Spatial Convolutions in Only 3 Bottleneck Blocks With Self-Attention, Improves Performance Significantly
- BoTNet is obtained by simply replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, which improves performance.
1.1. Bottleneck Transformer
The only difference between a BoT block and a ResNet bottleneck block is the replacement of the spatial 3×3 convolution layer with Multi-Head Self-Attention (MHSA). 4 heads are used.
- (Please feel free to read Transformer & Non-Local Neural Network for more information on self-attention.)
1.2. Positional Embedding
In order to make the attention operation position aware, global (All2all) attention is performed on a 2D feature map with split relative position encodings Rh and Rw for height and width respectively.
- The highlighted blue boxes (position encodings and the value projection), together with the use of multiple heads, are the only three elements that are not present in the Non-Local Neural Network.
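To make the computation concrete, here is a minimal NumPy sketch of all2all self-attention over a 2D feature map with split position encodings. It is written under stated assumptions: the function name `mhsa_2d`, the weight matrices `wq`, `wk`, `wv` (standing in for the pointwise projections) and the random inputs are illustrative and not taken from the paper's code. For brevity, the height and width encodings are simply summed into one encoding per position; the relative-distance indexing, logit scaling and normalization layers of a real implementation are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa_2d(x, wq, wk, wv, rel_h, rel_w, heads=4):
    """x: (H, W, C); wq/wk/wv: (C, C);
    rel_h: (heads, H, 1, d); rel_w: (heads, 1, W, d) with d = C // heads."""
    H, W, C = x.shape
    d = C // heads
    n = H * W  # every position attends to every other position

    def split_heads(t):  # (H, W, C) -> (heads, n, d)
        return t.reshape(n, heads, d).transpose(1, 0, 2)

    q = split_heads(x @ wq)
    k = split_heads(x @ wk)
    v = split_heads(x @ wv)

    # Broadcast-sum the split encodings into one encoding per position.
    r = (rel_h + rel_w).reshape(heads, n, d)

    # Content-content plus content-position logits.
    logits = q @ k.transpose(0, 2, 1) + q @ r.transpose(0, 2, 1)
    attn = softmax(logits, axis=-1)           # (heads, n, n)
    out = attn @ v                            # (heads, n, d)
    return out.transpose(1, 0, 2).reshape(H, W, C), attn

# Illustrative usage on a tiny random feature map (all values hypothetical).
rng = np.random.default_rng(0)
H, W, C, heads = 3, 5, 8, 4
x = rng.normal(size=(H, W, C))
wq, wk, wv = (rng.normal(size=(C, C)) for _ in range(3))
rel_h = rng.normal(size=(heads, H, 1, C // heads))
rel_w = rng.normal(size=(heads, 1, W, C // heads))
out, attn = mhsa_2d(x, wq, wk, wv, rel_h, rel_w, heads=heads)
print(out.shape, attn.shape)
```

The output keeps the spatial shape of the input, which is what lets the layer stand in for the 3×3 convolution inside the bottleneck block.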
1.3. Model Architecture
- The only difference in BoT50 from ResNet-50 (R50) is the use of MHSA layers in c5. For an input resolution of 1024×1024, the MHSA layer in the first block of c5 operates on 64×64 while the remaining two operate on 32×32.
- BoT50 has only 1.2× more multiply-adds than R50. The overhead in training throughput is 1.3×. BoT50 also has 1.2× fewer parameters than R50.
- For dense prediction, R50 uses a convolution with a stride of 2 in the first block of c5. In BoT50, the first BoT block instead uses a 2×2 average pooling with a stride of 2.
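The 1.2× parameter reduction can be sanity-checked with rough arithmetic. The sketch below assumes the standard ResNet-50 layout (three bottleneck blocks in c5 with an inner width of 512 channels and bias-free layers) and an approximate, commonly quoted ResNet-50 size of 25.6M parameters; the helper names are illustrative, and the small number of position-encoding parameters is ignored.

```python
def conv3x3_params(channels):
    """Weights of a bias-free 3x3 convolution with equal input/output channels."""
    return 3 * 3 * channels * channels

def mhsa_projection_params(channels):
    """Weights of the bias-free pointwise query, key and value projections."""
    return 3 * channels * channels

WIDTH = 512          # assumed bottleneck width of c5 in ResNet-50
BLOCKS = 3           # c5 has three bottleneck blocks
R50_PARAMS = 25.6e6  # approximate ResNet-50 parameter count (assumption)

saved = BLOCKS * (conv3x3_params(WIDTH) - mhsa_projection_params(WIDTH))
ratio = R50_PARAMS / (R50_PARAMS - saved)
print(f"parameters saved: {saved:,}")
print(f"R50 / BoT50 parameter ratio: {ratio:.2f}")
```

Under these assumptions the swap removes about 4.7M weights, which gives a ratio of roughly 1.2, in line with the figure quoted above.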
1.4. BoTNet-S1 for Image Classification
- For image classification, the BoTNet design in the c5 block group can be changed to uniformly use a stride of 1 in all the final MHSA layers.
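The effect of the stride change on the resolution seen by each MHSA layer can be sketched with a little arithmetic. This assumes the standard ResNet layout, in which c5 receives a feature map at 1/16 of the input resolution, and that attention runs before the block's downsampling, as the 64×64, 32×32, 32×32 figures quoted earlier imply. The helper name and the 224×224 classification input are illustrative choices.

```python
def c5_attention_sides(image_size, block_strides, c4_stride=16):
    """Side length of the feature map seen by the MHSA layer of each c5 block."""
    side = image_size // c4_stride  # resolution entering c5
    sides = []
    for stride in block_strides:
        sides.append(side)  # attention runs before the block's downsampling
        side //= stride
    return sides

# Default c5 (stride 2 in the first block) versus the S1 variant (stride 1 everywhere).
for name, strides in (("BoT50", (2, 1, 1)), ("BoT-S1", (1, 1, 1))):
    sides = c5_attention_sides(224, strides)
    tokens = [s * s for s in sides]
    print(name, "sides:", sides, "tokens per block:", tokens)
```

With stride 1 throughout, all three attention layers work on the larger feature map, so each sees four times as many positions as the last two layers of the default design.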
2.1. Ablation Study on MS COCO
- Individual gains are obtained from content-content interaction (qkᵀ) and content-position interaction (qrᵀ), where q, k, r represent the query, key and relative position encodings respectively.
- When combined together (qkᵀ + qrᵀ), the gains on both APbb and APmk are additive.
- Using conventional absolute position encodings (qrᵀ with absolute encodings r_abs) does not provide as much gain as relative ones.
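Written out, the variants compared in this ablation differ only in the attention logits. A compact restatement in the notation above, where the symbol ℓ for the logits is introduced here for illustration and the per-head index and any scaling factor are omitted:

```latex
\begin{aligned}
\text{content-content only:} \quad & \ell = q k^{\top} \\
\text{content-position only:} \quad & \ell = q r^{\top} \\
\text{combined:} \quad & \ell = q k^{\top} + q r^{\top} \\
\text{output:} \quad & \operatorname{softmax}(\ell)\, v
\end{aligned}
```

The absolute-encoding variant swaps the relative encodings r for absolute ones, leaving the rest unchanged.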
2.2. Comparison with ResNet on MS COCO
For example, BoT50 outperforms R101 (+0.3% APbb, +0.5% APmk) and is competitive with R152 on APmk, which shows that long-range dependencies are better captured through attention than by stacking convolution layers.
2.3. Larger Resolution on MS COCO
BoTNet benefits from training on larger images for all of the R50, R101 and R152 backbones, and surpasses the previous best published single-model, single-scale instance segmentation result from ResNeSt.
2.4. Image Classification on ImageNet
- With standard training settings, the accuracy gain is small. Improved settings are therefore used: 200 epochs, batch size 4096, weight decay 8e-5, RandAugment (2 layers, magnitude 10), and label smoothing of 0.1.
The gains are much more significant in this setting for both BoT-50 (+0.6%) and BoT-S1–50 (+1.4%).
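As a small illustration of one of the improved settings, the sketch below builds a label-smoothed target with a smoothing value of 0.1. It uses the common formulation that mixes the one-hot target with a uniform distribution; whether the paper's training code uses exactly this variant is an assumption, and the function name and the 5-class example are hypothetical.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return the label-smoothed target distribution for one example."""
    off_value = epsilon / num_classes  # mass spread uniformly over all classes
    target = [off_value] * num_classes
    target[true_class] += 1.0 - epsilon  # remaining mass goes to the true class
    return target

target = smooth_labels(true_class=2, num_classes=5)
print([round(p, 2) for p in target])
```

The true class keeps most of the probability mass while every other class receives a small, equal share, which discourages over-confident predictions.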
2.5. Scaling BoTNet
- A family of BoTNets is designed using scaling rules that mainly increase model depth and increase image resolution much more slowly than EfficientNet's compound scaling rule.