Brief Review — Exploring Plain Vision Transformer Backbones for Object Detection

ViTDet, Plain ViT Backbone With Simple Feature Pyramid

4 min read · May 19, 2024

**ViTDet Uses Plain ViT Backbone With Simple Feature Pyramid**

Exploring Plain Vision Transformer Backbones for Object Detection
ViTDet, by Facebook AI Research
2022 ECCV, Over 530 Citations (Sik-Ho Tsang @ Medium)

ViTDet explores the plain, non-hierarchical Vision Transformer (ViT) as a detection backbone, using only minimal adaptations for fine-tuning:
A simple feature pyramid is built from a single-scale feature map.
Window attention (without shifting) is used with very few cross-window propagation blocks.
Plain ViT backbones are pre-trained as Masked Autoencoders (MAE).

Outline

ViTDet
Results

1. ViTDet

The goal of this paper is to remove the hierarchical constraint on the backbone and to enable explorations of plain-backbone object detection.

1.1. Simple Feature Pyramid

**Right: Proposed Simple feature pyramid**

FPN and its variants build their feature pyramids with lateral connections.

In this paper, a Simple Feature Pyramid (Right) is proposed: feature maps of scales {1/32, 1/16, 1/8, 1/4} are produced using convolutions of strides {2, 1, 1/2, 1/4}, where a fractional stride indicates a deconvolution.
This design makes it possible to use the original ViT backbone for detection, without redesigning pre-training architectures.

The Simple Feature Pyramid is found to perform best when the backbone is a plain ViT.
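The stride arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `simple_pyramid_sizes` and the 1024-pixel input are illustrative assumptions, and the patch size of 16 is inferred from the fact that the stride-1 level has scale 1/16.

```python
from fractions import Fraction


def simple_pyramid_sizes(image_size, patch_size=16,
                         strides=(2, 1, Fraction(1, 2), Fraction(1, 4))):
    """Return (scale, side_length) for each level of a simple feature pyramid.

    Hypothetical helper: every level is derived from the single 1/patch_size
    map. A stride above 1 downsamples; a fractional stride upsamples
    (a deconvolution in the paper's wording).
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    base_side = image_size // patch_size  # side of the single-scale map
    levels = []
    for stride in strides:
        side = Fraction(base_side) / stride
        if side.denominator != 1:
            raise ValueError("stride does not divide the base map evenly")
        scale = Fraction(1, patch_size) / stride
        levels.append((scale, int(side)))
    return levels


# Illustrative input size (an assumption, not taken from the review).
for scale, side in simple_pyramid_sizes(1024):
    print(f"scale {scale}: {side}x{side}")
```

All four levels come from the same last-layer map, so no intermediate backbone features are needed.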

1.2. Window Attention (Without Shifting)

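Unlike ViT, which uses global attention, and unlike Swin, which uses window attention with shifting, ViTDet uses window attention without shifting, which is even simpler. The backbone's blocks are evenly split into 4 subsets (e.g., 6 blocks per subset for the 24-block ViT-L), and one of two strategies is used to let information propagate across windows: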

Global propagation: Global self-attention is used in the last block of each subset. As the number of global blocks is small, the memory and computation cost is feasible.
Convolutional propagation: As an alternative, an extra convolutional block is added after each subset.
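The placement of the global propagation blocks can be sketched as follows. The helper names are hypothetical, and the sketch assumes the blocks are split evenly into 4 subsets, with the last block of each subset using global attention and all other blocks using window attention.

```python
def propagation_block_indexes(depth, num_subsets=4):
    """Return 0-based indexes of the last block in each evenly sized subset."""
    if depth % num_subsets != 0:
        raise ValueError("depth must be divisible by num_subsets")
    blocks_per_subset = depth // num_subsets
    return [(i + 1) * blocks_per_subset - 1 for i in range(num_subsets)]


def attention_schedule(depth, num_subsets=4):
    """Label every block as 'global' or 'window' attention."""
    global_blocks = set(propagation_block_indexes(depth, num_subsets))
    return ["global" if i in global_blocks else "window" for i in range(depth)]


# A 24-block backbone: 4 global blocks, 20 window-attention blocks.
print(propagation_block_indexes(24))
print(attention_schedule(24).count("window"))
```

For convolutional propagation, the same indexes would mark where the extra convolutional block is appended, with every block keeping window attention.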

This backbone adaptation is simple and makes detection fine-tuning compatible with global self-attention pre-training.
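To make the cost argument concrete, the sketch below partitions a token grid into non-overlapping (non-shifted) windows and counts query-key pairs for window versus global attention. The function names, the 64x64 token grid, and the window size of 16 are assumptions chosen so the grid divides evenly; they are not values from the review, and pair counts are only a rough proxy for memory and compute.

```python
def window_partition(height, width, window):
    """Group (row, col) token positions into non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("grid must divide evenly into windows")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([(r, c)
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def attention_pairs(height, width, window=None):
    """Count query-key pairs: all tokens (global) or per window."""
    if window is None:
        tokens = height * width
        return tokens * tokens
    return sum(len(w) ** 2 for w in window_partition(height, width, window))


global_cost = attention_pairs(64, 64)
window_cost = attention_pairs(64, 64, window=16)
print(global_cost // window_cost)  # ratio of global to window pair counts
```

Because the windows never shift, each token only attends within its own window, which is why a few global or convolutional propagation blocks are needed to connect them.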

1.3. Self-Supervised Pretraining

Masked Autoencoders (MAE) are used for self-supervised pretraining.

2. Results

2.1. Ablation Studies

Table 2a compares the global and convolutional propagation strategies against the no-propagation baseline. They gain 1.7 and 1.9 points, respectively, over the baseline.
Table 2b compares different types of residual blocks for convolutional propagation. They all improve over the baseline.
Table 2c studies where cross-window propagation should be located in the backbone. Performing propagation in the last 4 blocks is nearly as good as even placement.
Table 2d compares the number of global propagation blocks to use. The proposed solution of window attention plus a few propagation blocks offers a practical, high-performing tradeoff.

Table 3: Using 4 propagation blocks gives a good trade-off. Convolutional propagation is the most practical. Global self-attention in all 24 blocks is not practical.

Table 4: Masked Autoencoders (MAE) provide strong pre-trained backbones.

2.2. SOTA Comparison

The proposed result with ViT-H is 2.6 better than that with MViTv2-H. Moreover, the plain ViT has a better wall-clock performance as the simpler blocks are more hardware-friendly.

Written by Sik-Ho Tsang