Brief Review — ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Vision and Detection Transformers (ViDT), Solves Disadvantages of DETR and YOLOS

Sik-Ho Tsang
May 12, 2024
AP and latency (milliseconds) Trade-Off

ViDT: An Efficient and Effective Fully Transformer-based Object Detector, by NAVER AI Lab, Google Research, University of California at Merced
2022 ICLR, Over 70 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]

  • Vision and Detection Transformers (ViDT) is proposed, which introduces a reconfigured attention module (RAM) to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder.


  1. Disadvantages of DETR and YOLOS
  2. ViDT
  3. Results

1. Disadvantages of DETR and YOLOS

  • DETR (ViT) is a straightforward integration of DETR and ViT. However, the attention operation at the neck encoder adds significant computational overhead to the detector.
  • YOLOS achieves a neck-free structure by appending randomly initialized learnable [DET] tokens. Yet, YOLOS inherits the drawback of the canonical ViT: the high computational complexity of the global attention operation, which grows quadratically with the number of tokens. Moreover, YOLOS cannot benefit from additional techniques that are essential for better performance, e.g., multi-scale features.
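To get a feel for why global attention is costly, here is a rough sketch that compares it with window-based attention using the complexity formulas given in the Swin Transformer paper. The function names and the example sizes (a 56×56 token grid, 96 channels, 7×7 windows) are only illustrative choices, not numbers taken from the ViDT paper.

```python
def global_attention_flops(h, w, c):
    """Approximate FLOPs of global multi-head self-attention on an h x w token grid."""
    n = h * w
    # 4*n*c^2 for the Q/K/V/output projections, 2*n^2*c for the attention itself.
    return 4 * n * c**2 + 2 * n**2 * c


def window_attention_flops(h, w, c, m=7):
    """Approximate FLOPs of window-based self-attention with m x m windows."""
    n = h * w
    # Each token only attends to the m*m tokens of its own window.
    return 4 * n * c**2 + 2 * m**2 * n * c


# Illustrative sizes only (assumed, not from the ViDT paper).
ratio = global_attention_flops(56, 56, 96) / window_attention_flops(56, 56, 96)
print(f"global / window cost ratio: {ratio:.1f}x")
```

The quadratic term in the token count is what makes canonical ViT backbones expensive for high-resolution detection inputs.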

2. ViDT

Right: ViDT

2.1. Reconfigured Attention Module (RAM)

RAM decomposes the single global attention over [PATCH] and [DET] tokens into three different attentions, namely [PATCH]×[PATCH], [DET]×[DET], and [DET]×[PATCH] attention:

  1. For [PATCH]×[PATCH], the same policy as in Swin Transformer is used, i.e., local window-based attention with relative position bias.
  2. For [DET]×[DET], like YOLOS, 100 learnable [DET] tokens are appended and global self-attention is performed. Learnable positional encoding is used.
  3. For [DET]×[PATCH], it is a cross-attention. ViDT binds [DET]×[DET] and [DET]×[PATCH] attention so that they are processed at once, which increases efficiency. Sinusoidal-based spatial positional encoding is used.
  • All the attention modules in Swin Transformer are replaced with the proposed RAM.
  • ViDT only activates the cross-attention at the last stage. This design choice helps achieve the highest FPS, while achieving similar detection performance.
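The three attention types above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a single head, identity Q/K/V projections, no positional encodings, and a 1-D grouping of tokens as a stand-in for Swin's 2-D (shifted) windows. The name `ram_step` and its `cross` flag are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Plain scaled dot-product attention (single head, no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def ram_step(patch, det, window=4, cross=True):
    """One simplified RAM step. patch: (N, C) tokens, det: (D, C) tokens."""
    n, c = patch.shape
    assert n % window == 0, "token count must be divisible by the window size"

    # [PATCH]x[PATCH]: local attention inside each group of `window` tokens
    # (a 1-D stand-in for Swin's 2-D windows). [DET] tokens are not involved.
    groups = patch.reshape(n // window, window, c)
    patch_out = np.concatenate([attention(g, g, g) for g in groups], axis=0)

    # [DET]x[DET] and [DET]x[PATCH] bound together: [DET] queries attend to
    # the concatenation of [DET] and [PATCH] tokens in a single pass.
    # With cross=False only [DET]x[DET] is computed (as in the earlier stages).
    kv = np.concatenate([det, patch], axis=0) if cross else det
    det_out = attention(det, kv, kv)
    return patch_out, det_out
```

Setting `cross=True` only in the last stage mirrors the design choice described above: the earlier stages skip the more expensive [DET]×[PATCH] term.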

2.2. Encoder-Free Neck Structure

The decoder receives two inputs from Swin Transformer with RAM: (1) [PATCH] tokens generated from each stage, i.e., four multi-scale feature maps {x^l} for l = 1, …, L with L = 4, and (2) [DET] tokens generated from the last stage.
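To make the multi-scale input concrete, the sketch below lists the feature-map shapes produced by the four stages of a standard Swin backbone. It assumes the usual Swin layout (strides 4, 8, 16, 32 with the channel width doubling per stage, 96 base channels as in Swin-Tiny); the helper name is hypothetical and the input size is just an example.

```python
def swin_multiscale_shapes(height, width, base_channels=96, num_stages=4):
    """Return (H_l, W_l, C_l) for each stage l of a standard Swin backbone."""
    shapes = []
    for stage in range(num_stages):
        stride = 4 * 2**stage          # 4, 8, 16, 32
        channels = base_channels * 2**stage
        shapes.append((height // stride, width // stride, channels))
    return shapes


# Example: a 224 x 224 input (assumed size, for illustration only).
for level, (h, w, c) in enumerate(swin_multiscale_shapes(224, 224), start=1):
    print(f"x^{level}: {h * w} [PATCH] tokens with {c} channels")
```

Together with the 100 [DET] tokens from the last stage, these four maps are what the decoder consumes.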

2.3. Auxiliary Techniques for Additional Improvements

  • Auxiliary Decoding Loss: Detection heads consisting of two feed-forward networks (FFNs) for box regression and classification are attached to every decoding layer. All the training losses from detection heads at different scales are added to train the model.
  • Iterative Box Refinement: Each decoding layer refines the bounding boxes based on predictions from the detection head in the previous layer.
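The two techniques above can be combined in one small loop. This is only a sketch under several assumptions: boxes are normalized to (0, 1), the refinement is done in logit space as in Deformable DETR (the review does not state ViDT's exact update rule), predictions are already matched to targets, and a plain L1 box loss stands in for the full detection loss. The per-layer `layer_deltas` would normally come from each layer's detection head; here they are passed in as arrays.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


def decode_with_refinement(init_boxes, layer_deltas, target_boxes):
    """Refine boxes layer by layer and sum an auxiliary L1 loss per layer.

    init_boxes, target_boxes: (num_queries, 4) arrays with values in (0, 1).
    layer_deltas: list of (num_queries, 4) offsets, one per decoding layer.
    """
    boxes = init_boxes
    per_layer_boxes = []
    total_loss = 0.0
    for delta in layer_deltas:
        # Iterative box refinement: update the previous layer's boxes.
        boxes = sigmoid(inverse_sigmoid(boxes) + delta)
        per_layer_boxes.append(boxes)
        # Auxiliary decoding loss: every layer's prediction is supervised.
        total_loss += np.abs(boxes - target_boxes).mean()
    return per_layer_boxes, total_loss
```

Because every layer contributes to `total_loss`, the intermediate layers receive a direct training signal instead of relying only on gradients from the final layer.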

2.4. Knowledge Distillation With Token Matching

Knowledge can be transferred from the large ViDT model by token matching:

  • Two sets of tokens are matched between the student and the teacher: (1) P: the set of [PATCH] tokens, and (2) D: the set of [DET] tokens.
  • A small ViDT model (a student model) can easily benefit from a pre-trained large ViDT (a teacher model) by matching its tokens with those of the large one.
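A token matching loss of this kind could look roughly like the sketch below. It assumes the student and teacher produce the same number of tokens with the same dimension, and that the loss is the average L2 distance per token over both sets, scaled by a coefficient; the function name and the `coeff` parameter are hypothetical, so please check the paper for the exact formulation.

```python
import numpy as np


def token_matching_loss(p_student, d_student, p_teacher, d_teacher, coeff=1.0):
    """Average per-token L2 distance between student and teacher tokens.

    p_*: (num_patch_tokens, dim) [PATCH] tokens.
    d_*: (num_det_tokens, dim) [DET] tokens.
    coeff: weighting coefficient for the distillation term (assumed).
    """
    patch_term = np.linalg.norm(p_student - p_teacher, axis=-1).mean()
    det_term = np.linalg.norm(d_student - d_teacher, axis=-1).mean()
    return coeff * (patch_term + det_term)
```

In training, the teacher tokens would be treated as fixed targets, and this term would be added to the student's usual detection loss.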

3. Results

SOTA Comparisons

ViDT achieves the best trade-off between AP and FPS.

  • (Please read the paper directly for ablation study results.)


