Review — DETRs Beat YOLOs on Real-time Object Detection

RT-DETR: A Better Trade-off Than YOLOv8, YOLOv7, YOLOv6, and YOLOv5

Sik-Ho Tsang
Jun 16, 2024

DETRs Beat YOLOs on Real-time Object Detection
RT-DETR, by Baidu Inc, Peking University
2024 CVPR, Over 140 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] [ViDT] [ViTDet] [PP-YOLOE] 2023 [YOLOv7] [YOLOv8] [Lite DETR] 2024 [YOLOv9] [YOLOv10]

  • Real-Time DEtection TRansformer (RT-DETR) is built in two steps: first improve speed while maintaining accuracy, then improve accuracy while maintaining speed.
  • An efficient hybrid encoder is designed to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed.
  • Then, the uncertainty-minimal query selection is proposed to provide high-quality initial queries to the decoder, thereby improving accuracy.

Outline

  1. Issue in YOLO
  2. RT-DETR
  3. Results

1. Issue in YOLO

Issue in YOLO
  • The execution time of NMS primarily depends on the number of boxes and two thresholds: the confidence threshold and the IoU threshold.
  • As the confidence threshold increases, more prediction boxes are filtered out, and the number of remaining boxes that need to calculate IoU decreases, thus reducing the execution time of NMS.
  • Another observation is that, among YOLO detectors of equivalent accuracy, anchor-free detectors are faster end-to-end than anchor-based ones because they require less NMS time.
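
The dependence of NMS cost on the confidence threshold can be illustrated with a minimal sketch of greedy NMS that also counts IoU computations. This is an illustrative stdlib-only implementation, not the benchmarked code from the paper; the boxes, scores, and the function names `iou` and `nms` are made up for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thresh, iou_thresh):
    """Greedy NMS. Returns (kept indices, number of IoU computations)."""
    # Threshold 1: drop low-confidence boxes before any IoU is computed.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep, iou_calls = [], 0
    while order:
        best = order.pop(0)
        keep.append(best)
        remaining = []
        for i in order:
            iou_calls += 1
            # Threshold 2: suppress boxes that overlap the kept box too much.
            if iou(boxes[best], boxes[i]) <= iou_thresh:
                remaining.append(i)
        order = remaining
    return keep, iou_calls


# Hypothetical toy predictions (not from the paper).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30),
         (21, 21, 31, 31), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.2, 0.05]

keep_low, calls_low = nms(boxes, scores, conf_thresh=0.01, iou_thresh=0.5)
keep_high, calls_high = nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.5)
print(keep_low, calls_low)    # low confidence threshold: more IoU work
print(keep_high, calls_high)  # high confidence threshold: less IoU work
```

With these toy inputs, raising the confidence threshold removes two boxes up front, so fewer pairwise IoU checks are needed. A DETR-style detector avoids this step altogether, which is why its latency does not depend on these thresholds.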

2. RT-DETR

2.1. Overview

RT-DETR Overview
  • The features from the last three stages of the backbone {S3,S4,S5} are fed into the encoder.
  • The efficient hybrid encoder transforms multi-scale features into a sequence of image features through intra-scale feature interaction and cross-scale feature fusion.
  • Subsequently, the uncertainty-minimal query selection is employed to select a fixed number of encoder features to serve as initial object queries for the decoder.
  • Finally, the decoder with auxiliary prediction heads iteratively optimizes object queries to generate categories and boxes.

2.2. Efficient Hybrid Encoder

Efficient Hybrid Encoder
  • In Deformable DETR, the encoder accounts for 49% of the GFLOPs but contributes only 11% of the AP, so a more efficient encoder design is needed.
  • A: DINO-Deformable-R50 with the smaller size data reader and lighter decoder.
  • A → B: Variant B inserts a single-scale Transformer encoder into A, which uses one layer of Transformer block. The multi-scale features share the encoder for intra-scale feature interaction and then concatenate as output.
  • B → C: Variant C introduces cross-scale feature fusion based on B and feeds the concatenated features into the multi-scale Transformer encoder to perform simultaneous intra-scale and cross-scale feature interaction.
  • C → D: Variant D decouples intra-scale interaction and cross-scale fusion by utilizing the single-scale Transformer encoder for the former and a PANet-style structure for the latter.

D → E: Variant E enhances the intra-scale interaction and cross-scale fusion based on D, adopting an efficient hybrid encoder designed by authors.

  • Two components are proposed: Attention-based Intra-scale Feature Interaction (AIFI) and CNN-based Cross-scale Feature Fusion (CCFF).

2.2.1. AIFI

Specifically, AIFI further reduces the computational cost based on variant D by performing the intra-scale interaction only on S5 with the single-scale Transformer encoder.

  • Compared with applying the encoder to all scales (variant D), performing intra-scale interaction on S5 only significantly reduces latency (35% faster) and also improves accuracy (0.4% AP higher).
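
To get a feel for why restricting self-attention to S5 is cheaper, the following rough sketch counts tokens and query-key pairs per scale. It assumes a 640 × 640 input and strides of 8 / 16 / 32 for S3 / S4 / S5; both are assumptions for illustration and are not stated in this review. It counts attention pairs only, so it does not reproduce the measured 35% latency figure.

```python
def num_tokens(image_size, stride):
    """Number of spatial positions in a square feature map."""
    side = image_size // stride
    return side * side


IMAGE_SIZE = 640                          # assumed input resolution
STRIDES = {"S3": 8, "S4": 16, "S5": 32}   # assumed backbone strides

tokens = {name: num_tokens(IMAGE_SIZE, s) for name, s in STRIDES.items()}

# Self-attention compares every token with every other token in the same
# scale, so the number of query-key pairs grows with tokens squared.
pairs = {name: n * n for name, n in tokens.items()}

pairs_every_scale = sum(pairs.values())   # intra-scale attention on S3, S4, S5
pairs_s5_only = pairs["S5"]               # intra-scale attention on S5 only
ratio = pairs_every_scale / pairs_s5_only

print(tokens)
print(pairs_every_scale, pairs_s5_only, ratio)
```

Under these assumptions, S5 has far fewer tokens than S3 or S4, so attention on S5 alone is a small fraction of the per-scale attention cost.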

2.2.2. CCFF

Fusion Block in CCFF
  • The role of the fusion block is to fuse two adjacent scale features into a new feature as above.
  • Two 1 × 1 convolutions are used to adjust the number of channels, N RepBlocks composed of RepConv (RepVGG) are used for feature fusion, and the two-path outputs are fused by element-wise addition.
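  • The hybrid encoder is formulated as:

      Q = K = V = Flatten(S5)
      F5 = Reshape(AIFI(Q, K, V))
      O = CCFF({S3, S4, F5})

    where Reshape restores the flattened feature to the same shape as S5.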

2.3. Uncertainty-minimal Query Selection

  • Prior works based on DETR use the confidence score to select the top K features as queries, which leads to a considerable level of uncertainty in the selected features and results in sub-optimal initialization for the decoder.
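
The vanilla selection described above can be sketched as a top-K pick by classification confidence. This is a simplified illustration, not the authors' code; `select_top_k`, the scores, and the IoU values are hypothetical.

```python
def select_top_k(class_scores, k):
    """Pick the indices of the k features with the highest class confidence.

    class_scores: one list of per-class probabilities per encoder feature.
    """
    confidence = [max(s) for s in class_scores]
    ranked = sorted(range(len(confidence)),
                    key=lambda i: confidence[i], reverse=True)
    return ranked[:k]


# Hypothetical encoder features: class probabilities and the IoU that the
# box predicted from each feature has with its ground-truth box.
class_scores = [
    [0.10, 0.20],
    [0.90, 0.05],
    [0.30, 0.60],
    [0.05, 0.02],
]
ious = [0.80, 0.20, 0.70, 0.10]

selected = select_top_k(class_scores, k=2)
selected_ious = [ious[i] for i in selected]
print(selected, selected_ious)
```

In this toy case the most confident feature has a poorly localized box, and the best-localized feature is not selected at all. That mismatch between classification and localization quality is the uncertainty the paper targets.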

The feature uncertainty U is defined as the discrepancy between the predicted distributions of localization P and classification C, and it is integrated into the loss function so that it is minimized during training.
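
Written out, the definition above takes roughly the following form. This is a reconstruction following the paper's notation as best recalled, where \hat{X} is the encoder feature, \hat{Y} = \{\hat{c}, \hat{b}\} is the prediction, and Y = \{c, b\} is the ground truth; treat the exact symbols as assumptions.

```latex
\mathcal{U}(\hat{\mathcal{X}}) = \left\| \mathcal{P}(\hat{\mathcal{X}}) - \mathcal{C}(\hat{\mathcal{X}}) \right\|,
\qquad \hat{\mathcal{X}} \in \mathbb{R}^{D}

\mathcal{L}(\hat{\mathcal{X}}, \hat{\mathcal{Y}}, \mathcal{Y})
  = \mathcal{L}_{box}(\hat{\mathbf{b}}, \mathbf{b})
  + \mathcal{L}_{cls}\big(\mathcal{U}(\hat{\mathcal{X}}), \hat{\mathbf{c}}, \mathbf{c}\big)
```

The uncertainty enters through the classification term, so features whose classification and localization quality disagree are penalized during training.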

Classification Score Against IoU Score
  • The purple and green dots represent the selected features from the model trained with uncertainty-minimal query selection and vanilla query selection, respectively.

The purple dots are concentrated in the top right of the figure, while the green dots are concentrated in the bottom right. This shows that uncertainty-minimal query selection produces more high-quality encoder features.

2.4. Scaled RT-DETR

  • For the hybrid encoder, the width is controlled by adjusting the embedding dimension and the number of channels, and the depth is controlled by adjusting the number of Transformer layers and RepBlocks.
  • The width and depth of the decoder are controlled by the number of object queries and decoder layers.
  • Furthermore, the inference speed of RT-DETR can be tuned flexibly by changing the number of decoder layers used.

3. Results

3.1. SOTA Comparisons

SOTA Comparisons
  • Compared to YOLOv5-L / PP-YOLOE-L / YOLOv6-L, RT-DETR-R50 improves accuracy by 4.1% / 1.7% / 0.3% AP, increases FPS by 100.0% / 14.9% / 9.1%, and reduces the number of parameters by 8.7% / 19.2% / 28.8%.
  • Compared to YOLOv5-X / PP-YOLOE-X, RT-DETR-R101 improves accuracy by 3.6% / 2.0%, increases FPS by 72.1% / 23.3%, and reduces the number of parameters by 11.6% / 22.4%.
  • Compared to YOLOv7-L / YOLOv8-L, RT-DETR-R50 improves accuracy by 1.9% / 0.2% AP and increases FPS by 96.4% / 52.1%.
  • Compared to YOLOv7-X / YOLOv8-X, RT-DETR-R101 improves accuracy by 1.4% / 0.4% AP and increases FPS by 64.4% / 48.0%.

This shows that the proposed RT-DETR achieves state-of-the-art real-time detection performance.

  • Compared to DINO-Deformable-DETR-R50, RT-DETR-R50 improves accuracy by 2.2% AP and speed by about 21 times (108 FPS vs 5 FPS).
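
The speed figures can be sanity-checked with a few lines of arithmetic. The helper names below are illustrative, and the FPS values are the rounded ones quoted above, so the computed speed-up differs slightly from the "21 times" reported.

```python
def speedup(fps_new, fps_old):
    """How many times faster the new model is."""
    return fps_new / fps_old


def latency_ms(fps):
    """Per-image latency in milliseconds implied by a frame rate."""
    return 1000.0 / fps


def fps_increase_pct(fps_new, fps_old):
    """Relative FPS increase in percent, as used in the YOLO comparisons."""
    return (fps_new / fps_old - 1.0) * 100.0


rt_detr_fps, dino_fps = 108, 5            # rounded values quoted in the text

print(speedup(rt_detr_fps, dino_fps))     # ratio of the two frame rates
print(round(latency_ms(rt_detr_fps), 2))  # ms per image for RT-DETR-R50
print(latency_ms(dino_fps))               # ms per image for DINO-Deformable-DETR-R50
print(fps_increase_pct(2.0, 1.0))         # doubling FPS is a 100% increase
```

Note that a "100.0% FPS increase" in the YOLO comparisons means the frame rate doubled, which is a different scale from the "21 times" speed-up quoted for DINO.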

RT-DETR outperforms all DETRs with the same backbone in both speed and accuracy.

3.2. Ablation Studies

Encoder Variants

The proposed hybrid encoder achieves a better trade-off between speed and accuracy.

Query Selection
  • The encoder features selected by uncertainty-minimal query selection not only increase the proportion of high classification scores (0.82% vs 0.35%) but also provide more high-quality features (0.67% vs 0.30%).

The uncertainty-minimal query selection achieves an improvement of 0.8% AP (48.7% AP vs 47.9% AP).

Decoder

RT-DETR supports flexible speed tuning by adjusting the number of decoder layers without retraining, thus improving its practicality.
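
The idea of trading accuracy for speed by running fewer decoder layers can be sketched with a toy decoder. Everything here is hypothetical (`ToyDecoderLayer`, `ToyDecoder`, the 6-layer depth, and the scalar "queries"); it only illustrates that each layer refines the previous output, so stopping early still yields a usable, less refined result. In the real model this relies on the auxiliary prediction heads attached to the decoder layers.

```python
class ToyDecoderLayer:
    """Stand-in for a decoder layer: moves each query halfway to a target."""

    def __init__(self, target):
        self.target = target

    def __call__(self, queries):
        return [q + 0.5 * (self.target - q) for q in queries]


class ToyDecoder:
    """Stack of layers that can be truncated at inference time."""

    def __init__(self, num_layers, target):
        self.layers = [ToyDecoderLayer(target) for _ in range(num_layers)]

    def forward(self, queries, num_eval_layers=None):
        # Use all layers by default, or only the first num_eval_layers.
        n = len(self.layers) if num_eval_layers is None else num_eval_layers
        if not 1 <= n <= len(self.layers):
            raise ValueError("num_eval_layers out of range")
        for layer in self.layers[:n]:
            queries = layer(queries)
        return queries


decoder = ToyDecoder(num_layers=6, target=1.0)    # depth is illustrative
full = decoder.forward([0.0])                     # all 6 layers
fast = decoder.forward([0.0], num_eval_layers=3)  # first 3 layers only
print(full, fast)
```

The truncated run does half the work and lands further from the target, mirroring the speed and accuracy trade-off described above without changing any weights.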
