Brief Review — Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR

Lite DETR, More Attention on High-Level Features

Sik-Ho Tsang
4 min read · May 28, 2024
AP Against GFLOPs

Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR
Lite DETR, by The Hong Kong University of Science and Technology, International Digital Economy Academy (IDEA), Tsinghua University, South China University of Technology, The Hong Kong University of Science and Technology (Guangzhou)
2023 CVPR, Over 30 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 ... 2021 [Scaled-YOLOv4] [PVT, PVTv1] [Deformable DETR] [HRNetV2, HRNetV2p] [MDETR] [TPH-YOLOv5] 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]

  • Lite DETR is proposed, where an efficient encoder block is designed to update high-level features (corresponding to small-resolution feature maps) and low-level features (corresponding to large-resolution feature maps) in an interleaved way.
  • In addition, to better fuse cross-scale features, a key-aware deformable attention (KDA) is developed to predict more reliable attention weights.

Outline

  1. Problem in DETR
  2. Lite DETR
  3. Results

1. Problem in DETR

  • In multi-scale DETR-like models such as Deformable DETR, the low-level (largest-resolution) feature map contributes about 75% of all encoder tokens, whereas the 3 higher-level scales together account for only about 25%.
  • However, high-level tokens contain compact information and rich semantics to detect most objects.
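The 75% / 25% split can be checked with a little arithmetic. The sketch below assumes four feature maps at strides 8, 16, 32 and 64 and a 1024×1024 input; neither the strides nor the input size is stated in this post, they are only illustrative.

```python
# Token counts per scale for a hypothetical 1024x1024 input.
# Assumption: four feature maps at strides 8, 16, 32, 64.
def token_counts(height, width, strides=(8, 16, 32, 64)):
    return [(height // s) * (width // s) for s in strides]

counts = token_counts(1024, 1024)
total = sum(counts)
low_fraction = counts[0] / total          # largest-resolution (low-level) map
high_fraction = sum(counts[1:]) / total   # three smaller (high-level) maps

print(counts)  # [16384, 4096, 1024, 256]
print(f"low-level: {low_fraction:.1%}, high-level: {high_fraction:.1%}")
# low-level: 75.3%, high-level: 24.7%
```

Each halving of resolution divides the token count by 4, so the ratio is 64 : 16 : 4 : 1 regardless of the input size.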

2. Lite DETR

Lite DETR
  • Following the multi-scale Deformable DETR, Lite DETR is composed of a backbone, a multi-layer encoder, and a multi-layer decoder with prediction heads.

2.1. Interleaved Update

  • The bottleneck for an efficient encoder is the excessive number of low-level feature tokens. The authors therefore propose to prioritize the feature scales in an interleaved manner to trade off precision against efficiency.
  • The efficient encoder block is stacked B times, where each block updates the high-level features A times but updates the low-level features only once, at the end of the block. In this way, a full-scale feature pyramid is still maintained at a much lower computation cost.
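The update order described above can be sketched as a simple schedule. The cost function is only a crude proxy that counts query tokens per update step; it is my own assumption, not the paper's GFLOPs accounting. The token units (21 high-level vs. 64 low-level) and the 6-layer full-update baseline are also hypothetical.

```python
def interleaved_schedule(num_blocks, high_updates_per_block):
    """Per block: A high-level updates, then one low-level update."""
    schedule = []
    for block in range(num_blocks):
        schedule += [("high", block)] * high_updates_per_block
        schedule.append(("low", block))
    return schedule

def query_token_cost(schedule, high_tokens, low_tokens):
    """Crude proxy: total number of query tokens processed by the encoder."""
    per_step = {"high": high_tokens, "low": low_tokens}
    return sum(per_step[kind] for kind, _ in schedule)

# Hypothetical relative token counts: three high-level maps vs. one low-level map.
HIGH, LOW = 21, 64
lite = interleaved_schedule(num_blocks=2, high_updates_per_block=3)
lite_cost = query_token_cost(lite, HIGH, LOW)
full_cost = 6 * (HIGH + LOW)  # assumed baseline: 6 layers, every token is a query

print([kind for kind, _ in lite])
# ['high', 'high', 'high', 'low', 'high', 'high', 'high', 'low']
print(lite_cost, full_cost)  # 254 510
```

Under this proxy the interleaved encoder handles roughly half of the query tokens of the baseline, because the expensive low-level tokens act as queries only B times instead of in every layer.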

2.2. Iterative High-level Feature Cross-Scale Fusion

  • The high-level features F_H serve as queries (Q) to extract features from all scales, including both the low-level and the high-level feature tokens.
  • F'_H denotes the updated high-level features, which are then concatenated with the low-level features F_L.
  • In addition, Key-aware Deformable Attention (KDA) is used; it is introduced in Section 2.4.

2.3. Efficient Low-level Feature Cross-Scale Fusion

  • The initial low-level features are utilized as queries to interact with the updated high-level tokens as well as the original low-level features to update their representation.
  • KDA is also used.
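Putting Sections 2.2 and 2.3 together, one efficient encoder block might look roughly like the sketch below. To keep it short, plain scaled dot-product attention stands in for KDA, and learned projections, residual connections and feed-forward layers are all left out; the function names and token shapes are hypothetical.

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention; a stand-in for KDA in this sketch."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def efficient_encoder_block(f_high, f_low, num_high_updates):
    # Section 2.2: high-level tokens query all scales, repeated A times.
    for _ in range(num_high_updates):
        all_scales = np.concatenate([f_high, f_low], axis=0)
        f_high = attention(f_high, all_scales, all_scales)
    # Section 2.3: the initial low-level tokens query the updated high-level
    # tokens together with the original low-level tokens, once per block.
    all_scales = np.concatenate([f_high, f_low], axis=0)
    f_low = attention(f_low, all_scales, all_scales)
    return f_high, f_low

rng = np.random.default_rng(0)
f_high = rng.normal(size=(21, 8))  # hypothetical: 21 high-level tokens, dim 8
f_low = rng.normal(size=(64, 8))   # hypothetical: 64 low-level tokens, dim 8
new_high, new_low = efficient_encoder_block(f_high, f_low, num_high_updates=3)
print(new_high.shape, new_low.shape)  # (21, 8) (64, 8)
```

Note that the low-level tokens only ever appear as queries in the last step, which is where the savings come from.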

2.4. Key-aware Deformable Attention (KDA)

Key-aware Deformable Attention (KDA)
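  • Unlike the deformable attention in Deformable DETR, which samples only the values (V), KDA samples both the keys (K) and the values (V) at the important points given by the predicted offsets Δp, so that the attention weights can be computed from the query-key dot product.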
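As a rough illustration, a single-head, single-scale version of this idea could be written as follows. This is a minimal NumPy sketch under my own assumptions, not the authors' implementation: the function names, the weight matrices `w_offset`, `w_key`, `w_value`, the pixel-coordinate reference points and the border clamping are all hypothetical choices.

```python
import numpy as np

def bilinear_sample(feature_map, xy):
    """Sample feature_map (H, W, C) at float pixel locations xy (..., 2) as (x, y).
    Locations are clamped to the border (an assumption of this sketch)."""
    h, w, _ = feature_map.shape
    x = np.clip(xy[..., 0], 0, w - 1)
    y = np.clip(xy[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    top = feature_map[y0, x0] * (1 - wx) + feature_map[y0, x1] * wx
    bottom = feature_map[y1, x0] * (1 - wx) + feature_map[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def key_aware_deformable_attention(query, ref_points, feature_map,
                                   w_offset, w_key, w_value):
    n, c = query.shape
    num_points = w_offset.shape[1] // 2
    offsets = (query @ w_offset).reshape(n, num_points, 2)  # Δp from the query
    locations = ref_points[:, None, :] + offsets            # p + Δp
    sampled = bilinear_sample(feature_map, locations)       # (N, M, C)
    keys = sampled @ w_key                                   # K sampled at p + Δp
    values = sampled @ w_value                               # V sampled at p + Δp
    scores = np.einsum("nc,nmc->nm", query, keys) / np.sqrt(c)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over M points
    return np.einsum("nm,nmc->nc", weights, values)

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(16, 16, 8))
query = rng.normal(size=(5, 8))
ref_points = rng.uniform(0, 15, size=(5, 2))
w_offset = rng.normal(scale=0.1, size=(8, 4 * 2))  # 4 sampling points per query
w_key, w_value = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
out = key_aware_deformable_attention(query, ref_points, feature_map,
                                     w_offset, w_key, w_value)
print(out.shape)  # (5, 8)
```

In plain deformable attention the weights would instead come from a linear projection of the query alone; here they depend on what was actually sampled.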

3. Results

3.1. Efficiency Improvements on Deformable DETR

Results for single-scale DETR-based models
  • There are different variants of Lite DETR. For example, Lite-DINO H3L1-(3+1)×2 indicates a model based on DINO that uses three high-level feature scales and one low-level scale (H3L1), and two efficient encoder blocks, each with three high-level fusion steps followed by one low-level fusion step ((3+1)×2).
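To make the naming scheme concrete, here is a small hypothetical helper that splits such a variant name into its numbers. The parser and its dictionary keys are my own illustration and are not part of the Lite DETR code base; it accepts the block count with or without a multiplication sign.

```python
import re

def parse_variant(name):
    """Parse a name such as 'H3L1-(3+1)x2' into its five integers."""
    pattern = r"H(\d+)L(\d+)-\((\d+)\+(\d+)\)\s*[x×]?\s*(\d+)"
    match = re.search(pattern, name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    high_scales, low_scales, high_updates, low_updates, blocks = map(int, match.groups())
    return {
        "high_scales": high_scales,      # number of high-level feature scales
        "low_scales": low_scales,        # number of low-level feature scales
        "high_updates": high_updates,    # high-level fusion steps per block (A)
        "low_updates": low_updates,      # low-level fusion steps per block
        "blocks": blocks,                # number of efficient encoder blocks (B)
        "total_update_steps": (high_updates + low_updates) * blocks,
    }

config = parse_variant("Lite-DINO H3L1-(3+1)×2")
print(config["high_scales"], config["blocks"], config["total_update_steps"])  # 3 2 8
```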

Lite DETR variants achieve performance comparable to Deformable DETR with around 40% of the original encoder GFLOPs.

3.2. Efficiency Improvements on Other DETR-based Models

Results for Deformable DETR-based models

Lite DETR achieves significantly better performance with comparable computational cost. In addition, after plugging in the proposed efficient encoder, the encoder GFLOPs can be reduced by 62% to 78% compared to the original ones while keeping 99% of the original performance.

3.3. Visualization of KDA

Visualization of KDA
  • Compared with deformable attention, KDA introduces keys and can therefore predict more reliable attention weights, especially on low-level feature maps.

In Fig. 5(b) and (c), deformable attention has difficulty focusing on meaningful regions of the largest-scale map S4 in the interleaved encoder. KDA effectively mitigates this phenomenon.


Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.