Brief Review — Lite DETR : An Interleaved Multi-Scale Encoder for Efficient DETR

Lite DETR, More Attention on High-Level Features

May 28, 2024

Lite DETR : An Interleaved Multi-Scale Encoder for Efficient DETR
Lite DETR, by The Hong Kong University of Science and Technology, International Digital Economy Academy (IDEA), Tsinghua University, South China University of Technology, and The Hong Kong University of Science and Technology (Guangzhou)
2023 CVPR, Over 30 Citations (Sik-Ho Tsang @ Medium)
Object Detection
2014 … 2021 [Scaled-YOLOv4] [PVT, PVTv1] [Deformable DETR] [HRNetV2, HRNetV2p] [MDETR] [TPH-YOLOv5] 2022 [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] [TPH-YOLOv5++] [YOLOv6] 2023 [YOLOv7] [YOLOv8] 2024 [YOLOv9]

Lite DETR is proposed, where an efficient encoder block is designed to update high-level features (corresponding to small-resolution feature maps) and low-level features (corresponding to large-resolution feature maps) in an interleaved way.
In addition, to better fuse cross-scale features, a key-aware deformable attention (KDA) is developed to predict more reliable attention weights.

Outline

Problem in DETR
Lite DETR
Results

1. Problem in DETR

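In multi-scale DETR-like models such as Deformable DETR, the low-level (high-resolution) feature map produces an extremely large number of tokens, roughly 75% of the total, whereas the 3 higher-level scales together account for only about 25%.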
However, high-level tokens contain compact information and rich semantics to detect most objects.
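As a rough check of that proportion, here is a minimal sketch that counts tokens per scale. The strides (8, 16, 32, 64) and the 1024×1024 input are assumptions for illustration, not values taken from the text above.

```python
def token_counts(height, width, strides=(8, 16, 32, 64)):
    """Number of tokens per scale, one token per feature-map position.

    The strides are an assumed four-scale setup, lowest level first.
    """
    return [(height // s) * (width // s) for s in strides]


counts = token_counts(1024, 1024)           # [16384, 4096, 1024, 256]
total = sum(counts)
low_level_share = counts[0] / total         # about 0.753
high_level_share = sum(counts[1:]) / total  # about 0.247

print(f"low-level share:  {low_level_share:.3f}")
print(f"high-level share: {high_level_share:.3f}")
```

Each step up the pyramid divides the token count by 4, so the lowest level alone holds 64 of every 85 tokens under these assumptions.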

2. Lite DETR

Following the multi-scale Deformable DETR, Lite DETR is composed of a backbone, a multi-layer encoder, and a multi-layer decoder with prediction heads.

2.1. Interleaved Update

The bottleneck on the way to an efficient encoder is the excessive number of low-level features. Therefore, the authors propose to prioritize different scales of the features in an interleaved manner to achieve a trade-off between precision and efficiency.
The efficient encoder block is stacked for B times, where each block updates high-level features for A times but only updates low-level features once at the end of the block. In this way, a full-scale feature pyramid is still maintained with a much lower computation cost.
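The update order can be sketched as a simple schedule. The function names and the token units below (21 high-level versus 64 low-level, i.e. the 1:4:16:64 ratio of an assumed four-scale pyramid) are hypothetical, and counting query tokens is only a crude proxy for cost that ignores keys, values and feed-forward layers.

```python
def interleaved_schedule(num_blocks, num_high_updates):
    """Update order: A high-level updates, then one low-level update, per block."""
    schedule = []
    for _ in range(num_blocks):
        schedule.extend(["high"] * num_high_updates)
        schedule.append("low")
    return schedule


def query_tokens_processed(schedule, high_tokens, low_tokens):
    """Sum the number of query tokens over all steps of a schedule."""
    per_step = {"high": high_tokens, "low": low_tokens}
    return sum(per_step[step] for step in schedule)


# B = 2 blocks, A = 3 high-level updates per block (assumed example values).
schedule = interleaved_schedule(num_blocks=2, num_high_updates=3)
lite_queries = query_tokens_processed(schedule, high_tokens=21, low_tokens=64)

# Baseline for comparison: 6 encoder layers that each update every token.
full_queries = 6 * (21 + 64)

print(schedule)
print(lite_queries, full_queries)  # 254 510 in these toy units
```

The low-level tokens, which dominate the count, are only used as queries once per block, which is where the saving comes from.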

2.2. Iterative High-level Feature Cross-Scale Fusion

The high-level features F_H serve as queries (Q) to extract features from all scales, including both the low-level and the high-level feature tokens.
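In symbols, this step can be written as follows. The notation is reconstructed from the description here (F_H for high-level tokens, F_L for low-level tokens, F'_H for the updated high-level tokens), so treat it as a sketch rather than a verbatim copy of the paper's equation.

```latex
\begin{aligned}
Q &= F_H, \qquad K = V = \mathrm{Concat}(F_H, F_L), \\
F'_H &= \mathrm{KDA}(Q, K, V), \qquad \mathrm{Output} = \mathrm{Concat}(F'_H, F_L).
\end{aligned}
```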

F'_H denotes the updated high-level features, which are then concatenated with F_L.
In addition, Key-aware Deformable Attention (KDA) is used as the attention operator; it is introduced in Section 2.4.

2.3. Efficient Low-level Feature Cross-Scale Fusion

The initial low-level features are utilized as queries to interact with the updated high-level tokens as well as the original low-level features to update their representation.
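Written out under the same reconstructed notation (F'_H for the updated high-level tokens, F'_L for the updated low-level tokens, both assumptions of this sketch), the low-level step is:

```latex
\begin{aligned}
Q &= F_L, \qquad K = V = \mathrm{Concat}(F'_H, F_L), \\
F'_L &= \mathrm{KDA}(Q, K, V), \qquad \mathrm{Output} = \mathrm{Concat}(F'_H, F'_L).
\end{aligned}
```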

KDA is also used.
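Putting Sections 2.1 to 2.3 together, one efficient encoder block might look like the toy sketch below. Plain dense scaled dot-product attention is swapped in for KDA, and there are no learned projections, feed-forward layers or normalization, so this only illustrates the data flow. All names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_attention(queries, keys_values):
    """Scaled dot-product attention with a residual; a stand-in for KDA."""
    dim = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(dim))
    return queries + weights @ keys_values


def efficient_block(f_high, f_low, num_high_updates):
    """Update high-level tokens A times, then low-level tokens once."""
    for _ in range(num_high_updates):
        # High-level queries attend to all scales.
        all_scales = np.concatenate([f_high, f_low], axis=0)
        f_high = dense_attention(f_high, all_scales)
    # Low-level queries attend to the updated high-level tokens and themselves.
    all_scales = np.concatenate([f_high, f_low], axis=0)
    f_low = dense_attention(f_low, all_scales)
    return f_high, f_low


rng = np.random.default_rng(0)
f_high = rng.standard_normal((21, 8))  # toy high-level tokens
f_low = rng.standard_normal((64, 8))   # toy low-level tokens

for _ in range(2):                     # B = 2 stacked blocks
    f_high, f_low = efficient_block(f_high, f_low, num_high_updates=3)

print(f_high.shape, f_low.shape)
```

Both token sets keep their shapes, so the full feature pyramid is still available to the decoder after the encoder.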

2.4. Key-aware Deformable Attention (KDA)

As in Deformable DETR, the sampling offsets Δp are predicted from the query Q.
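A compact way to write this, assuming a learned projection matrix W^p (the symbol is an assumption of this sketch):

```latex
\Delta p = Q W^{p}
```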

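Unlike Deformable DETR, which samples only the values and predicts attention weights from the query alone, KDA samples both the keys K and the values V at the important points given by Δp, and computes the attention weights from the query-key dot product.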
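This can be sketched in equations as follows. Here S stands for the multi-scale feature maps, p for the reference points, Samp for the sampling operator, and W^V, W^K for learned projections; these symbols are assumed for illustration and reconstructed from the description rather than copied from the paper.

```latex
\begin{aligned}
V &= \mathrm{Samp}(S,\, p + \Delta p)\, W^{V}, \\
K &= \mathrm{Samp}(S,\, p + \Delta p)\, W^{K}, \\
\mathrm{KDA}(Q, K, V) &= \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.
\end{aligned}
```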

3. Results

3.1. Efficiency Improvements on Deformable DETR

**Results for single-scale DETR-based models**

There are different variants of Lite DETR. For example, Lite-DINO H3L1-(3+1)×2 indicates that the model is based on DINO and uses three high-level feature scales (H3L1) and two efficient encoder blocks, each with three high-level fusions ((3+1)×2).
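To make the naming scheme concrete, here is a small hypothetical parser for the configuration tag. The function name and the returned field names are made up for illustration, and reading the "+1" as one low-level update per block follows the description in Section 2.1 rather than an official specification.

```python
import re


def parse_variant(tag):
    """Split a tag such as 'H3L1-(3+1)x2' into its numeric parts."""
    pattern = r"H(\d+)L(\d+)-\((\d+)\+(\d+)\)\s*[x×]?\s*(\d+)"
    match = re.fullmatch(pattern, tag)
    if match is None:
        raise ValueError(f"unrecognized variant tag: {tag!r}")
    high_scales, low_scales, high_updates, low_updates, blocks = map(int, match.groups())
    return {
        "high_scales": high_scales,    # number of high-level feature scales
        "low_scales": low_scales,      # number of low-level feature scales
        "high_updates": high_updates,  # high-level fusions per block (A)
        "low_updates": low_updates,    # low-level fusions per block
        "blocks": blocks,              # number of efficient encoder blocks (B)
    }


config = parse_variant("H3L1-(3+1)x2")
print(config)
```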

Lite DETRs achieve performance comparable to Deformable DETR with around 40% of the original encoder GFLOPs.

3.2. Efficiency Improvements on Other DETR-based Models

**Results for Deformable DETR-based models**

Lite DETR achieves significantly better performance at a comparable computational cost. In addition, after plugging in the proposed efficient encoder, the encoder GFLOPs can be reduced by 62% to 78% compared to the original models while keeping 99% of the original performance.

3.3. Visualization of KDA

Because KDA introduces keys, it can predict more reliable attention weights than deformable attention, especially on low-level feature maps.
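The difference can be illustrated with a toy single-head, single-scale, single-query sketch in NumPy. Every name, shape and random weight below is an assumption for illustration; this is not the paper's implementation. Deformable attention derives its weights from the query alone, while the key-aware variant compares the query with keys read at the sampled points.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bilinear_sample(feature_map, points):
    """Sample an (H, W, C) map at fractional (y, x) points, clamped to the border."""
    h, w, _ = feature_map.shape
    y = np.clip(points[:, 0], 0, h - 1)
    x = np.clip(points[:, 1], 0, w - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (y - y0)[:, None], (x - x0)[:, None]
    top = feature_map[y0, x0] * (1 - wx) + feature_map[y0, x1] * wx
    bottom = feature_map[y1, x0] * (1 - wx) + feature_map[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def deformable_weights(query, w_attn):
    """Deformable attention: weights are a projection of the query only."""
    return softmax(query @ w_attn)


def key_aware_weights(query, sampled, w_key):
    """Key-aware variant: weights come from query-key dot products."""
    keys = sampled @ w_key
    return softmax(keys @ query / np.sqrt(query.shape[0]))


rng = np.random.default_rng(0)
channels, num_points = 8, 4
feature_map = rng.standard_normal((16, 16, channels))
query = rng.standard_normal(channels)
reference = np.array([7.5, 7.5])                      # reference point (y, x)
w_offset = rng.standard_normal((channels, num_points * 2))
w_attn = rng.standard_normal((channels, num_points))
w_key = rng.standard_normal((channels, channels))
w_value = rng.standard_normal((channels, channels))

offsets = (query @ w_offset).reshape(num_points, 2)   # offsets predicted from Q
sampled = bilinear_sample(feature_map, reference + offsets)
values = sampled @ w_value

out_deformable = deformable_weights(query, w_attn) @ values
out_kda = key_aware_weights(query, sampled, w_key) @ values
print(out_deformable.shape, out_kda.shape)
```

In this sketch the deformable weights never see the feature map, so they stay the same whatever content sits at the sampled points, whereas the key-aware weights change with it.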

Fig. 5(b) and (c) of the paper show that deformable attention struggles to focus on meaningful regions of the largest-scale map S4 in the interleaved encoder. KDA effectively mitigates this problem.


Written by Sik-Ho Tsang
