Review — Deformable Transformers for End-to-End Object Detection

Deformable DETR, Faster Convergence

7 min read · Jul 14, 2022

Deformable Transformers for End-to-End Object Detection
Deformable DETR, by SenseTime Research, University of Science and Technology of China, and The Chinese University of Hong Kong
2021 ICLR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Object Detection, Transformer, DETR

The original DETR suffers from slow convergence and limited feature spatial resolution, due to the limitations of Transformer attention modules.
Deformable DETR is proposed, whose attention modules attend only to a small set of key sampling points around a reference point. Deformable DETR achieves better performance than DETR (especially on small objects) with 10× fewer training epochs.

Outline

1. Preliminaries
2. Deformable Attention Module
3. Multi-Scale Deformable Attention Module
4. Deformable DETR
5. Experimental Results

1. Preliminaries

1.1. Multi-Head Attention in Transformers

The multi-head attention module adaptively aggregates the key contents according to the attention weights that measure the compatibility of query-key pairs.
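The multi-head attention feature is calculated by:

$$\text{MultiHeadAttn}(z_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k \in \Omega_k} A_{mqk} \cdot W'_m x_k \Big]$$

where m indexes the attention head (M heads in total), q indexes a query element with feature zq, and k indexes a key element with feature xk in the key set Ωk. W’m and Wm are learnable weights. Amqk is the attention weight, normalized to sum to 1 over the keys, where:

$$A_{mqk} \propto \exp\Big( \frac{z_q^T U_m^T V_m x_k}{\sqrt{C_v}} \Big)$$

in which Um and Vm are also learnable weights, and Cv=C/M is the per-head channel dimension.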
However, long training schedules are required so that the attention weights can focus on specific keys.
And the computational and memory complexity for multi-head attention can be very high with numerous query and key elements.
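
A minimal NumPy sketch of the multi-head attention described in this subsection, for a single query, may help make the shapes concrete. The function name, argument layout and toy sizes are assumptions for illustration only; the per-head weights W’m, Wm, Um, Vm are random here rather than learnt.

```python
import numpy as np

def multi_head_attention(z_q, x, U, V, W_prime, W):
    """Attention output for one query z_q (C,) over keys x (N_k, C).

    U, V, W_prime: (M, C_v, C) per-head projections; W: (M, C, C_v).
    """
    M, C_v, _ = U.shape
    out = np.zeros(W.shape[1])
    for m in range(M):
        # Query-key compatibility: z_q^T U_m^T V_m x_k / sqrt(C_v), for every key k.
        logits = (x @ V[m].T) @ (U[m] @ z_q) / np.sqrt(C_v)
        A = np.exp(logits - logits.max())
        A /= A.sum()                      # softmax over keys, weights sum to 1
        values = x @ W_prime[m].T         # (N_k, C_v): W'_m x_k for every key
        out += W[m] @ (A @ values)        # weighted sum, then project back to C
    return out

# Toy sizes (illustrative assumptions, not values from the paper).
rng = np.random.default_rng(0)
C, M, N_k = 8, 2, 5
C_v = C // M
z_q = rng.normal(size=C)
x = rng.normal(size=(N_k, C))
U, V, W_prime = (rng.normal(size=(M, C_v, C)) for _ in range(3))
W = rng.normal(size=(M, C, C_v))
y = multi_head_attention(z_q, x, U, V, W_prime, W)
```

Every key contributes to every query here, which is where the cost with many query and key elements comes from.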

1.2. DETR

In DETR's Transformer encoder, every pixel of the feature map is both a query and a key, so the computational complexity of self-attention grows quadratically with the spatial size.
DETR has relatively low performance in detecting small objects.
Compared with modern object detectors, DETR requires many more training epochs to converge.
(It’s better to read DETR first.)

2. Deformable Attention Module

Given an input feature map x of size C×H×W, let q index a query element with content feature zq and a 2-d reference point pq. The deformable attention feature is calculated by:

$$\text{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m x(p_q + \Delta p_{mqk}) \Big]$$

where m indexes the attention head (M is the number of heads), k indexes the sampled keys, and K is the total number of sampled keys (K≪HW).

Δpmqk and Amqk denote the sampling offset and attention weight of the kth sampling point in the mth attention head, respectively.
As pq+Δpmqk is fractional, bilinear interpolation is applied.
Both Δpmqk and Amqk are obtained via linear projection over the query feature zq.
The query feature zq is fed to a linear projection operator of 3MK channels, where the first 2MK channels encode the sampling offsets Δpmqk, and the remaining MK channels are fed to a softmax operator to obtain the attention weights Amqk.

In brief, two sets of channels encode the offsets in the x and y directions, and the remaining set of channels encodes the attention weights.
The offsets are learnt, which is similar in concept to the deformable convolution in DCN.
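
The sampling procedure can be sketched in NumPy for a single query and a single-scale feature map. This is a rough illustration under stated assumptions, not the authors' implementation: the reference point is given in pixel units as (x, y), samples outside the map are treated as zero, and the names `bilinear_sample`, `deformable_attention`, `W_proj` and `W_out` are made up for this example.

```python
import numpy as np

def bilinear_sample(x, px, py):
    """Sample feature map x (C, H, W) at fractional location (px, py); zero outside."""
    C, H, W = x.shape
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    out = np.zeros(C)
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < W and 0 <= yi < H:
                weight = (1 - abs(px - xi)) * (1 - abs(py - yi))
                out += weight * x[:, yi, xi]
    return out

def deformable_attention(z_q, p_q, x, W_proj, W_prime, W_out, M, K):
    """z_q: (C,), p_q: (2,) pixel (x, y), x: (C, H, W).

    W_proj: (3*M*K, C) maps the query to offsets and attention logits.
    W_prime: (M, C_v, C); W_out: (M, C, C_v).
    """
    proj = W_proj @ z_q
    offsets = proj[:2 * M * K].reshape(M, K, 2)   # first 2MK channels: (dx, dy)
    logits = proj[2 * M * K:].reshape(M, K)       # remaining MK channels
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)             # softmax over the K points of each head
    out = np.zeros(W_out.shape[1])
    for m in range(M):
        head = np.zeros(W_prime.shape[1])
        for k in range(K):
            px, py = p_q + offsets[m, k]          # fractional sampling location
            head += A[m, k] * (W_prime[m] @ bilinear_sample(x, px, py))
        out += W_out[m] @ head
    return out

# Toy sizes (illustrative assumptions).
rng = np.random.default_rng(0)
C, height, width, M, K = 8, 6, 7, 2, 4
C_v = C // M
x = rng.normal(size=(C, height, width))
z_q = rng.normal(size=C)
p_q = np.array([3.0, 2.0])
W_proj = 0.1 * rng.normal(size=(3 * M * K, C))
W_prime = rng.normal(size=(M, C_v, C))
W_out = rng.normal(size=(M, C, C_v))
y = deformable_attention(z_q, p_q, x, W_proj, W_prime, W_out, M, K)
```

Only M·K locations are read per query, regardless of how large the feature map is.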

Let Nq be the number of query elements. When MK is relatively small, the complexity of the deformable attention module is:

$$O(2 N_q C^2 + \min(HWC^2, N_q K C^2))$$

When it is applied in the DETR encoder, where Nq=HW, the complexity becomes O(HWC²), which is linear in the spatial size.
When it is applied as the cross-attention modules in the DETR decoder, where Nq=N (N is the number of object queries), the complexity becomes O(NKC²), which is independent of the spatial size HW.
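
A small arithmetic sketch can make these two cases tangible. It assumes the bound O(2NqC² + min(HWC², NqKC²)) from the paper for deformable attention and O(NqC² + NkC² + NqNkC) for standard attention, with constants dropped; the function names and the sizes used (C=256, K=4, 300 queries) are illustrative assumptions.

```python
def deformable_attention_cost(n_q, h, w, c, k):
    """Dominant terms of deformable attention: 2*Nq*C^2 + min(HW*C^2, Nq*K*C^2)."""
    return 2 * n_q * c**2 + min(h * w * c**2, n_q * k * c**2)

def standard_attention_cost(n_q, n_k, c):
    """Dominant terms of standard attention: Nq*C^2 + Nk*C^2 + Nq*Nk*C."""
    return n_q * c**2 + n_k * c**2 + n_q * n_k * c

C, K = 256, 4

def encoder_deformable(h, w):
    return deformable_attention_cost(h * w, h, w, C, K)    # Nq = HW

def encoder_standard(h, w):
    return standard_attention_cost(h * w, h * w, C)        # Nq = Nk = HW

# Encoder: quadrupling the number of pixels (32x32 -> 64x64).
deformable_growth = encoder_deformable(64, 64) / encoder_deformable(32, 32)
standard_growth = encoder_standard(64, 64) / encoder_standard(32, 32)

# Decoder cross-attention with a hypothetical 300 object queries.
decoder_small_map = deformable_attention_cost(300, 64, 64, C, K)
decoder_large_map = deformable_attention_cost(300, 128, 128, C, K)
```

With four times as many pixels, the deformable encoder cost grows by exactly four, while the standard encoder cost grows faster; the decoder cost is the same for both map sizes because the min term is set by NqK once HW exceeds it.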

3. Multi-Scale Deformable Attention Module

Multi-scale deformable attention modules replace the Transformer attention modules that process feature maps.

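Let {xl}, for l from 1 to L, be the input multi-scale feature maps, where xl has size C×Hl×Wl. Let ^pq ∈ [0, 1]² be the normalized coordinates of the reference point for each query element q. The multi-scale deformable attention module is then applied as:

$$\text{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m x^l(\Phi_l(\hat{p}_q) + \Delta p_{mlqk}) \Big]$$

where l indexes the input feature level, and Δpmlqk and Amlqk denote the sampling offset and attention weight of the kth sampling point in the lth feature level and the mth attention head. The attention weights are normalized to sum to 1 over all LK sampling points.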

The normalized coordinates (0, 0) and (1, 1) indicate the top-left and the bottom-right image corners, respectively. Φl(^pq) re-scales the normalized coordinates ^pq to the input feature map of the l-th level.
The multi-scale deformable attention is very similar to the previous single-scale version, except that it samples LK points from multi-scale feature maps instead of K points from single-scale feature maps.
The proposed attention module will degenerate to deformable convolution, as in DCN, when L=1, K=1, and W’m is fixed as an identity matrix.

The proposed (multi-scale) deformable attention module can also be perceived as an efficient variant of Transformer attention, where a pre-filtering mechanism is introduced by the deformable sampling locations.
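
To illustrate how one normalized reference point yields LK sampling locations across levels, here is a small pure-Python sketch. It assumes Φl simply scales the normalized coordinates by each level's width and height (real implementations may differ in pixel-center conventions), and all function names and sizes are hypothetical.

```python
import math

def rescale(p_hat, height, width):
    """Assumed form of Phi_l: normalized (x, y) in [0, 1]^2 -> pixel units of one level."""
    return (p_hat[0] * width, p_hat[1] * height)

def sampling_locations(p_hat, level_shapes, offsets):
    """offsets[l][k] is a (dx, dy) pixel offset on level l; returns L*K (level, x, y) tuples."""
    points = []
    for level, (height, width) in enumerate(level_shapes):
        cx, cy = rescale(p_hat, height, width)
        for dx, dy in offsets[level]:
            points.append((level, cx + dx, cy + dy))
    return points

def softmax(logits):
    """Normalize one head's L*K logits so the weights sum to 1 across all levels."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Four levels with K = 4 points each (illustrative sizes); zero offsets for clarity.
level_shapes = [(64, 64), (32, 32), (16, 16), (8, 8)]
K = 4
offsets = [[(0.0, 0.0)] * K for _ in level_shapes]
points = sampling_locations((0.5, 0.25), level_shapes, offsets)
weights = softmax([0.0] * len(points))
```

The same reference point lands at (32, 16) on the finest level and (4, 2) on the coarsest, and a single softmax spans all 16 points, which is what lets one query mix information from different scales.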

4. Deformable DETR

**Deformable DETR Object Detector**

4.1. Deformable Transformer Encoder

**Constructing multi-scale feature maps for Deformable DETR**

In the encoder, multi-scale feature maps {xl}, where l is from 1 to L-1 (L=4), are extracted from the output feature maps of stages C3 through C5 in ResNet (transformed by a 1×1 convolution), where Cl has a resolution 2^l times lower than the input image.
The lowest resolution feature map xL is obtained via a 3×3 stride 2 convolution on the final C5 stage, denoted as C6. All the multi-scale feature maps have C=256 channels.
FPN is not used, since multi-scale deformable attention already exchanges information across feature levels and adding FPN does not improve performance.
(Some more details below. Skip for quick read.)
The encoder outputs multi-scale feature maps with the same resolutions as its input.
A scale-level embedding, denoted as el, is added to the feature representation, in addition to the positional embedding.
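
The resolutions of the four levels described at the start of this subsection can be worked out with a few lines of Python. This is only a sketch: rounding up with `ceil` is an assumption (the exact size depends on convolution padding), and the 512×768 input is a made-up example.

```python
import math

def feature_pyramid_shapes(image_h, image_w, channels=256):
    """Return (name, stride, (C, H_l, W_l)) for C3..C6, where C_l has stride 2**l."""
    shapes = []
    for level in range(3, 7):   # C3, C4, C5 from the backbone; C6 from the extra stride-2 conv
        stride = 2 ** level
        size = (channels, math.ceil(image_h / stride), math.ceil(image_w / stride))
        shapes.append((f"C{level}", stride, size))
    return shapes

shapes = feature_pyramid_shapes(512, 768)
# Total number of pixels over all levels, i.e. the number of encoder queries.
num_tokens = sum(h * w for _, _, (_, h, w) in shapes)
```

For this input the levels are 64×96, 32×48, 16×24 and 8×12, so the encoder processes 8160 query elements in total.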

4.2. Deformable Transformer Decoder

(Some more details below. Please skip for quick read.)
There are cross-attention and self-attention modules in the decoder.
In the cross-attention modules, object queries extract features from the feature maps, where the key elements are the output feature maps from the encoder.
In the self-attention modules, object queries interact with each other, where the key elements are the object queries.
The multi-scale deformable attention module replaces only the cross-attention modules; the self-attention modules are left unchanged.
The 2-d normalized coordinate of the reference point ^pq is predicted from its object query embedding via a learnable linear projection followed by a sigmoid function.
Because the multi-scale deformable attention module extracts image features around the reference point, the detection head only predicts the bounding box as relative offsets w.r.t. the reference point to further reduce the optimization difficulty, which also accelerates the training convergence.
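
A small sketch of this parameterization, assuming the form used in the paper where the box center is the reference point shifted in logit (inverse-sigmoid) space and the width and height pass through a sigmoid. The function names and the clamping constant `eps` are assumptions for this example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def inverse_sigmoid(p, eps=1e-5):
    p = min(max(p, eps), 1.0 - eps)   # clamp so the logit stays finite
    return math.log(p / (1.0 - p))

def reference_point(query_proj):
    """query_proj: the 2 outputs of a linear layer applied to the object query embedding."""
    return tuple(sigmoid(v) for v in query_proj)

def decode_box(head_out, ref):
    """head_out: raw (bx, by, bw, bh) from the detection head; ref: normalized (x, y).

    The center is an offset from the reference point; all outputs lie in (0, 1).
    """
    bx, by, bw, bh = head_out
    cx = sigmoid(bx + inverse_sigmoid(ref[0]))
    cy = sigmoid(by + inverse_sigmoid(ref[1]))
    return (cx, cy, sigmoid(bw), sigmoid(bh))

ref = reference_point((-0.8, 0.9))                  # hypothetical projection outputs
box = decode_box((0.0, 0.0, -1.0, -1.0), ref)       # zero center offset keeps the center at ref
```

With a zero center offset the predicted center coincides with the reference point, so the head only has to learn a correction around the location where features were sampled.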

4.3. Additional Improvements and Variants of DETR

4.3.1. Iterative Bounding Box Refinement

Each decoder layer refines the bounding boxes based on the predictions from the previous layer.
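
As a rough sketch, the refinement can be written as a per-layer correction added in logit space to the previous layer's box, which is the update form given in the paper. The delta values below are made-up stand-ins for decoder-layer outputs, and the helper names are assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def inverse_sigmoid(p, eps=1e-5):
    p = min(max(p, eps), 1.0 - eps)   # clamp so the logit stays finite
    return math.log(p / (1.0 - p))

def refine_boxes(initial_box, deltas_per_layer):
    """Apply one (dx, dy, dw, dh) correction per decoder layer, in logit space."""
    box = initial_box
    history = []
    for delta in deltas_per_layer:
        box = tuple(sigmoid(d + inverse_sigmoid(b)) for d, b in zip(delta, box))
        history.append(box)
    return history

initial = (0.40, 0.50, 0.20, 0.30)                         # normalized (cx, cy, w, h)
deltas = [(0.2, 0.0, 0.1, 0.0), (0.05, 0.0, 0.0, 0.0)]     # hypothetical layer outputs
boxes = refine_boxes(initial, deltas)
```

Each layer starts from the previous estimate, so later layers only need to predict small corrections.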

4.3.2. Two-Stage Deformable DETR

Inspired by two-stage object detectors, an encoder-only variant of Deformable DETR is used to generate region proposals as the first stage.
In it, each pixel is assigned as an object query, which directly predicts a bounding box. The top-scoring bounding boxes are picked as region proposals.
The generated region proposals are fed into the decoder as object queries for further refinement, forming a two-stage Deformable DETR.
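
The proposal selection step amounts to a top-k over per-pixel scores. The following is a minimal sketch with made-up boxes and scores; the function name and the value of `top_k` are assumptions for illustration.

```python
def top_scoring_proposals(boxes, scores, top_k):
    """Keep the top_k boxes by score, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [boxes[i] for i in order], [scores[i] for i in order]

# One predicted box (cx, cy, w, h) and one objectness score per encoder pixel (toy values).
pixel_boxes = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.4), (0.8, 0.2, 0.1, 0.1), (0.4, 0.6, 0.2, 0.3)]
pixel_scores = [0.05, 0.90, 0.30, 0.75]
proposals, proposal_scores = top_scoring_proposals(pixel_boxes, pixel_scores, top_k=2)
```

The selected proposals would then initialize the decoder's object queries for refinement.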

5. Experimental Results

5.1. Comparisons with DETR

**Convergence curves of Deformable DETR and DETR-DC5 on COCO 2017 val set**

**Comparison of Deformable DETR with DETR on COCO 2017 val set**

Compared with Faster R-CNN + FPN, DETR requires many more training epochs to converge, and delivers lower performance at detecting small objects.

Compared with DETR, Deformable DETR achieves better performance (especially on small objects) with 10× fewer training epochs.

Deformable DETR has FLOPs on par with Faster R-CNN + FPN and DETR-DC5. Its runtime is much faster (1.6×) than DETR-DC5, and just 25% slower than Faster R-CNN + FPN. The speed issue of DETR-DC5 is mainly due to the large amount of memory access in Transformer attention.

5.2. Ablation Study

**Ablations for deformable attention on COCO 2017 val set**

Using multi-scale inputs instead of single-scale inputs improves detection accuracy by 1.7% AP, and by 2.9% APS on small objects.
Increasing the number of sampling points K further improves AP by 0.9%. Using multi-scale deformable attention, which allows information exchange among different scale levels, brings an additional 1.5% improvement in AP.
Because cross-level feature exchange is already adopted, adding FPNs does not improve the performance.
When multi-scale attention is not applied and K=1, the (multi-scale) deformable attention module degenerates to deformable convolution, delivering noticeably lower accuracy.

5.3. SOTA Comparisons

**Comparison of Deformable DETR with state-of-the-art methods on COCO 2017 test-dev set**

With ResNet-101 and ResNeXt-101, the proposed method achieves 48.7 AP and 49.0 AP without bells and whistles, respectively.
By using ResNeXt-101 with DCNv2, the accuracy rises to 50.1 AP.
With additional test-time augmentations, the proposed method achieves 52.3 AP.