Review — DETR: End-to-End Object Detection with Transformers

Anchor-Free Object Detection Using ResNet + Transformer

Sik-Ho Tsang
8 min read · Apr 24, 2022

End-to-End Object Detection with Transformers
DETR, by Facebook AI
2020 ECCV, Over 2000 Citations (Sik-Ho Tsang @ Medium)
Object Detection, Panoptic Segmentation, ResNet, Transformer

  • DEtection TRansformer (DETR) is designed: a Transformer encoder-decoder architecture trained with a set-based global loss that forces unique predictions via bipartite matching.

Outline

  1. DETR: DEtection TRansformer
  2. Object Detection Results
  3. Panoptic Segmentation (PS) Results

1. DETR: DEtection TRansformer

DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a Transformer architecture.
  • In contrast to Faster R-CNN and the YOLO series, which need region proposals and anchors respectively, DEtection TRansformer (DETR) predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects.

1.1. Object Detection Set Prediction Loss

  • DETR infers a fixed-size set of N predictions, in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image.
  • One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth. The proposed loss produces an optimal bipartite matching between predicted and ground-truth objects, and then optimizes object-specific (bounding box) losses.
  • Let y be the ground-truth set of objects, and ^y = {^yi}, i = 1, …, N, be the predicted set. Since N is larger than the number of objects in the image, y is also considered a set of size N padded with ∅ (no object).
  • The first step is to find a bipartite matching between these two sets: the permutation of N elements σ with the lowest total matching cost is searched for (the equations are written out after this list),
  • where Lmatch(yi, ^yσ(i)) is a pair-wise matching cost between the ground truth yi and the prediction with index σ(i).
  • The matching cost takes into account both the class prediction and the similarity of predicted and ground-truth boxes, with yi = (ci, bi):
  • bi is a vector that defines the ground-truth box center coordinates and its height and width relative to the image size, i.e. in the range [0, 1];
  • ci is the target class label (which may be ∅).
  • Thus, Lmatch(yi, ^yσ(i)) combines a class term and a box term, as written out after this list.
  • The main difference to region-proposal and anchor-based approaches is that DETR needs to find a one-to-one matching for direct set prediction without duplicates.
  • The second step is to compute the Hungarian loss, which is a linear combination of a negative log-likelihood for class prediction and a box loss (also written out after this list),
  • where ^σ is the optimal assignment computed in the first step. The log-probability term is down-weighted by a factor of 10 when ci = ∅ to tackle the class imbalance issue.
  • The bounding box loss is a linear combination of the L1 loss and the generalized IoU (GIoU) loss.

The L1 loss alone is not sufficient, because it has different scales for small and large boxes even when their relative errors are similar; the GIoU loss is scale-invariant and compensates for this.
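For reference, the assignment, matching cost, Hungarian loss, and box loss described above, as given in the DETR paper, are:

```latex
% Optimal bipartite assignment between ground truth and the N predictions
\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\!\left(y_i, \hat{y}_{\sigma(i)}\right)

% Pair-wise matching cost (only non-empty ground-truth objects contribute)
\mathcal{L}_{\mathrm{match}}\!\left(y_i, \hat{y}_{\sigma(i)}\right) =
  -\,\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\!\left(b_i, \hat{b}_{\sigma(i)}\right)

% Hungarian loss, evaluated at the optimal assignment \hat{\sigma}
\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) =
  \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
  + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\!\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right]

% Bounding box loss: weighted GIoU term plus weighted L1 term
\mathcal{L}_{\mathrm{box}}\!\left(b_i, \hat{b}_{\sigma(i)}\right) =
  \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\!\left(b_i, \hat{b}_{\sigma(i)}\right)
  + \lambda_{\mathrm{L1}}\, \left\lVert b_i - \hat{b}_{\sigma(i)} \right\rVert_1
```

Here ^pσ(i)(ci) is the predicted probability of class ci for the prediction with index σ(i), and λiou, λL1 are scalar weights. In practice, the optimal assignment ^σ is computed efficiently with the Hungarian algorithm; the reference implementation uses scipy.optimize.linear_sum_assignment on the cost matrix between queries and ground-truth objects.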

1.2. DETR Architecture

DETR Architecture
DETR Detailed Architecture
  • DETR uses a conventional CNN backbone to learn a 2D representation of an input image. Typically, the output feature map has C = 2048 channels and spatial size H = H0/32, W = W0/32, where H0 and W0 are the input image height and width.
  • The backbone is ImageNet-pretrained ResNet-50 or ResNet-101 with frozen batch norm. The corresponding models are called respectively DETR and DETR-R101.
  • The feature resolution can be increased by adding a dilation to the last stage of backbone. The corresponding models are called respectively DETR-DC5 and DETR-DC5-R101 (dilated C5 stage).
  • The model flattens this feature map and supplements it with a positional encoding before passing it into a Transformer encoder.
  • A model with 6 encoder and 6 decoder layers of width 256 with 8 attention heads is used.
  • A Transformer decoder then takes as input a small fixed number of learned positional embeddings, called object queries, and additionally attends to the encoder output.

Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.

  • Each output embedding of the decoder is passed to a shared feed-forward network (FFN), a 3-layer perceptron with ReLU activations and hidden dimension d, that predicts either a detection or a “no object” class. A detection consists of the normalized box center coordinates, height and width w.r.t. the input image, and the class label predicted with a softmax. (A minimal sketch of this pipeline follows below.)
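To make the pipeline concrete, below is a minimal PyTorch sketch in the spirit of the simplified demo given in the DETR paper. It is not the official implementation: the fixed-size learned positional encoding, the single linear prediction heads (the real model uses a 3-layer FFN for boxes), and the hyperparameter choices are simplifications for illustration.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Simplified DETR: CNN backbone -> Transformer encoder-decoder -> set of N predictions."""
    def __init__(self, num_classes=91, d_model=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        backbone = resnet50()  # pretrained weights omitted here; the paper uses ImageNet pretraining
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)       # reduce C = 2048 -> 256
        self.transformer = nn.Transformer(d_model, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.query_embed = nn.Parameter(torch.rand(num_queries, d_model))  # learned object queries
        # fixed-size learned 2D positional encoding (a simplification of the paper's sine encoding)
        self.row_embed = nn.Parameter(torch.rand(50, d_model // 2))
        self.col_embed = nn.Parameter(torch.rand(50, d_model // 2))
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for the "no object" class
        self.bbox_head = nn.Linear(d_model, 4)                 # normalized (cx, cy, w, h)

    def forward(self, x):
        feat = self.input_proj(self.backbone(x))   # (B, 256, H0/32, W0/32)
        B, C, H, W = feat.shape
        pos = torch.cat([                           # (H*W, 1, 256) positional encoding
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        src = pos + feat.flatten(2).permute(2, 0, 1)           # (H*W, B, 256) encoder input
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)    # (num_queries, B, 256)
        hs = self.transformer(src, tgt)                        # (num_queries, B, 256)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

model = MinimalDETR()
logits, boxes = model(torch.randn(1, 3, 800, 800))
print(logits.shape, boxes.shape)  # torch.Size([100, 1, 92]) torch.Size([100, 1, 4])
```

At inference time, predictions whose most likely class is “no object” are simply discarded, so no NMS or anchor machinery is needed.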

1.3. Auxiliary Decoding Losses

  • Auxiliary losses are added in the decoder during training, especially to help the model output the correct number of objects of each class.
  • Prediction FFNs and the Hungarian loss are added after each decoder layer (see the sketch after this list). All prediction FFNs share their parameters. An additional shared Layer Norm is used to normalize the inputs to the prediction FFNs from the different decoder layers.
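A conceptual sketch of this (function names here are assumptions, not the official training loop): the same Hungarian set loss is applied to the output of every decoder layer and the per-layer losses are summed.

```python
def detr_training_loss(per_layer_outputs, targets, hungarian_loss):
    """per_layer_outputs: list of (class_logits, pred_boxes), one entry per decoder layer;
    the last entry is the final prediction, earlier ones provide auxiliary supervision.
    hungarian_loss is assumed to implement the set loss from Section 1.1."""
    return sum(hungarian_loss(logits, boxes, targets)
               for logits, boxes in per_layer_outputs)
```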

2. Object Detection Results

2.1. Comparisons with Faster R-CNN

Comparison with Faster R-CNN with ResNet-50 and ResNet-101 backbones on the COCO validation set (‘+’: 9× schedule)
  • Training the baseline model for 300 epochs on 16 V100 GPUs takes 3 days, with 4 images per GPU (hence a total batch size of 64).

DETR can be competitive with Faster R-CNN with the same number of parameters, achieving 42 AP on the COCO val subset.

  • The way DETR achieves this is by improving APL (+7.8); however, the model still lags behind in APS (−5.5).
  • DETR-DC5, with the same number of parameters and a similar FLOP count, has higher AP, but is still significantly behind in APS as well.

2.2. Transformer Layers

Effect of encoder size
  • For the study, a ResNet-50-based DETR model with 6 encoder layers, 6 decoder layers and width 256 is used. The model has 41.3M parameters, achieves 40.6 and 42.0 AP on the short and long schedules respectively, and runs at 28 FPS, similarly to Faster R-CNN-FPN.

Without encoder layers, overall AP drops by 3.9 points, with a more significant drop of 6.0 AP on large objects. It is hypothesized that, by using global scene reasoning, the encoder is important for disentangling objects.

2.3. Encoder Attention

Encoder self-attention for a set of reference points. The encoder is able to separate individual instances.
  • The above figure visualizes the attention maps of the last encoder layer of a trained model, focusing on a few points in the image.

The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.

AP and AP50 performance after each decoder layer
  • Both AP and AP50 improve after every layer, adding up to a very significant +8.2/+9.5 AP improvement between the first and the last layer.
  • When a standard NMS procedure is run after each decoder layer, the improvement brought by NMS diminishes as depth increases.

It is conjectured that in the second and subsequent layers, the self-attention mechanism over the activations allows the model to inhibit duplicate predictions.

2.4. Decoder Attention

Visualizing decoder attention for every predicted object (images from COCO val set).

Decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs. It is hypothesized that after the encoder has separated instances via global attention, the decoder only needs to attend to the extremities to extract the class and object boundaries.

2.5. FFN

  • The authors attempt to remove the FFN completely, leaving only attention in the Transformer layers. This reduces the number of network parameters from 41.3M to 28.7M (only 10.8M remaining in the Transformer), and performance drops by 2.3 AP.

It is concluded that FFNs are important for achieving good results.

2.6. Position Encoding

Results for different positional encodings compared to the baseline (last row)
  • There are two kinds of positional encodings in the model: spatial positional encodings and output positional encodings (object queries).
  • When spatial positional encodings are removed completely and output positional encodings are passed only at the input, interestingly, the model still achieves more than 32 AP, losing 7.8 AP to the baseline.
  • Surprisingly, it is found that not passing any spatial encodings in the encoder only leads to a minor AP drop of 1.3 AP.

The global self-attention in encoder, FFN, multiple decoder layers, and positional encodings, all significantly contribute to the final object detection performance.

2.7. Loss

Effect of loss components on AP
  • There are three components to the loss: classification loss, l1 bounding box distance loss, and GIoU loss.

Using l1 without GIoU shows poor results.
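As an illustrative sketch (not the reference implementation), the combined box loss on matched prediction/ground-truth pairs can be written as below; the weights follow the values reported in the paper (λL1 = 5, λiou = 2), and boxes are assumed to be in normalized (cx, cy, w, h) format.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def box_loss(pred_cxcywh, tgt_cxcywh, lambda_l1=5.0, lambda_giou=2.0):
    """pred_cxcywh, tgt_cxcywh: (M, 4) matched boxes, normalized (cx, cy, w, h)."""
    # L1 term: sensitive to absolute scale, so small and large boxes are weighted differently
    l1 = F.l1_loss(pred_cxcywh, tgt_cxcywh, reduction="none").sum(-1)

    # GIoU term: scale-invariant, computed on corner-format (x1, y1, x2, y2) boxes
    def to_xyxy(b):
        cx, cy, w, h = b.unbind(-1)
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

    giou = torch.diag(generalized_box_iou(to_xyxy(pred_cxcywh), to_xyxy(tgt_cxcywh)))
    return (lambda_giou * (1.0 - giou) + lambda_l1 * l1).mean()
```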

3. Panoptic Segmentation (PS) Results

3.1. DETR Model for Panoptic Segmentation

Illustration of the panoptic head
  • Similarly to the extension of Faster R-CNN to Mask R-CNN, DETR can be naturally extended by adding a mask head on top of the decoder outputs.
  • A binary mask is generated in parallel for each detected object, then the masks are merged using a pixel-wise argmax (see the sketch after this list).
  • To make the final prediction and increase the resolution, an FPN-like architecture is used.
  • DETR is trained for boxes only, then all the weights are frozen and only the mask head is trained for 25 epochs.
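A minimal sketch of the pixel-wise argmax merging step described above (tensor shapes and names are assumptions for illustration; the actual head also upsamples the masks with the FPN-like decoder and drops low-confidence queries first):

```python
import torch

def merge_panoptic(mask_logits: torch.Tensor, query_classes: torch.Tensor) -> torch.Tensor:
    """mask_logits: (N, H, W) binary-mask logits, one map per detected thing/stuff query.
    query_classes: (N,) predicted class id for each query.
    Returns an (H, W) panoptic class map."""
    winning_query = mask_logits.argmax(dim=0)   # each pixel goes to the query with the highest logit
    return query_classes[winning_query]
```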

3.2. SOTA Comparison

Comparison with the state-of-the-art methods UPSNet and Panoptic FPN on the COCO val dataset
  • DETR outperforms published results on COCO val 2017, including UPSNet and the strong Panoptic FPN baseline.
  • DETR is especially dominant on stuff classes, and it is hypothesized that the global reasoning allowed by the encoder attention is the key element to this result.
  • (Please read PS for PQ, SQ, and RQ metrics.)
Qualitative results for panoptic segmentation generated by DETR-R101
  • DETR produces aligned mask predictions in a unified manner for things and stuff.
From left to right: Ground truth, Panoptic FPN with ResNet 101, DETR with ResNet 101
