MDETR — Modulated Detection for End-to-End Multi-Modal Understanding

MDETR, Extending DETR for Phrase Grounding

Sik-Ho Tsang
7 min read · Jun 11


MDETR, Able to Detect a Pink Elephant

MDETR — Modulated Detection for End-to-End Multi-Modal Understanding,
MDETR, by NYU Center for Data Science, Facebook AI Research, and NYU Courant Institute,
2021 ICCV (Sik-Ho Tsang @ Medium)


  • MDETR is proposed, which is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question.
  • A Transformer-based architecture is used to reason jointly over text and image by fusing the two modalities at an early stage of the model.
  • Yann LeCun is one of the co-authors of this paper.


  1. Brief Review of DETR
  2. MDETR: Model Architecture
  3. MDETR: Training
  4. Results

1. Brief Review of DETR

1.1. Model Architecture

  • DETR is an end-to-end detection model composed of a backbone (typically a convolutional residual network), followed by a Transformer Encoder-Decoder.
  • The DETR encoder operates on the flattened 2D image features from the backbone and applies a series of Transformer layers. The decoder takes as input a set of N learned embeddings called object queries, which can be viewed as slots that the model needs to fill with detected objects.
  • All the object queries are fed in parallel to the decoder, which uses cross-attention layers to look at the encoded image and predicts the output embeddings for each of the queries.
  • The final representation of each object query is independently decoded into box coordinates and class labels using a shared feed-forward layer.

1.2. Loss Functions

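  • DETR is trained with a set prediction loss: a bipartite (Hungarian) matching is first computed between the N predicted objects and the ground-truth objects.
  • Each matched object is supervised using the corresponding target as groundtruth, while the un-matched objects are supervised to predict the “no object” label ∅.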
  • The classification head is supervised using standard cross-entropy, while the bounding box head is supervised using a combination of absolute error (L1 loss) and Generalized IoU (GIoU).
  • (Please feel free to read DETR for more details.)
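
As a rough illustration of the box-regression term, the sketch below computes an L1 distance and a GIoU term for one pair of boxes in plain Python. The (x1, y1, x2, y2) corner format, the function names, and the unit weights `w_l1` and `w_giou` are assumptions made for this example, not values taken from DETR.

```python
def l1_loss(pred, target):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def generalized_iou(a, b):
    """GIoU = IoU - (part of the enclosing box not covered by the union) / (enclosing area).

    Assumes both boxes have non-zero area, so no division by zero occurs.
    """
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    hull = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, w_l1=1.0, w_giou=1.0):
    """Weighted sum of L1 and (1 - GIoU); the weights are hypothetical placeholders."""
    return w_l1 * l1_loss(pred, target) + w_giou * (1.0 - generalized_iou(pred, target))


if __name__ == "__main__":
    print(box_loss((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0)))
```

Unlike plain IoU, the GIoU term still gives a useful signal when the two boxes do not overlap, because the enclosing-box penalty grows as the boxes move apart.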

2. MDETR: Model Architecture

  • As in DETR, the image is encoded by a convolutional backbone and flattened. In order to conserve the spatial information, 2-D positional embeddings are added to the flattened feature vector. ResNet and EfficientNet backbones are used.
  • The text is encoded using a pre-trained Transformer language model, RoBERTa, to produce a sequence of hidden vectors of the same length as the input.
  • Then a modality-dependent linear projection is applied to both the image and text features to project them into a shared embedding space. These feature vectors are then concatenated on the sequence dimension to yield a single sequence of image and text features.
  • This sequence is fed to a joint Transformer encoder termed the cross encoder. Following DETR, a Transformer decoder is applied on the object queries while cross-attending to the final hidden state of the cross encoder.
  • The decoder’s output is used for predicting the actual boxes.
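
The projection-and-concatenation step can be pictured with a small shape-level sketch in plain Python. All sizes, the random weights, and the helper names (`linear`, `random_matrix`) are made up for illustration; positional embeddings, the backbone, RoBERTa and the Transformer layers themselves are left out.

```python
import random


def linear(rows, weight):
    """Bias-free linear map: each row (length d_in) times weight (d_in x d_out)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)] for row in rows]


def random_matrix(n_rows, n_cols, rng):
    """Matrix of uniform random values, standing in for learned weights or features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
d_img, d_txt, d_model = 8, 6, 4   # toy feature sizes (hypothetical)
h, w, n_tokens = 2, 3, 5          # toy feature-map size and caption length

# Backbone output: an h x w grid of d_img-dimensional features, flattened row by row.
feature_map = [random_matrix(w, d_img, rng) for _ in range(h)]
image_seq = [cell for row in feature_map for cell in row]   # (h*w) x d_img

# Text encoder output: one d_txt-dimensional hidden vector per token.
text_seq = random_matrix(n_tokens, d_txt, rng)              # n_tokens x d_txt

# One projection per modality into the shared d_model-dimensional space.
w_img = random_matrix(d_img, d_model, rng)
w_txt = random_matrix(d_txt, d_model, rng)

# Concatenate along the sequence dimension: image positions first, then tokens.
joint_seq = linear(image_seq, w_img) + linear(text_seq, w_txt)

print(len(joint_seq), len(joint_seq[0]))
```

The resulting sequence has one entry per image location plus one per text token, which is what lets the cross encoder attend across both modalities in every layer.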

3. MDETR: Training

3.1. Soft Token Prediction

Soft Token Prediction
  • For each matched object, the model predicts the span of tokens in the original text that refers to it. Concretely, the maximum number of tokens for any given sentence is first set to L=256.
  • For each predicted box that is matched to a ground truth box using the bi-partite matching, the model is trained to predict a uniform distribution over all token positions that correspond to the object.
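
A minimal sketch of the soft token target and the soft cross-entropy that supervises it, in plain Python. The function names, the toy `max_tokens=8` (the text above uses L=256), and the example token positions are assumptions for illustration only.

```python
import math


def soft_token_target(token_positions, max_tokens):
    """Uniform distribution over the token positions that refer to the object."""
    positions = set(token_positions)
    target = [0.0] * max_tokens
    for pos in positions:
        target[pos] = 1.0 / len(positions)
    return target


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and the predicted logits."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


if __name__ == "__main__":
    # Hypothetical caption "a pink elephant": the object maps to tokens 1 and 2.
    target = soft_token_target([1, 2], max_tokens=8)
    logits = [0.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(target)
    print(soft_cross_entropy(logits, target))
```

The loss is lowest when the predicted probability mass is spread evenly over exactly the positions of the referring phrase.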

3.2. Contrastive Alignment

  • Consider the maximum number of tokens to be L and maximum number of objects to be N.
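  • Let T_i^+ be the set of tokens that a given object o_i should be aligned to, and O_i^+ be the set of objects to be aligned with a given token t_i.
  • The contrastive loss for all objects, inspired by InfoNCE (CPC), is normalized by the number of positive tokens for each object and can be written as follows:

    l_o = \sum_{i=0}^{N-1} \frac{1}{|T_i^+|} \sum_{j \in T_i^+} -\log \frac{\exp(o_i^\top t_j / \tau)}{\sum_{k=0}^{L-1} \exp(o_i^\top t_k / \tau)}

  • where τ=0.07 is a temperature parameter.
  • By symmetry, the contrastive loss for all tokens, normalized by the number of positive objects for each token, is given by:

    l_t = \sum_{i=0}^{L-1} \frac{1}{|O_i^+|} \sum_{j \in O_i^+} -\log \frac{\exp(t_i^\top o_j / \tau)}{\sum_{k=0}^{N-1} \exp(t_i^\top o_k / \tau)}

  • The average of these two loss functions is used as the contrastive alignment loss.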
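
The two normalized losses and their average can be sketched in plain Python as below. The function and argument names are hypothetical, the embeddings are used exactly as given (no normalization is applied), and skipping objects or tokens that have no positives is an assumption made here to avoid dividing by zero.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def contrastive_alignment_loss(objects, tokens, positive_tokens, tau=0.07):
    """objects: N embeddings; tokens: L embeddings of the same dimension;
    positive_tokens[i]: set of token indices that object i should align to."""
    n, l = len(objects), len(tokens)
    # sim[i][k] = (o_i . t_k) / tau
    sim = [[dot(o, t) / tau for t in tokens] for o in objects]

    # Object-to-token term: softmax over tokens, averaged over each object's positives.
    loss_o = 0.0
    for i in range(n):
        if positive_tokens[i]:
            log_p = log_softmax(sim[i])
            loss_o -= sum(log_p[j] for j in positive_tokens[i]) / len(positive_tokens[i])

    # Token-to-object term: the positive objects of a token follow from positive_tokens.
    loss_t = 0.0
    for k in range(l):
        positives = [i for i in range(n) if k in positive_tokens[i]]
        if positives:
            log_p = log_softmax([sim[i][k] for i in range(n)])
            loss_t -= sum(log_p[i] for i in positives) / len(positives)

    return 0.5 * (loss_o + loss_t)


if __name__ == "__main__":
    objects = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    print(contrastive_alignment_loss(objects, tokens, [{0}, {1}]))  # well aligned
    print(contrastive_alignment_loss(objects, tokens, [{1}, {0}]))  # misaligned
```

With a small temperature such as 0.07, even modest differences in similarity are sharpened, so the misaligned pairing in the usage example is penalized far more heavily than the aligned one.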

3.3. Total Loss

  • The main difference from DETR is that no class label is predicted for each object. Instead, the model predicts a uniform distribution over the positions in the text that correspond to this object (soft token predictions), supervised using a soft cross-entropy.
  • The matching cost consists of this soft token prediction loss in addition to the L1 & GIoU loss between the prediction and the target box, as in DETR.
  • After matching, the total loss consists of the box prediction losses (L1 & GIoU), soft-token prediction loss, and the contrastive alignment loss.
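
To make the matching step concrete, here is a brute-force bipartite matching sketch in plain Python. The function name `match` and the cost values are invented for this example; each entry stands in for the combined soft-token, L1 and GIoU cost of one prediction-target pair. Exhaustive search is only workable for tiny inputs, and a Hungarian solver such as `scipy.optimize.linear_sum_assignment` would normally be used instead.

```python
from itertools import permutations


def match(cost):
    """cost[i][j]: cost of assigning prediction i to target j.

    Assumes at least as many predictions as targets. Returns the
    (prediction_index, target_index) pairs with minimal total cost.
    """
    n_pred, n_tgt = len(cost), len(cost[0])
    best_total, best_preds = float("inf"), ()
    # Try every ordered choice of n_tgt distinct predictions, one per target.
    for preds in permutations(range(n_pred), n_tgt):
        total = sum(cost[p][t] for t, p in enumerate(preds))
        if total < best_total:
            best_total, best_preds = total, preds
    return [(p, t) for t, p in enumerate(best_preds)]


if __name__ == "__main__":
    # 3 predictions, 2 ground-truth boxes (made-up costs).
    cost = [
        [0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6],
    ]
    pairs = match(cost)
    matched = {p for p, _ in pairs}
    unmatched = [p for p in range(len(cost)) if p not in matched]
    print(sorted(pairs))   # matched pairs receive the box and soft-token losses
    print(unmatched)       # remaining predictions are trained towards "no object"
```

Only the matched pairs contribute box losses; every other prediction is pushed towards the "no object" output.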

3.4. Pretraining Dataset

  • A combined dataset is created using images from the Flickr30k [40], MS COCO [26] and Visual Genome (VG) [20] datasets.
  • An image may have several text annotations associated with it. These are combined so that more information is packed into a single training example, which increases data efficiency.
  • The network is pretrained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets, as above, having explicit alignment between phrases in text and objects in the image. Then, it is fine-tuned on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, as below.

4. Results

4.1. CLEVR

  • The CLEVR dataset consists of 3D-rendered scenes containing between 3 and 10 objects of various shapes, materials, sizes and colors.
  • Each of these scenes is associated with about 10 questions that are formulated about the visible objects, generated from a fixed set of templates.
  • The total training set contains 70k images and slightly fewer than 700k questions.

MDETR obtains comparable or higher performance compared with other SOTA approaches. In particular, MDETR achieves 100% accuracy on CLEVR-Ref+.

4.2. Phrase Grounding

Phrase Grounding
  • The task is to provide a set of bounding boxes for each phrase.

With pre-training, MDETR further obtains a 12.1 point boost over the best model’s performance on the test set, while using the same backbone.

4.3. Referring Expression Comprehension

Referring Expression Comprehension.
  • Given an image and a referring expression in plain text, the task is to localize the object being referred to by returning a bounding box around it.

There are large improvements over state-of-the-art across all datasets.

4.4. PhraseCut

PhraseCut (Extension from Bounding Box to Mask)

The model is able to produce clean masks for a wide variety of long-tailed concepts covered by PhraseCut.

4.5. Visual Question Answering

Extend to Visual Question Answering
  • Apart from the 100 queries that are used for detection, additional queries are added: one that predicts the type of question, and others that each specialize in one question type. The types are defined in the GQA annotations as REL, OBJ, GLOBAL, CAT and ATTR.
Visual Question Answering

With a ResNet-101 backbone, MDETR not only outperforms LXMERT and VL-T5 [7], which use a comparable amount of data, but also OSCAR, which uses an order of magnitude more data in its pre-training.

MDETR with the EfficientNet-B5 backbone is able to push performance even higher.

4.6. Few-Shot Transfer for Long-Tailed Detection

  • MDETR is fine-tuned on three subsets of the LVIS train set, containing 1%, 10% and 100% of the training set, respectively.

Even with as little as 1 example per class, MDETR leverages the text pre-training and outperforms a fully fine-tuned DETR on rare categories.

4.7. Visualizations

Visualizations of Some Examples
  • The above figures show some example predictions of MDETR.


