Review — LXMERT: Learning Cross-Modality Encoder Representations from Transformers

LXMERT, A Vision Language Model for VQA, GQA, NLVR²

Sik-Ho Tsang
Aug 6, 2022

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, LXMERT, by UNC Chapel Hill
2019 EMNLP, Over 1000 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model, BERT, Transformer, Faster R-CNN

  • LXMERT (Learning Cross-Modality Encoder Representations from Transformers, pronounced ‘leksmert’) is a pretraining framework proposed to learn vision-and-language connections.
  • After pretraining has captured both intra-modality and cross-modality relationships, the model is fine-tuned on downstream tasks.


  1. LXMERT Model Architecture
  2. LXMERT Pretraining Losses
  3. LXMERT Pretraining Datasets & Details
  4. Experimental Results

1. LXMERT Model Architecture

LXMERT for learning vision-and-language cross-modality representations. ‘Self’ and ‘Cross’ are abbreviations for self-attention sub-layers and cross-attention sub-layers, respectively. ‘FF’ denotes a feed-forward sub-layer.
  • The figure above shows the overall framework; the sub-sections below describe it part by part. (Familiarity with BERT and the Transformer is assumed.)

1.1. Inputs

1.1.1. Word-Level Sentence Embeddings

  • A sentence is split into words {w1, …, wn} of length n by the WordPiece tokenizer used in BERT.
  • The word wi and its index i (wi’s absolute position in the sentence) are each projected to a vector by embedding sub-layers, and the two vectors are then added to give the index-aware word embedding.
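In the notation of the LXMERT paper, this embedding step can be written as follows, where WordEmbed and IdxEmbed are the two embedding sub-layers and h_i is the resulting index-aware word embedding:

```latex
\begin{aligned}
\hat{w}_i &= \mathrm{WordEmbed}(w_i) \\
\hat{u}_i &= \mathrm{IdxEmbed}(i) \\
h_i &= \mathrm{LayerNorm}(\hat{w}_i + \hat{u}_i)
\end{aligned}
```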

1.1.2. Object-Level Image Embeddings

  • Faster R-CNN is used to detect m objects {o1, …, om}, and the features of the detected objects are used as the image embeddings. Each object oj is represented by its position feature pj (i.e., bounding box coordinates) and its 2048-dimensional region-of-interest (RoI) feature fj. A position-aware embedding vj is learned by adding the outputs of two fully-connected layers, one applied to fj and one to pj.
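A minimal sketch of this position-aware embedding, assuming (as in the paper) that each projection is followed by layer normalization and that the two results are averaged. The dimensions are shrunk to toy sizes, the weights are random, and the helper names (`linear`, `random_linear`, `object_embedding`) are illustrative only:

```python
import math
import random

def layer_norm(x, eps=1e-12):
    # Normalize a vector to zero mean and unit variance (no learned scale/shift).
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def linear(x, weight, bias):
    # Fully-connected layer: one output per weight row.
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weight, bias)]

def random_linear(in_dim, out_dim, rng):
    # Randomly initialized stand-in for learned parameters.
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def object_embedding(roi_feat, box, feat_proj, pos_proj):
    f_hat = layer_norm(linear(roi_feat, *feat_proj))  # projected RoI feature f_j
    p_hat = layer_norm(linear(box, *pos_proj))        # projected box coordinates p_j
    return [(f + p) / 2 for f, p in zip(f_hat, p_hat)]

rng = random.Random(0)
hidden = 8     # toy size; LXMERT uses 768
feat_dim = 16  # toy size; the RoI features are 2048-dimensional
feat_proj = random_linear(feat_dim, hidden, rng)
pos_proj = random_linear(4, hidden, rng)

roi_feat = [rng.random() for _ in range(feat_dim)]
box = [0.1, 0.2, 0.6, 0.9]  # hypothetical normalized box coordinates
v_j = object_embedding(roi_feat, box, feat_proj, pos_proj)
print(len(v_j))
```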

1.2. Single-Modality Encoders

  • Each layer in a single-modality encoder contains a self-attention (‘Self’) sub-layer and a feed-forward (‘FF’) sub-layer.
  • A residual connection and layer normalization (+) are added after each sub-layer.

1.2.1. Language Encoder

  • There are NL layers in the language encoder.

1.2.2. Object-Relationship Encoder

  • There are NR layers in the object-relationship encoder.

1.3. Cross-Modality Encoder

  • Each cross-modality layer consists of two self-attention sub-layers, one bi-directional cross-attention sub-layer, and two feed-forward sub-layers.
  • There are NX layers.
  • Inside the k-th layer, the bi-directional cross-attention sub-layer (‘Cross’) is applied first. It contains two unidirectional cross-attention sub-layers, one from language to vision and one from vision to language.
  • Here h and v denote the language and vision features, respectively, as introduced in Section 1.1.

The cross-attention sub-layer is used to exchange the information and align the entities between the two modalities.

  • To further build internal connections, the self-attention sub-layers (‘Self’) are then applied within each modality.
  • Lastly, the k-th layer outputs {h_i^k} and {v_j^k} are produced by feed-forward sub-layers (‘FF’) on top of {ĥ_i^k} and {v̂_j^k}.
  • A residual connection and layer normalization (+) are added after each sub-layer.
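The Cross, Self, FF ordering of one cross-modality layer can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: it assumes single-head dot-product attention without learned query/key/value projections, and it uses an element-wise ReLU as a stand-in for the learned feed-forward network. All function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-12):
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def attend(query, context):
    # Scaled dot-product attention of one query vector over the context vectors.
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * c for q, c in zip(query, ctx)) / scale for ctx in context])
    return [sum(w * ctx[d] for w, ctx in zip(weights, context)) for d in range(len(query))]

def add_norm(x, sublayer_out):
    # Residual connection followed by layer normalization.
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def attention_sublayer(queries, context):
    return [add_norm(q, attend(q, context)) for q in queries]

def feed_forward_sublayer(xs):
    # Stand-in for the learned FF network: element-wise ReLU.
    return [add_norm(x, [max(0.0, a) for a in x]) for x in xs]

def cross_modality_layer(lang, vis):
    # 'Cross': each modality queries the other, using the previous layer's outputs.
    lang_hat = attention_sublayer(lang, vis)
    vis_hat = attention_sublayer(vis, lang)
    # 'Self': each modality then attends to itself.
    lang_self = attention_sublayer(lang_hat, lang_hat)
    vis_self = attention_sublayer(vis_hat, vis_hat)
    # 'FF': feed-forward sub-layers produce the layer outputs.
    return feed_forward_sublayer(lang_self), feed_forward_sublayer(vis_self)

# Toy inputs: 3 word vectors and 2 object vectors of size 4.
lang = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [0.0, 0.2, 0.1, -0.3]]
vis = [[0.2, 0.1, 0.0, -0.4], [-0.3, 0.5, 0.2, 0.1]]
new_lang, new_vis = cross_modality_layer(lang, vis)
print(len(new_lang), len(new_vis))
```

Note that both cross-attention directions read the inputs of the layer, so neither direction sees the other's already-updated features.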

1.4. Outputs

  • The LXMERT cross-modality model has three outputs, for language, vision, and cross-modality, respectively. For the cross-modality output, following the practice in BERT, a special token [CLS] is prepended to the sentence words, and its feature vector in the language output sequence is used.

2. LXMERT Pretraining Losses

LXMERT Pretraining

2.1. Language Task: Masked Cross-Modality LM

  • Words are randomly masked with probability p=0.15, as in BERT.

Different from BERT, masked words are predicted from both the non-masked words in the language modality and the visual modality, to resolve ambiguity.

  • As shown above, it is hard to determine the masked word ‘carrot’ from its language context but the word choice is clear if the visual information is considered.
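The word-masking step can be sketched as below. This is a simplified illustration that always substitutes a `[MASK]` token; BERT's finer-grained replacement scheme is omitted, and the function name `mask_words` is hypothetical.

```python
import random

MASK = "[MASK]"

def mask_words(tokens, rng, p=0.15):
    """Mask each token independently with probability p.

    Returns the masked sequence and, per position, the word to predict
    (None where the token was left unmasked).
    """
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < p:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

tokens = "who is eating the carrot ?".split()
masked, targets = mask_words(tokens, random.Random(0))
print(masked)
print(targets)
```

In LXMERT the prediction at a masked position can draw on the object features as well as the remaining words, which is what lets the visual context disambiguate a word like ‘carrot’.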

2.2. Vision Task: Masked Object Prediction

  • Objects are randomly masked with probability p=0.15, i.e., their RoI features are set to zero.

Two sub-tasks are performed: RoI-Feature Regression regresses the object RoI feature fj with L2 loss, and Detected-Label Classification learns the labels of masked objects with cross-entropy loss.
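A minimal sketch of the object masking and the two losses, using toy vectors. The function names are hypothetical, the L2 loss is taken here as the mean squared difference, and the predictions in the demo are made-up numbers rather than model outputs.

```python
import math
import random

def mask_objects(roi_feats, rng, p=0.15):
    """Zero out each object's RoI feature with probability p."""
    masked, is_masked = [], []
    for feat in roi_feats:
        hide = rng.random() < p
        masked.append([0.0] * len(feat) if hide else list(feat))
        is_masked.append(hide)
    return masked, is_masked

def l2_loss(pred, target):
    # RoI-feature regression: mean squared difference to the original feature.
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(target)

def cross_entropy(logits, label):
    # Detected-label classification: negative log-softmax of the true label.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[label]

# Toy example: one masked object with a 3-dimensional feature and 4 label classes.
original_feat = [0.5, -1.0, 2.0]
predicted_feat = [0.4, -0.8, 1.5]      # hypothetical model prediction
label_logits = [0.1, 2.0, -0.5, 0.3]   # hypothetical model prediction
detected_label = 1                     # label given by the object detector

loss = l2_loss(predicted_feat, original_feat) + cross_entropy(label_logits, detected_label)
print(loss)
```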

2.3. Cross-Modality Tasks

2.3.1. Cross-Modality Matching

  • Each sentence is replaced with a mismatched sentence with probability p=0.5.
  • Then, a classifier is trained to predict whether an image and a sentence match each other. This task is similar to ‘Next Sentence Prediction’ in BERT.
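Building the training examples for this matching task might look like the sketch below. The function name and the toy sentences are hypothetical; label 1 means the sentence matches the image and 0 means it was swapped.

```python
import random

def make_matching_example(index, sentences, rng, p_mismatch=0.5):
    """Return (sentence, label) for the image paired with sentences[index]."""
    if len(sentences) > 1 and rng.random() < p_mismatch:
        # Pick a sentence that belongs to a different image.
        other = rng.choice([i for i in range(len(sentences)) if i != index])
        return sentences[other], 0
    return sentences[index], 1

sentences = [
    "a dog on a sofa",
    "two people riding bikes",
    "a plate of carrots",
]
rng = random.Random(0)
for i in range(len(sentences)):
    print(i, make_matching_example(i, sentences, rng))
```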

2.3.2. Image Question Answering (QA)

  • To enlarge the pre-training dataset, around 1/3 of the sentences in the pre-training data are questions about the images.
  • The model is asked to predict the answer to these image-related questions when the image and the question are matched.

3. LXMERT Pretraining Datasets & Details

3.1. Pretraining Data

  • Pre-training data are aggregated from five vision-and-language datasets whose images come from MS COCO or Visual Genome. Besides these two original captioning datasets, three large image question answering (image QA) datasets are aggregated: VQA v2.0, GQA balanced version, and VG-QA. Only the train and dev splits are used.
  • Thus, there are in total 9.18M image-and-sentence pairs on 180K distinct images. The pre-training data contain around 100M words and 6.5M image objects.

3.2. Pretraining Details

  • Faster R-CNN is pretrained on Visual Genome and then frozen as a feature extractor. Only m=36 objects are kept per image.
  • The numbers of layers NL, NX, and NR are set to 9, 5, and 5 respectively.
  • The hidden size is 768, and the model is pre-trained from scratch.
  • LXMERT is pre-trained with multiple pre-training tasks and hence multiple losses are involved. Equal weights are used.
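With equal weights, the total pre-training loss is simply the sum of the per-task losses from Section 2. A small sketch, where the task keys and the loss values are made up for illustration:

```python
def total_pretraining_loss(task_losses):
    """Sum the per-task losses with equal (unit) weights."""
    return sum(task_losses.values())

# Hypothetical per-task loss values for one batch (made-up numbers).
task_losses = {
    "masked_lm": 2.5,
    "roi_feature_regression": 0.5,
    "detected_label_classification": 1.75,
    "cross_modality_matching": 0.25,
    "image_qa": 1.0,
}
total = total_pretraining_loss(task_losses)
print(total)
```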

4. Experimental Results

4.1. SOTA Comparisons

Test-set Results on VQA, GQA, and NLVR²
  • On NLVR², each sentence comes with two images, and the task is to predict the label y given the images and the statement. The two LXMERT outputs x0 and x1 are therefore concatenated and passed through a weight layer, GELU, layer norm, then another weight layer and softmax.
  • On VQA, LXMERT improves the SotA overall accuracy (‘Accu’) by 2.1% and has 2.4% improvement on the ‘Binary’/‘Other’ question sub-categories.
  • On GQA, there is a 3.2% accuracy gain over the SotA.
  • On NLVR², LXMERT significantly improves the accuracy (‘Accu’ of 76.2%) by 22%.
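The NLVR² classification head described in the first bullet above (concatenate x0 and x1, weight layer, GELU, layer norm, weight layer, softmax) can be sketched as follows. The sizes are toy values, the weights are random rather than trained, and `nlvr2_head` and `random_linear` are hypothetical names.

```python
import math
import random

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(x, eps=1e-12):
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]

def linear(x, weight, bias):
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_linear(in_dim, out_dim, rng):
    # Randomly initialized stand-in for learned parameters.
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def nlvr2_head(x0, x1, first, second):
    z0 = linear(x0 + x1, *first)            # x0 + x1 concatenates the two lists
    z1 = layer_norm([gelu(z) for z in z0])
    return softmax(linear(z1, *second))     # probabilities for the two labels

rng = random.Random(0)
hidden = 4  # toy size; LXMERT uses 768
first = random_linear(2 * hidden, 2 * hidden, rng)
second = random_linear(2 * hidden, 2, rng)

# Stand-ins for the cross-modality outputs of (img0, s) and (img1, s).
x0 = [rng.uniform(-1, 1) for _ in range(hidden)]
x1 = [rng.uniform(-1, 1) for _ in range(hidden)]
probs = nlvr2_head(x0, x1, first, second)
print(probs)
```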

4.2. Comparison with BERT

Dev-set accuracy compared with BERT
  • Loading pre-trained BERT parameters into LXMERT and using them in model training gives weaker results than the full model.

4.3. Ablation Study

Dev-set accuracy showing the importance of the image-QA pre-training task. QA10 means pretraining using QA data for 10 epochs
  • The 2.1% improvement on NLVR² shows that stronger representations are learned with image-QA pre-training, since none of the NLVR² data (images and statements) is used in pre-training.
Dev-set accuracy of different vision pretraining tasks. ‘Feat’ is RoI-feature regression; ‘Label’ is detected-label classification.
  • The two visual pre-training tasks (i.e., RoI-feature regression and detected-label classification) could get reasonable results on their own, and jointly pre-training with these two tasks achieves the highest results.
