Review — LXMERT: Learning Cross-Modality Encoder Representations from Transformers

LXMERT, A Vision Language Model for VQA, GQA, NLVR²

6 min read · Aug 6, 2022

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, LXMERT, by UNC Chapel Hill
2019 EMNLP, Over 1000 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model, BERT, Transformer, Faster R-CNN

The LXMERT (Learning Cross-Modality Encoder Representations from Transformers, pronounced: ‘leksmert’) pretraining framework is proposed to learn vision-and-language connections.
After learning both intra-modality and cross-modality relationships, it is used for fine-tuning.

Outline

LXMERT Model Architecture
LXMERT Pretraining Losses
LXMERT Pretraining Datasets & Details
Experimental Results

1. LXMERT Model Architecture

**LXMERT for learning vision-and-language cross-modality representations.** ‘Self’ and ‘Cross’ are abbreviations for self-attention sub-layers and cross-attention sub-layers, respectively. ‘FF’ denotes a feed-forward sub-layer.

The figure above shows the overall framework, which the sub-sections below describe part by part. (Familiarity with BERT and the Transformer is assumed.)

1.1. Inputs

1.1.1. Word-Level Sentence Embeddings

A sentence is split into words {w1, …, wn} of length n by the WordPiece tokenizer used in BERT.
The word wi and its index i (the absolute position of wi in the sentence) are each projected to a vector by an embedding sub-layer, and the two vectors are added to form the index-aware word embedding hi.
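The embedding step described above can be written out as follows. This is a reconstruction in the notation of the LXMERT paper; the hatted symbols are intermediate names that do not appear elsewhere in this review, and the layer normalization on the sum follows the paper.

```latex
\begin{aligned}
\hat{w}_i &= \mathrm{WordEmbed}(w_i) \\
\hat{u}_i &= \mathrm{IdxEmbed}(i) \\
h_i &= \mathrm{LayerNorm}(\hat{w}_i + \hat{u}_i)
\end{aligned}
```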

1.1.2. Object-Level Image Embeddings

Faster R-CNN is used to detect m objects {o1, …, om}. Then the features of detected objects are used as the embeddings of images. Each object oj is represented by its position feature (i.e., bounding box coordinates) pj and its 2048-dimensional region-of-interest (RoI) feature fj. A position-aware embedding vj is learnt by adding outputs of 2 fully-connected layers:
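A minimal sketch of this position-aware embedding, using only the Python standard library. The toy dimensions, the random weights, and the helper names (`linear`, `layer_norm`, `object_embedding`) are assumptions for illustration. Following the paper, each fully-connected output is layer-normalized and the two are averaged; the real model uses a 2048-dimensional RoI feature and learned gain and shift in the layer norm.

```python
import math
import random

def linear(x, weight, bias):
    """Fully-connected layer; weight holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def object_embedding(roi_feat, box, w_f, b_f, w_p, b_p):
    """Position-aware embedding v_j from RoI feature f_j and box coordinates p_j."""
    f_hat = layer_norm(linear(roi_feat, w_f, b_f))
    p_hat = layer_norm(linear(box, w_p, b_p))
    return [(a + b) / 2 for a, b in zip(f_hat, p_hat)]

# Toy sizes (assumed): the paper uses a 2048-d RoI feature and hidden size 768.
rng = random.Random(0)
FEAT_DIM, BOX_DIM, HIDDEN = 8, 4, 6
w_f = [[rng.gauss(0, 0.1) for _ in range(FEAT_DIM)] for _ in range(HIDDEN)]
w_p = [[rng.gauss(0, 0.1) for _ in range(BOX_DIM)] for _ in range(HIDDEN)]
b_f = [0.0] * HIDDEN
b_p = [0.0] * HIDDEN

f_j = [rng.random() for _ in range(FEAT_DIM)]  # RoI feature of one object
p_j = [0.1, 0.2, 0.5, 0.7]                     # normalized box coordinates
v_j = object_embedding(f_j, p_j, w_f, b_f, w_p, b_p)
```

Normalizing both projections before combining them keeps the visual and positional terms on a comparable scale.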

1.2. Single-Modality Encoders

Each layer in a single-modality encoder contains a self-attention (‘Self’) sub-layer and a feed-forward (‘FF’) sub-layer.
A residual connection and layer normalization (+) are added after each sub-layer.

1.2.1. Language Encoder

There are NL layers in the language encoder.

1.2.2. Object-Relationship Encoder

There are NR layers in the object-relationship encoder.

1.3. Cross-Modality Encoder

Each cross-modality layer consists of two self-attention sub-layers, one bi-directional cross-attention sub-layer, and two feed-forward sub-layers.
There are NX layers.
Inside the k-th layer, the bi-directional cross-attention sub-layer (‘Cross’) is first applied, which contains two unidirectional cross-attention sub-layers, one from language to vision and one from vision to language:
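In the paper's notation, the two unidirectional cross-attention sub-layers can be written as below. This is a reconstruction; L denotes the language side and R the vision (object-relationship) side, and both directions read the outputs of layer k-1.

```latex
\begin{aligned}
\hat{h}_i^k &= \mathrm{CrossAtt}_{L \rightarrow R}\left(h_i^{k-1}, \{v_1^{k-1}, \ldots, v_m^{k-1}\}\right) \\
\hat{v}_j^k &= \mathrm{CrossAtt}_{R \rightarrow L}\left(v_j^{k-1}, \{h_1^{k-1}, \ldots, h_n^{k-1}\}\right)
\end{aligned}
```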

where h and v refer to the language and vision features, respectively, as introduced in Section 1.1.

The cross-attention sub-layer is used to exchange the information and align the entities between the two modalities.
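To make the information exchange concrete, here is a deliberately simplified sketch of bi-directional cross-attention in plain Python. It is an illustration under assumptions, not the paper's implementation: it uses a single head, omits the learned query/key/value projections, and leaves out the residual connection and layer normalization. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, context):
    """Single-head scaled dot-product attention of one query over context vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, ctx)) / scale for ctx in context]
    weights = softmax(scores)
    dim = len(context[0])
    output = [sum(w * ctx[d] for w, ctx in zip(weights, context)) for d in range(dim)]
    return output, weights

def cross_attention(lang_feats, vis_feats):
    """Bi-directional cross-attention; both directions read the same layer inputs."""
    lang_out = [attend(h, vis_feats)[0] for h in lang_feats]  # language -> vision
    vis_out = [attend(v, lang_feats)[0] for v in vis_feats]   # vision -> language
    return lang_out, vis_out

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 word features
v = [[2.0, 0.0], [0.0, 2.0]]              # m = 2 object features
h_hat, v_hat = cross_attention(h, v)
```

Each word feature becomes a weighted mix of object features and vice versa, so the output sequences keep their original lengths (n words, m objects) while carrying information from the other modality.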

To further build internal connections within each modality, the self-attention sub-layers (‘Self’) are then applied to the cross-attention outputs.

Lastly, the k-th layer outputs {h_i^k} and {v_j^k} are produced by feed-forward sub-layers (‘FF’) on top of {ĥ_i^k} and {v̂_j^k}.
A residual connection and layer normalization (+) are added after each sub-layer.

1.4. Outputs

The LXMERT cross-modality model has three outputs: one each for language, vision, and cross-modality. For the cross-modality output, following the practice in BERT, a special token [CLS] is prepended to the sentence words, and the feature vector of this token in the language sequence is used as the cross-modality output.

2. LXMERT Pretraining Losses

2.1. Language Task: Masked Cross-Modality LM

Words are randomly masked with probability p=0.15, as in BERT.

Different from BERT, masked words are predicted from both the non-masked words in the language modality and the visual modality, to resolve ambiguity.
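The masking step can be sketched as below. This is a minimal illustration, assuming every selected word is simply replaced by a `[MASK]` token; the review does not say whether BERT's mixed replacement scheme is used, and the function name `mask_words` is hypothetical. The prediction head, which would read both the remaining words and the object features, is not shown.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_words(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels hold the original word at masked slots."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)    # the model must recover this word
        else:
            masked.append(tok)
            labels.append(None)   # no loss on unmasked positions
    return masked, labels

tokens = "who is eating the carrot ?".split()
masked, labels = mask_words(tokens, rng=random.Random(0))
```

Only positions with a non-`None` label would contribute to the masked cross-modality LM loss.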

As shown above, it is hard to determine the masked word ‘carrot’ from its language context but the word choice is clear if the visual information is considered.

2.2. Vision Task: Masked Object Prediction

Objects are randomly masked with probability p=0.15, i.e., their RoI features are set to 0.

Two sub-tasks are performed: RoI-Feature Regression regresses the object RoI feature fj with L2 loss, and Detected-Label Classification learns the labels of masked objects with cross-entropy loss.
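The object masking and the two sub-task losses can be sketched as follows. This is an illustration under assumptions: the L2 loss is taken as a plain sum of squared differences (the review does not state the reduction), the classification target is the label produced by the detector, and all function names are hypothetical.

```python
import math
import random

def mask_objects(roi_feats, mask_prob=0.15, rng=None):
    """Zero out the RoI feature of each randomly selected object."""
    rng = rng or random.Random()
    masked, is_masked = [], []
    for feat in roi_feats:
        hide = rng.random() < mask_prob
        masked.append([0.0] * len(feat) if hide else list(feat))
        is_masked.append(hide)
    return masked, is_masked

def l2_loss(pred, target):
    """RoI-feature regression: squared L2 distance to the original feature f_j."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def cross_entropy(logits, label):
    """Detected-label classification: negative log-probability of the label."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

roi_feats = [[0.5, 1.0, 0.2], [0.1, 0.3, 0.9]]  # toy 3-d features for two objects
masked_feats, is_masked = mask_objects(roi_feats, rng=random.Random(0))
regression_loss = l2_loss([0.4, 1.1, 0.2], roi_feats[0])
label_loss = cross_entropy([2.0, 0.5, -1.0], label=0)
```

Both losses would be computed only at masked positions, with the original (unmasked) feature and the detector's label as targets.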

2.3. Cross-Modality Tasks

2.3.1. Cross-Modality Matching

Each sentence is replaced with a mismatched sentence with probability p=0.5.
Then, a classifier is trained to predict whether an image and a sentence match each other. This task is similar to ‘Next Sentence Prediction’ in BERT.
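A small sketch of how matched and mismatched training pairs could be drawn. It assumes `sentences[i]` describes image `i` and that a mismatched sentence is sampled uniformly from the other images; the review does not specify the sampling rule, and the function name is hypothetical.

```python
import random

def make_matching_example(index, sentences, replace_prob=0.5, rng=None):
    """Return (sentence, label) for image `index`; label 1 = matched, 0 = mismatched."""
    rng = rng or random.Random()
    if len(sentences) > 1 and rng.random() < replace_prob:
        other = rng.choice([i for i in range(len(sentences)) if i != index])
        return sentences[other], 0
    return sentences[index], 1

sentences = ["a cat on a sofa", "a dog in a park", "a bus on a street"]
sentence, label = make_matching_example(1, sentences, rng=random.Random(0))
```

The binary label is what the matching classifier would be trained to predict from the cross-modality output.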

2.3.2. Image Question Answering (QA)

To enlarge the pre-training dataset, around 1/3 of the sentences in the pre-training data are questions about the images.
The model is asked to predict the answer to these image-related questions when the image and the question are matched.

3. LXMERT Pretraining Datasets & Details

3.1. Pretraining Data

Pre-training data are aggregated from five vision-and-language datasets whose images come from MS COCO or Visual Genome. Besides the above two original captioning datasets, three large image question answering (image QA) datasets are aggregated: VQA v2.0, GQA balanced version, and VG-QA. Only the train and dev splits are used.
Thus, there are in total 9.18M image-and-sentence pairs on 180K distinct images. The pre-training data contain around 100M words and 6.5M image objects.
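As a quick sanity check on these figures, the arithmetic below derives the average number of sentences per image and the object count, assuming the 36 objects kept per image stated in Section 3.2. The variable names are illustrative only.

```python
pairs = 9_180_000        # image-and-sentence pairs
images = 180_000         # distinct images
objects_per_image = 36   # m = 36, from Section 3.2

pairs_per_image = pairs / images              # 51.0 sentences per image on average
total_objects = images * objects_per_image    # 6,480,000, i.e. roughly 6.5M

print(pairs_per_image)
print(total_objects)
```

The derived object count agrees with the roughly 6.5M image objects quoted above.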

3.2. Pretraining Details

Faster R-CNN is pretrained on Visual Genome, and then frozen as feature extractor. Only m=36 objects are kept.
The numbers of layers NL, NX, and NR are set to 9, 5, and 5 respectively.
The hidden size is 768, and all encoder and embedding parameters are pre-trained from scratch.
LXMERT is pre-trained with multiple pre-training tasks and hence multiple losses are involved. Equal weights are used.
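The settings above, together with the equal-weight loss, can be summarized in a short sketch. The class name `PretrainSettings`, its field names, and the loss keys are hypothetical and do not correspond to any library API; the loss values are placeholders, not reported numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainSettings:
    num_language_layers: int = 9   # N_L
    num_cross_layers: int = 5      # N_X
    num_relation_layers: int = 5   # N_R
    hidden_size: int = 768
    num_objects: int = 36          # m, objects kept per image

def total_loss(losses):
    """Equal-weight combination: a plain sum of the per-task losses."""
    return sum(losses.values())

settings = PretrainSettings()
example_losses = {  # placeholder values for illustration only
    "masked_lm": 1.0,
    "roi_feature_regression": 2.0,
    "detected_label_classification": 0.5,
    "cross_modality_matching": 0.25,
    "image_qa": 0.25,
}
loss = total_loss(example_losses)
```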

4. Experimental Results

4.1. SOTA Comparisons

**Test-set Results on VQA, GQA, and NLVR²**

On NLVR², each example pairs one statement with two images, and the task is to predict the label y given the images and the statement. LXMERT therefore encodes each image together with the statement, and the two cross-modality outputs x0 and x1 are concatenated and passed through a linear layer, GELU, layer normalization, and a second linear layer followed by softmax.
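The classifier head described here can be sketched in plain Python as follows. The toy dimensions, random weights, two-way output, and helper names are assumptions for illustration; the sketch follows this review's description of the head, and the exact output nonlinearity in the original paper may differ.

```python
import math
import random

def linear(x, weight, bias):
    """Fully-connected layer; weight holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned gain/shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def softmax(z):
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def nlvr2_head(x0, x1, w0, b0, w1, b1):
    """Concatenate the two cross-modality outputs, then linear-GELU-LayerNorm-linear-softmax."""
    z = linear(x0 + x1, w0, b0)            # list concatenation plays the role of [x0; x1]
    z = layer_norm([gelu(a) for a in z])
    return softmax(linear(z, w1, b1))

# Toy sizes (assumed): the real cross-modality output has hidden size 768.
rng = random.Random(0)
HIDDEN, MID, CLASSES = 4, 6, 2
w0 = [[rng.gauss(0, 0.5) for _ in range(2 * HIDDEN)] for _ in range(MID)]
w1 = [[rng.gauss(0, 0.5) for _ in range(MID)] for _ in range(CLASSES)]
b0, b1 = [0.0] * MID, [0.0] * CLASSES
x0 = [rng.random() for _ in range(HIDDEN)]  # output for (image 0, statement)
x1 = [rng.random() for _ in range(HIDDEN)]  # output for (image 1, statement)
probs = nlvr2_head(x0, x1, w0, b0, w1, b1)
```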

On VQA, LXMERT improves the SotA overall accuracy (‘Accu’) by 2.1% and achieves a 2.4% improvement on the ‘Binary’/‘Other’ question sub-categories.
On GQA, there is a 3.2% accuracy gain over the SotA.
On NLVR², LXMERT significantly improves the accuracy by 22% absolute, reaching an ‘Accu’ of 76.2%.

4.2. Comparison with BERT

**Dev-set accuracy compared with BERT**

The authors also try loading BERT parameters into LXMERT before training. This gives weaker results than the full model, which is pre-trained from scratch.

4.3. Ablation Study

**Dev-set accuracy showing the importance of the image-QA pre-training task.** QA10 means pre-training using QA data for 10 epochs.

The 2.1% improvement on NLVR² shows that image-QA pre-training yields stronger representations, since none of the NLVR² data (images or statements) is used in pre-training.

**Dev-set accuracy of different vision pretraining tasks.** ‘Feat’ is RoI-feature regression; ‘Label’ is detected-label classification.

The two visual pre-training tasks (i.e., RoI-feature regression and detected-label classification) could get reasonable results on their own, and jointly pre-training with these two tasks achieves the highest results.

[2019 EMNLP] [LXMERT]
LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Visual/Vision/Video Language Model (VLM)

2019 [VideoBERT] [VisualBERT] [LXMERT] 2020 [ConVIRT]
