Review — ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

ViLBERT (Vision-and-Language BERT)

Sik-Ho Tsang
5 min read · Aug 27, 2022

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, ViLBERT, by Georgia Institute of Technology, Oregon State University, and Facebook AI Research
2019 NeurIPS, Over 1400 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model (VLM)

  • ViLBERT (short for Vision-and-Language BERT) is proposed for learning task-agnostic joint representations of image content and natural language.
  • The popular BERT architecture is extended to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional Transformer layers.

Outline

  1. ViLBERT (Vision-and-Language BERT)
  2. Training Tasks and Objectives
  3. Experimental Results

1. ViLBERT (Vision-and-Language BERT)

ViLBERT model consists of two parallel streams for visual (green) and linguistic (purple) processing that interact through novel co-attentional layers (Dashed boxes with multiplier subscripts denote repeated blocks of layers.)

1.1. ViLBERT: Extending to Jointly Represent Images and Text

  • The ViLBERT model consists of two parallel BERT-style streams for visual (green) and linguistic (purple) processing that interact through novel co-attentional layers.
  • This structure allows for variable depths for each modality and enables sparse interaction through co-attention.
  • Each stream is a series of Transformer blocks (TRM) and novel co-attentional Transformer layers (Co-TRM).

Given an image I represented as a set of region features v1, …, vT and a text input w0, …, wT, the model outputs final representations hv0, …, hvT and hw0, …, hwT.
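
To make the interleaving concrete, below is a minimal PyTorch-style sketch (not the authors' code) of how the two streams alternate intra-modal Transformer (TRM) blocks with cross-modal Co-TRM blocks. The module names and the equal block count per stream are simplifying assumptions (in ViLBERT the two streams can have different depths), and the Co-TRM placeholder is fleshed out in Section 1.2.

```python
import torch
import torch.nn as nn

class IdentityCoTRM(nn.Module):
    """Placeholder for a co-attentional (Co-TRM) layer; see Section 1.2."""
    def forward(self, h_v, h_w):
        return h_v, h_w

class TwoStreamSketch(nn.Module):
    """Hypothetical two-stream layout: separate TRM stacks per modality,
    interleaved with co-attentional blocks for sparse cross-modal interaction."""
    def __init__(self, d_txt=768, d_vis=1024, n_blocks=6):
        super().__init__()
        self.txt_trm = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_txt, nhead=12, batch_first=True)
             for _ in range(n_blocks)])
        self.vis_trm = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_vis, nhead=8, batch_first=True)
             for _ in range(n_blocks)])
        self.co_trm = nn.ModuleList([IdentityCoTRM() for _ in range(n_blocks)])

    def forward(self, h_w, h_v):
        # h_w: (B, T_w, d_txt) text token states; h_v: (B, T_v, d_vis) region features.
        for txt_blk, vis_blk, co_blk in zip(self.txt_trm, self.vis_trm, self.co_trm):
            h_v, h_w = co_blk(h_v, h_w)   # cross-modal exchange (Co-TRM)
            h_w = txt_blk(h_w)            # intra-modal refinement, text (TRM)
            h_v = vis_blk(h_v)            # intra-modal refinement, vision (TRM)
        return h_v, h_w                   # final hv0..hvT and hw0..hwT
```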

1.2. Co-Attentional Layers

(a) Standard Transformer encoder block; (b) the proposed co-attention layer
  • Given intermediate visual and linguistic representations H(i)V and H(j)W, the module computes query, key, and value matrices as in a standard Transformer block. However, the keys and values from each modality are passed as input to the other modality’s multi-headed attention block.

The attention block produces attention-pooled features for each modality conditioned on the other — in effect performing image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream.
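
As a hedged illustration of this key/value swap, the following minimal PyTorch sketch (assumed dimensions, not the official implementation) routes each modality’s queries to the other modality’s keys and values; the residual connections and feed-forward sublayers of the full block are omitted.

```python
import torch
import torch.nn as nn

class CoAttentionSketch(nn.Module):
    """Hypothetical Co-TRM core: each modality's queries attend over the
    OTHER modality's keys/values."""
    def __init__(self, d_vis=1024, d_txt=768, n_heads_vis=8, n_heads_txt=12):
        super().__init__()
        # Visual-stream attention consumes linguistic keys/values ...
        self.vis_attends_txt = nn.MultiheadAttention(
            embed_dim=d_vis, num_heads=n_heads_vis,
            kdim=d_txt, vdim=d_txt, batch_first=True)
        # ... and linguistic-stream attention consumes visual keys/values.
        self.txt_attends_vis = nn.MultiheadAttention(
            embed_dim=d_txt, num_heads=n_heads_txt,
            kdim=d_vis, vdim=d_vis, batch_first=True)

    def forward(self, h_v, h_w):
        # h_v: (B, T_v, d_vis) visual states; h_w: (B, T_w, d_txt) linguistic states.
        v_out, _ = self.vis_attends_txt(query=h_v, key=h_w, value=h_w)
        w_out, _ = self.txt_attends_vis(query=h_w, key=h_v, value=h_v)
        # v_out: image-conditioned language attention feeding the visual stream;
        # w_out: language-conditioned image attention feeding the linguistic stream.
        return v_out, w_out
```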

2. Training Tasks and Objectives

ViLBERT is trained on the Conceptual Captions dataset under two training tasks to learn visual grounding.
  • Two pretraining tasks are considered: masked multi-modal modelling and multi-modal alignment prediction.

2.1. Masked Multi-Modal Modelling

  • Approximately 15% of both word and image region inputs are masked, and the model is tasked with reconstructing them given the remaining inputs.
  • Masked image regions have their image features zeroed out 90% of the time and are left unaltered the remaining 10%.
  • Masked text inputs are handled as in BERT.
  • Rather than directly regressing the masked feature values, the model instead predicts a distribution over semantic classes for the corresponding image region. To supervise this, the target distribution for the region is taken as the output of the same pretrained detection model used in feature extraction, and the model is trained to minimize the KL divergence between these two distributions (a minimal sketch of this objective follows the list).
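
A minimal sketch of the masked-region objective, with assumed tensor shapes and helper names (not the paper’s code), might look like the following:

```python
import torch
import torch.nn.functional as F

def mask_regions(region_feats, mask_prob=0.15, zero_prob=0.9):
    """Mask ~15% of regions; zero the features of 90% of the masked ones,
    leaving the remaining 10% unaltered. region_feats: (B, T, d)."""
    B, T, _ = region_feats.shape
    device = region_feats.device
    is_masked = torch.rand(B, T, device=device) < mask_prob
    zero_out = torch.rand(B, T, device=device) < zero_prob
    feats = region_feats.clone()
    feats[is_masked & zero_out] = 0.0
    return feats, is_masked

def masked_region_loss(pred_logits, detector_probs, is_masked):
    """KL divergence between the detector's class distribution (soft target,
    taken from the same pretrained detector used for feature extraction) and
    the model's predicted distribution, averaged over masked regions only.
    pred_logits: (B, T, C); detector_probs: (B, T, C); is_masked: (B, T)."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    kl = F.kl_div(log_pred, detector_probs, reduction="none").sum(dim=-1)
    return kl[is_masked].mean()
```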

2.2. Multi-Modal Alignment Prediction

  • The model is presented with an image-text pair as {IMG, v1, …, vT, CLS, w1, …, wT, SEP} and must predict whether the image and text are aligned, i.e. whether the text describes the image.
  • The outputs hIMG and hCLS are taken as holistic representations of the visual and linguistic inputs. The overall representation is the element-wise product of hIMG and hCLS, and a linear layer is learnt on top of it to make the binary prediction of whether the image and text are aligned.
  • However, the Conceptual Captions dataset only includes aligned image-caption pairs. To generate a negative for an image-caption pair, either the image or the caption is randomly replaced with another (a sketch of this head and the negative sampling follows below).
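
The alignment head itself is simple; below is a hedged sketch (the module name, dimensions, and the projection layers are assumptions for this illustration, not the authors’ exact head):

```python
import torch
import torch.nn as nn

class AlignmentHead(nn.Module):
    """Hypothetical alignment-prediction head: fuse h_IMG and h_CLS by
    element-wise product, then a learned linear layer gives the binary
    aligned / not-aligned prediction."""
    def __init__(self, d_vis=1024, d_txt=768, d_common=1024):
        super().__init__()
        # Projections to a shared width are an assumption of this sketch,
        # added only so the element-wise product is well defined.
        self.proj_img = nn.Linear(d_vis, d_common)
        self.proj_cls = nn.Linear(d_txt, d_common)
        self.classifier = nn.Linear(d_common, 2)

    def forward(self, h_img, h_cls):
        # h_img: (B, d_vis) holistic visual output; h_cls: (B, d_txt) holistic text output.
        fused = self.proj_img(h_img) * self.proj_cls(h_cls)  # element-wise product
        return self.classifier(fused)                        # (B, 2) logits

# Negatives: for an aligned (image, caption) pair, randomly swap in either a
# different image or a different caption to create a misaligned example.
```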

2.3. Training ViLBERT

  • The Conceptual Captions dataset is used for training.
  • For the linguistic stream, the BERT_BASE model is used, which has 12 layers of Transformer blocks, each with a hidden state size of 768 and 12 attention heads.
  • It is pretrained on the BookCorpus and English Wikipedia.
  • For the visual stream, a Faster R-CNN (with a ResNet-101 backbone) pretrained on the Visual Genome dataset is used to extract region features.
  • Regions are selected where the class detection probability exceeds a confidence threshold, keeping between 10 and 36 high-scoring boxes (see the sketch after this list).
  • For each selected region i, vi is defined as the mean-pooled convolutional feature from that region.
  • Transformer and co-attentional blocks in the visual stream have a hidden state size of 1024 and 8 attention heads.
  • 8 TitanX GPUs with a total batch size of 512 are used to train for 10 epochs.
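
A small sketch of the region-selection step described above, assuming generic detector outputs rather than the actual Faster R-CNN API (the 0.2 confidence threshold is an assumed value):

```python
import torch

def select_regions(box_scores, box_feats, conf_thresh=0.2,
                   min_boxes=10, max_boxes=36):
    """Keep boxes whose detection confidence exceeds conf_thresh (assumed value),
    capped between 10 and 36 boxes; each row of box_feats is the mean-pooled
    convolutional feature of that region.
    box_scores: (N,), box_feats: (N, d)."""
    order = torch.argsort(box_scores, descending=True)
    keep = order[box_scores[order] > conf_thresh]
    if keep.numel() < min_boxes:       # too few pass the threshold: back-fill
        keep = order[:min_boxes]
    keep = keep[:max_boxes]            # never keep more than 36 boxes
    return box_feats[keep]
```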

3. Experimental Results

3.1. SOTA Comparisons

Transfer task results for ViLBERT model compared with existing state-of-the-art and sensible architectural ablations
  • Single-Stream: a single BERT architecture that processes both modality inputs through the same set of Transformer blocks, sharing parameters and processing stacks for both visual and linguistic inputs.

ViLBERT improves performance over a single-stream model.

  • The proposed models further improve by between 2% and 13% across tasks when using a ViLBERT model that has been pretrained under the proxy tasks (ViLBERT vs. ViLBERT+).

Pretraining tasks result in improved visiolinguistic representations.

3.2. Effect of Visual Stream Depth

Ablation study of the depth of ViLBERT model with respect to the number of Co-TRM→TRM blocks

VQA and Image Retrieval tasks benefit from greater depth — performance increases monotonically until a layer depth of 6.

3.3. Benefits of Large Training Sets

Transfer task results for ViLBERT as a function of the percentage of the Conceptual Captions dataset used during pre-training

Accuracy grows monotonically as the amount of data increases from 0% to 100%.

ViLBERT, a task-agnostic BERT-style pretraining architecture that extracts features in separate visual and linguistic streams and lets the two streams attend to each other, is designed.

Reference

[2019 NeurIPS] [ViLBERT] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

