Review — FLAVA: A Foundational Language And Vision Alignment Model

Foundation Model for Pure Vision, Pure Language, & Multi-Modal Language And Vision Tasks

Sik-Ho Tsang
8 min read · Mar 21


FLAVA learns strong representations from multimodal data (image-text pairs) and unimodal data (unpaired images and text), and targets a broad range of tasks from three domains (visual recognition, language understanding, and multimodal reasoning) under a common Transformer model architecture.

FLAVA: A Foundational Language And Vision Alignment Model,
FLAVA, by Facebook AI Research (FAIR),
2022 CVPR, Over 110 Citations (Sik-Ho Tsang @ Medium)
Image-Text Foundation Model, Vision Language Model, Visual Language Model, VLM, Transformer, ViT, CLIP


  • Foundation models and VLMs have recently become a very hot topic, partly due to GPT-4. Today, FLAVA, a Foundational Language And Vision Alignment model, is introduced.
  • FLAVA uses a single holistic universal model, as a “foundation”, that targets all modalities at once — a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks.


  1. FLAVA Framework
  2. FLAVA Multimodal Pretraining Objectives
  3. FLAVA Training Details & Datasets
  4. Results

1. FLAVA Framework

An overview of our FLAVA model

1.1. Image Encoder

  • ViT-B/16 is used.
  • Given an input image, it is resized and split into patches, which are then linearly embedded and fed into a Transformer model (along with positional embeddings and an extra image classification token [CLS_I]).
  • The image encoder output is a list of image hidden state vectors {hI}, each corresponding to an image patch, plus an additional hCLS,I for [CLS_I].
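As a rough illustration of the patch splitting described above, the sketch below counts the tokens seen by the image encoder and splits a toy image into flattened patches. The function names are hypothetical, and the 224×224 input resolution used in the usage note is an assumption for illustration (the text above only fixes the 16×16 patch size implied by ViT-B/16).

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens fed to the image encoder: one per patch plus [CLS_I]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1  # +1 for [CLS_I]


def split_into_patches(image, patch_size):
    """Split a square 2D image (list of rows) into flattened patches, row-major."""
    size = len(image)
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            # Flatten one patch_size x patch_size window into a single vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches
```

Under the assumed 224×224 resolution, vit_sequence_length(224, 16) gives 14 × 14 patches plus one [CLS_I] token. In a real ViT, each flattened patch would then go through a learned linear embedding and receive a positional embedding, which this sketch leaves out.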

1.2. Text Encoder

  • ViT-B/16 is used, instead of BERT.
  • Given an input piece of text (e.g., a sentence or a pair of sentences), it is first tokenized and embedded into a list of word vectors following BERT.
  • Then, a Transformer model is applied over the word vectors to encode them into a list of hidden state vectors {hT}, including hCLS,T for the text classification [CLS_T] token.

1.3. Multimodal Encoder

  • Two learned linear projections are applied over each hidden state vector in {hI} and {hT}, and the results are concatenated into a single list with an additional [CLS_M] token added. This list is fed into the multimodal encoder, which allows cross-attention between the projected unimodal image and text representations and fuses the two modalities.
  • The output from the multimodal encoder is a list of hidden states {hM}, each corresponding to a unimodal vector from {hI} or {hT} (and a vector hCLS,M for [CLS_M]).
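The assembly of the multimodal encoder input can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the helper names are hypothetical, and placing [CLS_M] at the front of the sequence is an assumption.

```python
def linear(vec, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def build_multimodal_input(h_img, h_txt, proj_img, proj_txt, cls_m):
    """Project each unimodal hidden state, then concatenate with [CLS_M].

    proj_img and proj_txt are (weight, bias) tuples, one per modality.
    Putting cls_m first is an assumption made for this sketch.
    """
    img = [linear(h, *proj_img) for h in h_img]
    txt = [linear(h, *proj_txt) for h in h_txt]
    return [cls_m] + img + txt
```

The resulting list has one entry for [CLS_M] plus one per image and text hidden state, which matches the shape of the output {hM} described above.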

1.4. Downstream Tasks

  • For visual recognition tasks (e.g. ImageNet classification), a classifier head (e.g. a linear layer or a multi-layer perceptron) is applied on top of the unimodal hCLS,I from the image encoder.
  • Similarly, for language understanding and multimodal reasoning tasks, a classifier head is applied on top of hCLS,T from the text encoder or hCLS,M from the multimodal encoder, respectively.

2. FLAVA Multimodal Pretraining Objectives

2.1. Global Contrastive (GC) Loss

The image-text contrastive loss resembles that of CLIP. Given a batch of images and text, the cosine similarities between matched image and text pairs are maximized and those for the unmatched pairs are minimized.

  • This is accomplished by linearly projecting each hCLS,I and hCLS,T into an embedding space, followed by L2-normalization, dot-product, and a softmax loss scaled by temperature.
  • However, large models are often trained with data parallelism across multiple GPUs, where the samples in a batch are split across GPUs.
  • In the open-source CLIP implementation, the gradients of the contrastive loss are back-propagated only to the embeddings on the local GPU.

In this paper, a noticeable performance gain is found by performing full back-propagation across GPUs. That’s why it is called the Global Contrastive (GC) Loss.
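The contrastive loss itself (L2-normalization, dot-product, temperature-scaled softmax) can be sketched on a single device as below. This is a minimal illustration and not the authors' implementation: the temperature value 0.07 and the symmetric averaging of the two directions are assumptions, and the cross-GPU gathering and back-propagation that make the loss "global" are not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Image-text contrastive loss where pair k matches pair k.

    temperature=0.07 is an assumed value for this sketch.
    """
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    n = len(img)
    # Cosine similarities scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2
```

With matched pairs on the diagonal, the loss is small when each image embedding is closest to its own text embedding and large when the pairs are shuffled.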

2.2. Masked Multimodal Modeling (MMM)

Given an image and text input, the input image patches are first tokenized using a pretrained dVAE tokenizer, as in DALL·E, which maps each image patch into an index in a visual codebook similar to a word dictionary.

  • Then, a subset of image patches, chosen as rectangular block regions following BEiT, is replaced with a special [MASK] token.
  • Likewise, 15% of the text tokens are replaced with [MASK], following BERT.

Then, from the multimodal encoder’s output {hM}, a multilayer perceptron is applied to predict the visual codebook index of the masked image patches, or the word vocabulary index of the masked text tokens.

2.3. Image-Text Matching (ITM)

An image-text matching loss LITM is applied following prior vision-and-language pretraining literature, e.g.: [16], ViLBERT, LXMERT.

  • During pretraining, a batch of samples is fed including both matched and unmatched image-text pairs.

Then, on top of hCLS,M from the multimodal encoder, a classifier is applied to decide if an input image and text match each other.
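One way to build a batch with both matched and unmatched pairs is sketched below. The negative sampling strategy (swapping in the text of another pair with probability 0.5) and all names are assumptions for illustration; the text above does not specify how unmatched pairs are drawn.

```python
import random


def make_itm_batch(pairs, negative_ratio=0.5, seed=0):
    """Turn matched (image, text) pairs into ITM examples (image, text, label).

    label is 1 for a matched pair and 0 for an unmatched one. The
    negative_ratio value and the sampling scheme are assumptions.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to create negatives")
    rng = random.Random(seed)
    batch = []
    for idx, (image, text) in enumerate(pairs):
        if rng.random() < negative_ratio:
            # Pair the image with the text of a different sample.
            other = rng.choice([j for j in range(len(pairs)) if j != idx])
            batch.append((image, pairs[other][1], 0))
        else:
            batch.append((image, text, 1))
    return batch
```

The binary label is what the classifier on top of hCLS,M would be trained to predict.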

2.4. Unimodal Pretraining Objectives

  • The vast majority of datasets (such as ImageNet for images and CCNews for text) are unimodal without paired data from the other modality.
  • In this work, knowledge and information are introduced from these unimodal datasets through 1) pretraining the image encoder and text encoder on unimodal datasets; 2) pretraining the entire FLAVA model jointly on both unimodal and multimodal datasets; or 3) a combination of both by starting from pretrained encoders and then jointly training.

2.4.1. Masked Image Modeling (MIM)

  • A set of image patches is masked following the rectangular block-wise masking in BEiT and reconstructed from other image patches.

The input image is first tokenized using a pretrained dVAE tokenizer as in DALL·E, and then a classifier is applied on the image encoder outputs {hI} to predict the dVAE tokens of the masked patches.
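The rectangular block-wise masking can be approximated as below. This is a simplified sketch rather than BEiT's exact procedure: the 14×14 patch grid, the 40% target ratio, and the block side limits are all assumed values, and the function name is hypothetical.

```python
import random


def blockwise_mask(grid=14, target_ratio=0.4, min_side=2, max_side=6, seed=0):
    """Return sorted indices of masked patches on a grid x grid layout.

    Rectangular blocks are sampled until at least target_ratio of the
    patches are masked, so the final count can slightly exceed the target.
    All default values are assumptions for this sketch.
    """
    rng = random.Random(seed)
    target = int(grid * grid * target_ratio)
    masked = set()
    while len(masked) < target:
        height = rng.randint(min_side, max_side)
        width = rng.randint(min_side, max_side)
        top = rng.randint(0, grid - height)
        left = rng.randint(0, grid - width)
        for r in range(top, top + height):
            for c in range(left, left + width):
                masked.add(r * grid + c)  # row-major patch index
    return sorted(masked)
```

The classifier on {hI} would then be trained to predict the dVAE token at each returned index.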

2.4.2. Masked Language Modeling (MLM)

  • A masked language modeling loss, as in BERT, is applied on top of the text encoder to pretrain on stand-alone text datasets.

A fraction (15%) of the text tokens are masked in the input, and reconstructed from the other tokens using a classifier over the unimodal text hidden states output {hT}.
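The 15% text masking can be sketched as follows. This is a simplification under stated assumptions: a fixed number of positions is sampled rather than masking each token independently, every selected token is replaced by [MASK], and the function name is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked position
    to its original token (the prediction target).
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets
```

The classifier over {hT} is trained only on the positions listed in the returned targets.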

3. FLAVA Training Details & Datasets

3.1. Encoder Initialization from Unimodal Pretraining

  • Three sources of data are used for pretraining: unimodal image data (ImageNet-1K), unimodal text data (CCNews and BookCorpus), and multimodal image-text paired data.

The text encoder is first pretrained with the MLM objective on the unimodal text dataset.

The image encoder is pretrained on unpaired image datasets with either the MIM or the DINO objective. The latter is found to work quite well.

Then, the whole FLAVA model is initialized with the two unimodally pretrained encoders or, when it is trained from scratch, initialized randomly. The multimodal encoder is always initialized randomly for pretraining.

3.2. Joint Unimodal and Multimodal Training

  • After unimodal pretraining, training of the entire FLAVA model continues jointly on the three types of datasets with round-robin sampling.

In each training iteration, one of the datasets is chosen according to a sampling ratio, i.e. unimodal MIM on image data, unimodal MLM on text data, or the multimodal losses (contrastive, MMM, and ITM) on image-text pairs.
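Choosing one dataset per iteration according to a sampling ratio can be sketched as below. The task names and the ratios in the usage note are hypothetical; the text above does not state the actual values.

```python
import random


def sample_task_schedule(ratios, num_steps, seed=0):
    """Pick one task per training iteration, weighted by its sampling ratio.

    ratios maps a task name to a non-negative weight. Names and weights
    are placeholders for this sketch.
    """
    rng = random.Random(seed)
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=num_steps)
```

For example, sample_task_schedule({"mim": 1, "mlm": 1, "multimodal": 2}, 1000) would select the multimodal losses about half of the time. In a training loop, each entry decides which dataset the next batch is drawn from and which losses are computed.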

3.3. Public Multimodal Datasets (PMD)

Representative examples from various subsets of the pretraining dataset
Public Multimodal Datasets (PMD) corpus used in FLAVA multimodal pretraining, which consists of publicly available datasets with a total size of 70M image and text pairs.

The total count of text-image pairs is 70M, including 68M unique images, and the average caption length is 12.1 words.

  • For the YFCC100M dataset, the image-text data is filtered by discarding non-English captions and only keeping captions that contain more than two words.
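The YFCC100M caption filter described above can be sketched as a simple predicate. The language check is passed in as a function because the text does not say how non-English captions are detected. The looks_english helper below is a crude stand-in for illustration only, not the method used in the paper.

```python
def keep_caption(caption, is_english):
    """Keep a caption only if it is English and has more than two words."""
    return is_english(caption) and len(caption.split()) > 2


def looks_english(caption):
    """Crude placeholder check: ASCII-only text with at least one letter.

    This is an assumption for the sketch, not a real language detector.
    """
    return caption.isascii() and any(ch.isalpha() for ch in caption)
```

A real pipeline would likely swap looks_english for a proper language identification model while keeping the same word-count rule.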
Comparison of recent models in different modalities. CV&L and MV&L stand for cross-modal and multi-modal vision-and-language. * means the modality is partially targeted.
  • SimVLM, ALIGN, and CLIP have demonstrated impressive gains by training Transformer-based models on giant private paired image-and-text corpora, as opposed to the previous vision-and-language state-of-the-art such as VinVL, ALBEF, and ViLT [54], which were trained on smaller public paired datasets.

FLAVA gathers 70M publicly available image-text pairs for pretraining.

4. Results

4.1. Ablation Study

Full FLAVA pretraining (row 6) achieves the best average scores on vision, language, and multimodal tasks compared to ablations.
  • For vision, 22 common vision tasks are evaluated.
  • For NLP, 8 tasks from the GLUE are evaluated.
  • For multimodal, VQAv2 [39], SNLI-VE [114], Hateful Memes [53], as well as Flickr30K [81] and COCO [66] image and text retrieval, are evaluated.
  • FLAVAC: trained with only the image-text contrastive loss.
  • FLAVAMM: trained only on multimodal data.
  • FLAVA w/o unimodal init: trained without unimodal initialization.

Full FLAVA in row 6 outperforms all other settings in average performance over NLP, vision, and multimodal tasks.

Table 4. Comparing our full FLAVA pretraining with other settings, where FLAVA gets the highest macro average score.

Effective global contrastive loss in FLAVA

  • A CLIP model is trained on the same PMD data with the same ViT-B/16 image encoder as a baseline, denoted as CLIP in column 7.

Comparing column 3 vs 7, FLAVAC outperforms it in all vision, language, and multimodal domains.

  • This can be attributed mainly to two factors: different model details of FLAVA (e.g. 768 text encoder hidden size instead of 512) and performing global back-propagation across all GPU workers (GC Loss).

MMM and ITM objectives benefit multimodal tasks

  • FLAVAMM is pretrained using LMMM and LITM along with LGC.

Compared to FLAVAC with only the contrastive loss LGC (column 3 vs 4), this setting improves multimodal average score by +2.86%, NLP average score by +9%, and also vision average score slightly by +0.3%.

Joint unimodal & multimodal pretraining helps NLP

  • The FLAVAMM losses are applied on PMD data batches, the MIM loss on IN-1k unimodal image data, and the MLM loss on CCNews text data, as shown in Table 4 column 5.

Comparing it to FLAVAMM in column 4 with only multimodal pretraining, this joint unimodal and multimodal pretraining improves the NLP average score from 74.22 to 75.55.

  • This suggests that the additional text data from CCNews and BookCorpus benefits language understanding through the MLM objective.

Better image and text encoders via unimodal pretraining

  • The vision encoder is initialized from an off-the-shelf DINO model pretrained on ImageNet-1k.
  • For the language encoder, a ViT model is pretrained with MLM loss on CCNews and BookCorpus datasets.

Comparing column 5 vs 6, the pretrained encoders boost the performance of FLAVA on all tasks.

4.2. SOTA Comparison

Comparing FLAVA (Table 4 column 6) with previous models on multimodal tasks, language tasks, and ImageNet linear evaluation.

The full FLAVA largely outperforms previous multimodal approaches pretrained on public data (row 4 to 11) on both language and multimodal tasks and approaches the well-established BERT model on several GLUE tasks.

Compared to CLIP, FLAVA is trained on just 70M image-text pairs, about 6× fewer than the 400M used by CLIP.

The performance difference (relative, in %) between FLAVA and the released CLIP-ViT-B/16 (400M) on vision, language and multimodal tasks (positive means FLAVA is better).
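The relative difference plotted in this comparison can be computed as below. The exact formula is an assumption based on the caption ("relative, in %", positive when FLAVA is better), and the function name is hypothetical.

```python
def relative_difference_percent(flava_score, clip_score):
    """Relative gap of FLAVA over CLIP in percent; positive means FLAVA is better.

    Assumed definition: (FLAVA - CLIP) / CLIP * 100.
    """
    if clip_score == 0:
        raise ValueError("clip_score must be non-zero")
    return (flava_score - clip_score) / clip_score * 100.0
```

Under this definition, a task where FLAVA scores lower than CLIP yields a negative value.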

FLAVA works significantly better on language and multimodal tasks while slightly worse than CLIP on some vision-only tasks.

  • In addition, FLAVA outperforms the variant of the CLIP model pretrained only on the PMD dataset (Table 5 row 10).


