Review — FILIP: Fine-grained Interactive Language-Image Pre-Training

FILIP, Token-wise Cross-Modal Late Interaction

Sik-Ho Tsang
Dec 14, 2022

FILIP: Fine-grained Interactive Language-Image Pre-Training,
FILIP, by Huawei Noah’s Ark Lab, Hong Kong University of Science and Technology, and Sun Yat-sen University
2022 ICLR, Over 80 Citations (Sik-Ho Tsang @ Medium)
Vision Language Model, VLM

  • Instead of modeling cross-modal interaction only through global features of the entire image and text sequence, as in CLIP and ALIGN, this paper takes the fine-grained interaction between image patches and textual tokens into account.
  • Fine-grained Interactive Language-Image Pre-training (FILIP) is proposed to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective.
  • Image and text representations can still be precomputed offline, which keeps inference efficient.


  1. FILIP
  2. Other Details
  3. Results


1. FILIP

Overall architecture of FILIP, a dual-stream model with Transformer-based image and text encoders.

1.1. Model Architecture

  • FILIP is a dual-stream model with Transformer-based image and text encoders.
  • For the visual modality, the image encoder is a Vision Transformer, ViT, which takes the concatenation of an extra [CLS] token embedding and linearly projected image patches as input.
  • For the textual modality, after the word embedding layer, the token embeddings are fed into a modified decoder-only Transformer.
  • On top of the image and text encoders, the representations of textual tokens and visual tokens are linearly projected to the multi-modal common space, separately L2-normalized, and then interact with each other.
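The projection and normalization step above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names, the toy token values and the weight matrix `w_image` are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math

def linear_project(tokens, weight):
    """Project each d_in-dim token with a (d_in x d_out) weight matrix."""
    d_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_out)]
        for tok in tokens
    ]

def l2_normalize(tokens, eps=1e-12):
    """Scale every token vector to unit Euclidean length."""
    normalized = []
    for tok in tokens:
        norm = math.sqrt(sum(v * v for v in tok))
        normalized.append([v / max(norm, eps) for v in tok])
    return normalized

# Toy example: 3 visual tokens of width 4 projected to a 2-dim common space.
visual_tokens = [[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
w_image = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]  # hypothetical weights
visual_common = l2_normalize(linear_project(visual_tokens, w_image))
print(visual_common)
```

The text tokens would go through the same two steps with their own projection matrix, after which dot products between visual and textual tokens are cosine similarities.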

1.2. Contrastive Learning

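  • $x^I$ and $x^T$ denote an image and a text, respectively.
  • The aim is for the encoded representations $f_\theta(x^I)$ and $g_\phi(x^T)$ to be close if the image and text are related and far apart if not, under a distance metric.
  • In each training batch, b image-text pairs $\{x^I_k, x^T_k\}_{k=1}^{b}$ are sampled. For image $x^I_k$, $x^T_k$ is its positive, while the other texts in the batch are used as in-batch negatives. The image-to-text contrastive loss $L^I_k$ for $x^I_k$ can then be formulated as:
    $$L^I_k(x^I_k, \{x^T_j\}_{j=1}^{b}) = -\frac{1}{b} \log \frac{\exp(s^I_{k,k})}{\sum_{j} \exp(s^I_{k,j})}$$
  • where $s^I_{k,j}$ denotes the similarity of the k-th image to the j-th text.
  • Similarly, the text-to-image contrastive loss for $x^T_k$ is:
    $$L^T_k(x^T_k, \{x^I_j\}_{j=1}^{b}) = -\frac{1}{b} \log \frac{\exp(s^T_{k,k})}{\sum_{j} \exp(s^T_{j,k})}$$
    where $s^T_{j,k}$ denotes the similarity of the k-th text to the j-th image.
  • The total loss of this mini-batch can be represented by:
    $$L = \frac{1}{2} \sum_{k=1}^{b} (L^I_k + L^T_k)$$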
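As a rough sketch of the losses described above, the snippet below computes the symmetric in-batch contrastive loss from two similarity matrices. The function names and the indexing convention (row = image, column = text for both matrices) are assumptions for illustration, and any temperature scaling is assumed to have been applied to the similarities beforehand.

```python
import math

def log_softmax_at(scores, k):
    """log( exp(scores[k]) / sum_j exp(scores[j]) ), computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[k] - log_z

def contrastive_loss(sim_i2t, sim_t2i):
    """sim_i2t[i][j]: similarity of image i to text j.
    sim_t2i[i][j]: similarity of text j to image i.
    Returns 0.5 * sum_k (L_k^I + L_k^T), each L_k scaled by 1/b."""
    b = len(sim_i2t)
    total = 0.0
    for k in range(b):
        row = sim_i2t[k]                         # image k against every text
        col = [sim_t2i[j][k] for j in range(b)]  # text k against every image
        loss_image = -log_softmax_at(row, k) / b
        loss_text = -log_softmax_at(col, k) / b
        total += 0.5 * (loss_image + loss_text)
    return total

# Matched pairs on the diagonal give a much lower loss than mismatched ones.
matched = [[5.0, 0.0], [0.0, 5.0]]
mismatched = [[0.0, 5.0], [5.0, 0.0]]
print(contrastive_loss(matched, matched), contrastive_loss(mismatched, mismatched))
```

With all similarities equal, the loss for a batch of two reduces to log 2, which is a convenient sanity check.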

1.3. Cross-Modal Late Interaction

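  • CLIP and ALIGN simply encode each image and text as a single global feature, $f_\theta(x^I_i)$ and $g_\phi(x^T_j)$ respectively, and then compute the two similarities as:
    $$s^I_{i,j} = s^T_{i,j} = f_\theta(x^I_i)^\top g_\phi(x^T_j)$$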

The drawback is that it neglects finer-grained interactions (e.g., word-patch alignment) between the two modalities.

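  • Denote $n_1$ and $n_2$ as the numbers of (non-padded) tokens of the i-th image and the j-th text, respectively. The corresponding encoded features are $f_\theta(x^I_i)$ and $g_\phi(x^T_j)$, of sizes $n_1 \times d$ and $n_2 \times d$.
  • For the k-th visual token, its similarities with all textual tokens of $x^T_j$ are computed, and the largest one,
    $$\max_{0 \le r < n_2} [f_\theta(x^I_i)]_k^\top [g_\phi(x^T_j)]_r$$
    is used as its token-wise maximum similarity with $x^T_j$.
  • Then, the average token-wise maximum similarity of all non-padded tokens in the image (resp. text) is used as the similarity of an image to a text (resp. a text to an image). The similarity of the i-th image to the j-th text can thus be formulated as:
    $$s^I_{i,j}(x^I_i, x^T_j) = \frac{1}{n_1} \sum_{k=1}^{n_1} [f_\theta(x^I_i)]_k^\top [g_\phi(x^T_j)]_{m^I_k}$$
  • where:
    $$m^I_k = \arg\max_{0 \le r < n_2} [f_\theta(x^I_i)]_k^\top [g_\phi(x^T_j)]_r$$
  • Similarly, the similarity of the j-th text to the i-th image is:
    $$s^T_{i,j}(x^I_i, x^T_j) = \frac{1}{n_2} \sum_{k=1}^{n_2} [f_\theta(x^I_i)]_{m^T_k}^\top [g_\phi(x^T_j)]_k$$
  • where:
    $$m^T_k = \arg\max_{0 \le r < n_1} [f_\theta(x^I_i)]_r^\top [g_\phi(x^T_j)]_k$$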

Intuitively, the token-wise maximum similarity means that for each image patch, we find its most similar textual token. Similarly, for each textual token, we also find its closest image patch.

By applying this to the similarity calculation for contrastive loss, the dual-stream model learns fine-grained alignment between image patches and textual tokens.
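The difference between the global similarity of CLIP/ALIGN and the token-wise maximum similarity can be made concrete with a small sketch. The function names and the toy token vectors are illustrative assumptions; the tokens are taken to be already projected and L2-normalized.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def global_similarity(image_feature, text_feature):
    """CLIP/ALIGN-style: one dot product between two global features."""
    return dot(image_feature, text_feature)

def image_to_text_similarity(image_tokens, text_tokens):
    """Average, over image tokens, of the best match among the text tokens."""
    best = [max(dot(p, t) for t in text_tokens) for p in image_tokens]
    return sum(best) / len(best)

def text_to_image_similarity(image_tokens, text_tokens):
    """Average, over text tokens, of the best match among the image tokens."""
    best = [max(dot(p, t) for p in image_tokens) for t in text_tokens]
    return sum(best) / len(best)

# Toy example: 3 image patches and 2 text tokens in a 2-dim common space.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
text_tokens = [[1.0, 0.0], [0.6, 0.8]]

s_i2t = image_to_text_similarity(image_tokens, text_tokens)
s_t2i = text_to_image_similarity(image_tokens, text_tokens)
s_global = global_similarity(image_tokens[0], text_tokens[0])
print(s_i2t, s_t2i, s_global)
```

In this toy case the two late-interaction similarities differ, since each direction averages over a different set of tokens, whereas the global similarity is the same in both directions.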

2. Other Details

2.1. Training Efficiency

  • The cross-modal late interaction can make FILIP inefficient in terms of communication, memory and computation, so three measures are taken.
  • First, the embedding size is reduced to 256.
  • Second, the precision of the last-layer features of both modalities is reduced from fp32 to fp16 before node communication.
  • Third, only the 25% of tokens with the highest token-wise maximum similarity score among all texts (resp. images) in the same local worker are selected before node communication, based on the intuition that each sample can be represented by a few of its most representative tokens.
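A minimal sketch of the last two measures is given below, under stated assumptions: `to_fp16` emulates half-precision rounding with the standard `struct` module, `select_top_tokens` is a hypothetical helper, and rounding the number of kept tokens up to at least one is a choice made here, not something the source specifies.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_top_tokens(tokens, other_modality_tokens, keep_ratio=0.25):
    """Keep the tokens whose best match in the other modality scores highest.
    other_modality_tokens: flat list of token vectors from the local worker."""
    scores = [max(dot(t, o) for o in other_modality_tokens) for t in tokens]
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original token order
    return [tokens[i] for i in kept], kept

# Toy example: 8 image tokens scored against the text tokens on this worker.
image_tokens = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.1], [0.0, 1.0],
                [0.2, 0.2], [0.3, 0.0], [0.0, 0.4], [0.5, 0.5]]
local_text_tokens = [[1.0, 0.0], [0.0, 1.0]]

selected, kept_indices = select_top_tokens(image_tokens, local_text_tokens)
selected_fp16 = [[to_fp16(v) for v in tok] for tok in selected]
print(kept_indices, selected_fp16)
```

Only the selected, reduced-precision tokens would then need to be exchanged between workers, which is where the savings in communication and memory would come from.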

2.2. Prompt Ensemble and Templates

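  • Prompt templates are used to augment the original label for some downstream tasks, e.g.: “a photo of a {label}.”.
  • Suppose there are C prompt templates. Each label is then augmented to C different texts $x^T_1, x^T_2, \ldots, x^T_C$, and the similarity between an image $x^I$ and this label is computed as the average of the image-to-text similarities over the C texts:
    $$\frac{1}{C} \sum_{c=1}^{C} s^I(x^I, x^T_c)$$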
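The prompt ensemble can be sketched as follows, assuming the per-template similarities are simply averaged. The second template, the helper names and the stand-in token encodings are hypothetical; a real setup would obtain the token features from the text encoder.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def image_to_text_similarity(image_tokens, text_tokens):
    """Average token-wise maximum similarity of an image to a text."""
    best = [max(dot(p, t) for t in text_tokens) for p in image_tokens]
    return sum(best) / len(best)

def build_prompts(label, templates):
    """Fill the class label into every prompt template."""
    return [template.format(label=label) for template in templates]

def ensemble_similarity(image_tokens, prompt_token_sets):
    """Average the image-to-text similarity over all prompts of one label."""
    scores = [image_to_text_similarity(image_tokens, toks) for toks in prompt_token_sets]
    return sum(scores) / len(scores)

templates = ["a photo of a {label}.", "a picture of a {label}."]  # second is hypothetical
prompts = build_prompts("balloon", templates)

image_tokens = [[1.0, 0.0], [0.0, 1.0]]
prompt_token_sets = [
    [[1.0, 0.0], [0.0, 1.0]],  # stand-in encoding of prompts[0]
    [[1.0, 0.0]],              # stand-in encoding of prompts[1]
]
score = ensemble_similarity(image_tokens, prompt_token_sets)
print(prompts, score)
```

For zero-shot classification, this score would be computed for every candidate label and the label with the highest score would be predicted.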

2.3. Image and Text Augmentation

  • AutoAugment is used for image augmentation.
  • Back translation is used for text augmentation. Specifically, the texts are first translated to a target language and then translated back to the source language. German and Russian are chosen as the target languages, which yields two extra texts for each image-text pair. During training, the text of each image-text pair is randomly sampled from the three candidate texts.
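The sampling step of the text augmentation can be sketched as below. The captions are made-up examples and `sample_caption` is a hypothetical helper; the back-translated texts are assumed to have been produced offline by a translation system that is not shown here.

```python
import random

def sample_caption(original, back_translations, rng=random):
    """Pick one of the candidate texts (original + back-translations) uniformly."""
    candidates = [original] + list(back_translations)
    return rng.choice(candidates)

# Hypothetical captions: the original plus versions back-translated
# through German and Russian.
original = "a dog runs across the beach"
back_translated = [
    "a dog is running over the beach",   # via German (made-up example)
    "the dog runs along the beach",      # via Russian (made-up example)
]

rng = random.Random(0)  # seeded only to make the demo repeatable
print(sample_caption(original, back_translated, rng))
```

Sampling per training step, rather than fixing one text per pair, means each image can be seen with all three candidate texts over the course of training.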

2.4. Pretraining Dataset

  • CLIP and ALIGN construct datasets with 400M and 1800M image-text pairs, respectively.
  • A new dataset, FILIP300M, is constructed. It consists of 300M image-text pairs collected from the Internet.
  • In addition, 3 public datasets are used: Conceptual Captions 3M (CC3M), Conceptual 12M (CC12M), and Yahoo Flickr Creative Commons 100M (YFCC100M). In total, about 340M image-text pairs are used for pre-training.

Despite using a smaller training dataset than CLIP and ALIGN, FILIP models still outperform them in most downstream tasks, as shown below.

3. Results

3.1. Zero-Shot Classification

Top-1 accuracy(%) of zero-shot image classification on 12 datasets.

Despite using less training data (340M vs. 400M), both FILIPbase and FILIPlarge considerably outperform their CLIP counterparts in terms of average top-1 accuracy over 12 datasets, i.e., achieving absolute improvements of 5.6% and 3.0%, respectively.

FILIP focuses more on the target object by directly aligning the image patches of the target object with the textual tokens corresponding to the class label.

3.2. Linear Probe Classification

Top-1 accuracy(%) of linear probe on image classification on 12 datasets.
  • For linear probe, FILIP again outperforms CLIP by 1.2 to 1.8 percentage points on average.

3.3. Zero-Shot & Fine-Tuned Image-Text Retrieval

Results of zero-shot image-text retrieval on Flickr30K and MSCOCO datasets.
Results of fine-tuned image-text retrieval on Flickr30K and MSCOCO datasets.

FILIP achieves state-of-the-art performance under all metrics on both the Flickr30K and MSCOCO datasets, except for zero-shot text-to-image retrieval on Flickr30K.

3.4. Ablation Study

Ablation study of different components on pre-training subset of YFCC100M.

As can be seen, all three components are beneficial for both tasks. Despite the simple design, cross-modal late interaction brings significant performance improvements over the baseline.

Efficiency study of the cross-modal late interaction.
  • Different settings are tried.

* denotes the final setting used in the other experiments. It is not the most accurate setting, but it requires much less memory and much less training time.

3.5. Visualization

Visualizations of word-patch alignment for 4 classes of the ImageNet dataset, using “a photo of a {label}.” as the prompt.
  • (a) Balloon (5): Take class “balloon” as an example. There are 8 tokens in the tokenized textual sequence “[BOS] a photo of a balloon. [EOS]”, and the location index of the class label “balloon” is “5”.

As can be seen, FILIP exhibits a finer-grained understanding of an image.

  • (d) Electric locomotive (5, 6): there are two key components crucial to correctly classifying the image, i.e., “electric” and “locomotive”, whose corresponding textual token indices are “5” and “6”, respectively.

As can be seen, the image patches matching these two key components are each correctly classified. CLIP, on the other hand, cannot correctly align image patches with the corresponding textual tokens.
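The indices used in these visualizations can be reproduced with a small sketch. The whitespace-level token list mirrors the 8-token sequence quoted above but is only a stand-in for the real tokenizer, and the one-hot text features and the two patch vectors are toy assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def align_patches(patch_tokens, text_tokens):
    """For each image patch, return the index of its most similar text token."""
    return [
        max(range(len(text_tokens)), key=lambda r: dot(p, text_tokens[r]))
        for p in patch_tokens
    ]

# Stand-in tokenization of the prompt for class "balloon".
tokens = ["[BOS]", "a", "photo", "of", "a", "balloon", ".", "[EOS]"]
label_index = tokens.index("balloon")

# Toy text features: one one-hot vector per text token (illustration only).
text_features = [[1.0 if i == r else 0.0 for i in range(len(tokens))]
                 for r in range(len(tokens))]

# Two toy patches: one resembling the "balloon" token, one resembling "a".
patches = [
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0],
]
patch_labels = align_patches(patches, text_features)
on_object = [idx == label_index for idx in patch_labels]
print(label_index, patch_labels, on_object)
```

A patch is counted as lying on the target object when its predicted index equals the class label index, which is what the numbers overlaid on the image patches in the figure indicate.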

The authors mention several directions for future work: a more advanced image encoder and a well-designed interaction layer to boost performance, masked language/image losses, and a generic, unified interface for other tasks.


[2022 ICLR] [FILIP]
FILIP: Fine-grained Interactive Language-Image Pre-Training

3.1. Visual/Vision/Video Language Model (VLM)

2018 [Conceptual Captions] 2019 [VideoBERT] [VisualBERT] [LXMERT] [ViLBERT] 2020 [ConVIRT] [VL-BERT] [OSCAR] 2021 [CLIP] [VinVL] [ALIGN] 2022 [FILIP]
