Brief Review — CoCa: Contrastive Captioners are Image-Text Foundation Models

2.1B Model Pretrained Using Contrastive Captioner (CoCa)

Sik-Ho Tsang
3 min read · Feb 8, 2024

CoCa: Contrastive Captioners are Image-Text Foundation Models
Contrastive Captioner (CoCa), by Google Research
2023 TMLR, Over 720 Citations (Sik-Ho Tsang @ Medium)

Visual/Vision/Video Language Model (VLM)

  • Contrastive Captioner (CoCa) is proposed, which is a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss.
  • CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations.

Outline

  1. Contrastive Captioner (CoCa) Pretraining
  2. Results

1. Contrastive Captioner (CoCa) Pretraining

1.1. Pretraining Dataset

  • The JFT-3B dataset (as used in ViT-G), with label names as the paired texts, and the ALIGN dataset with noisy alt-texts are used for pretraining.

1.2. Dual-Encoder Contrastive Learning

  • Compared to pretraining with single-encoder classification, the dual-encoder approach exploits noisy web-scale text descriptions and introduces a learnable text tower to encode free-form texts. The two encoders are jointly optimized by contrasting the paired text against others in the sampled batch:
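  • Written out below for reference is the standard two-way contrastive (InfoNCE) objective this describes; the notation here is the common one (x_i and y_i are the normalized image and text embeddings of the i-th pair, N the batch size, σ a learnable temperature), so see the paper for the exact formulation.

$$
\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\left(
\sum_{i=1}^{N}\log\frac{\exp\!\left(x_i^{\top} y_i/\sigma\right)}{\sum_{j=1}^{N}\exp\!\left(x_i^{\top} y_j/\sigma\right)}
\;+\;
\sum_{i=1}^{N}\log\frac{\exp\!\left(y_i^{\top} x_i/\sigma\right)}{\sum_{j=1}^{N}\exp\!\left(y_i^{\top} x_j/\sigma\right)}
\right)
$$

  • The first sum is the image-to-text direction and the second the text-to-image direction.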

1.3. Encoder-Decoder Captioning

  • While the dual-encoder approach encodes the text as a whole, the generative approach (a.k.a. captioner) aims for detailed granularity and requires the model to predict the exact tokenized text y autoregressively, maximizing the conditional likelihood of the paired text y under the forward autoregressive factorization:
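  • For reference, this is the usual autoregressive language-modeling loss over the caption tokens (generic notation: y_t is the t-th text token, y_{<t} the preceding tokens, x the image, T the caption length, θ the model parameters):

$$
\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T}\log P_{\theta}\!\left(y_t \mid y_{<t},\, x\right)
$$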

1.4. Proposed Contrastive Captioners Pretraining

CoCa Details
  • CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the rest of the decoder layers, cross-attending to the image encoder for multimodal image-text representations.
  • As a result, the CoCa decoder simultaneously produces both unimodal and multimodal text representations, allowing the contrastive and generative objectives to be applied together; the combined objective and a decoder sketch are given after this list.
  • A single pooled image embedding helps visual recognition tasks as a global representation, while more visual tokens (thus more fine-grained) are beneficial for multimodal understanding tasks which require region-level features.
  • The CoCa model is pretrained at an image resolution of 288×288 with a patch size of 18×18, resulting in (288/18)² = 16×16 = 256 image tokens.
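  • The two losses are combined with weighting hyper-parameters into the objective referred to in the list above:

$$
\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}}\,\mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}}\,\mathcal{L}_{\mathrm{Cap}}
$$

  • To make the decoupled decoder concrete, here is a minimal PyTorch-style sketch (not the official implementation; class, layer, and argument names are illustrative): the first half of the layers uses causal self-attention only, and the second half additionally cross-attends to the image tokens.

```python
import torch
import torch.nn as nn

class DecoupledDecoderSketch(nn.Module):
    """Illustrative sketch of CoCa's decoupled text decoder (not the official code)."""

    def __init__(self, dim=768, n_layers=12, n_heads=12):
        super().__init__()
        half = n_layers // 2
        # Unimodal half: causal self-attention only (no image information).
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(half)]
        )
        # Multimodal half: causal self-attention + cross-attention to image tokens.
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
             for _ in range(half)]
        )

    def forward(self, text_tokens, image_tokens):
        # Causal mask: each text position attends only to earlier positions.
        T = text_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)

        h = text_tokens
        for layer in self.unimodal:
            h = layer(h, src_mask=causal)
        unimodal_text = h    # e.g. its [CLS]-style token feeds the contrastive loss

        for layer in self.multimodal:
            h = layer(h, memory=image_tokens, tgt_mask=causal)
        multimodal_text = h  # fed to the LM head for the captioning loss
        return unimodal_text, multimodal_text

# Example: 4 captions of 64 tokens with 256 image tokens (288/18 = 16 per side).
txt = torch.randn(4, 64, 768)
img = torch.randn(4, 256, 768)
uni, multi = DecoupledDecoderSketch()(txt, img)
```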

1.5. Model Variants

CoCa Variants
  • The largest CoCa model (“CoCa” in short) follows the ViT-giant setup in ViT-G, with 1B parameters in the image encoder and 2.1B parameters in total including the text decoder.
  • Two smaller variants, “CoCa-Base” and “CoCa-Large”, are also explored.

2. Results

Overall Results
  • The core tasks of three categories are examined: (1) visual recognition, (2) crossmodal alignment, and (3) image captioning and multimodal understanding capabilities.
  • The above figure summarizes the performance on key benchmarks of CoCa compared to other dual-encoder and encoder-decoder foundation models and state-of-the-art task-specialized methods.

CoCa sets new state-of-the-art results on tasks of all three categories with a single pretrained checkpoint.

  • (If interested, please read the paper directly for more details.)

