Brief Review — CoCa: Contrastive Captioners are Image-Text Foundation Models

2.1B Model Pretrained Using Contrastive Captioner (CoCa)

Sik-Ho Tsang
3 min read · Feb 8, 2024

CoCa: Contrastive Captioners are Image-Text Foundation Models
Contrastive Captioner (CoCa), by Google Research
2023 TMLR, Over 720 Citations (Sik-Ho Tsang @ Medium)

Visual/Vision/Video Language Model (VLM)

  • Contrastive Captioner (CoCa) is proposed, which is a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss.
  • CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations.

Outline

  1. Contrastive Captioner (CoCa) Pretraining
  2. Results

1. Contrastive Captioner (CoCa) Pretraining

1.1. Pretraining Dataset

  • The JFT-3B dataset (as used in ViT-G), with label names as the paired texts, and the ALIGN dataset with noisy alt-texts are used for pretraining.

1.2. Dual-Encoder Contrastive Learning

  • Compared to pretraining with single-encoder classification, the dual-encoder approach exploits noisy web-scale text descriptions and introduces a learnable text tower to encode free-form texts. The two encoders are jointly optimized by contrasting the paired text against others in the sampled batch:
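  • Written out below for reference is the standard two-way contrastive (InfoNCE) objective this describes; the notation here is the common one (x_i and y_i are the normalized image and text embeddings of the i-th pair, N the batch size, σ a learnable temperature), so see the paper for the exact formulation.

$$
\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\left(
\sum_{i=1}^{N}\log\frac{\exp\!\left(x_i^{\top} y_i/\sigma\right)}{\sum_{j=1}^{N}\exp\!\left(x_i^{\top} y_j/\sigma\right)}
\;+\;
\sum_{i=1}^{N}\log\frac{\exp\!\left(y_i^{\top} x_i/\sigma\right)}{\sum_{j=1}^{N}\exp\!\left(y_i^{\top} x_j/\sigma\right)}
\right)
$$

  • The first sum is the image-to-text direction and the second the text-to-image direction.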

1.3. Encoder-Decoder Captioning

  • While the dual-encoder approach encodes the text as a whole, the generative approach (a.k.a. captioner) aims for detailed granularity and requires the model to predict the exact tokenized text y autoregressively, maximizing the conditional likelihood of the paired text y under the forward autoregressive factorization:
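  • For reference, this is the usual autoregressive language-modeling loss over the caption tokens (generic notation: y_t is the t-th text token, y_{<t} the preceding tokens, x the image, T the caption length, θ the model parameters):

$$
\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T}\log P_{\theta}\!\left(y_t \mid y_{<t},\, x\right)
$$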

1.4. Proposed Contrastive Captioners Pretraining

CoCa Details
  • CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the rest of the decoder layers, cross-attending to the image encoder for multimodal image-text representations.
  • As a result, the CoCa decoder simultaneously produces both unimodal and multimodal text representations, allowing the contrastive and generative objectives to be applied together; the combined objective and a decoder sketch are given after this list.
  • A single pooled image embedding helps visual recognition tasks as a global representation, while more visual tokens (thus more fine-grained) are beneficial for multimodal understanding tasks which require region-level features.
  • The CoCa model is pretrained at an image resolution of 288×288 with a patch size of 18×18, resulting in (288/18)² = 16×16 = 256 image tokens.
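  • The two losses are combined with weighting hyper-parameters into the objective referred to in the list above:

$$
\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}}\,\mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}}\,\mathcal{L}_{\mathrm{Cap}}
$$

  • To make the decoupled decoder concrete, here is a minimal PyTorch-style sketch (not the official implementation; class, layer, and argument names are illustrative): the first half of the layers uses causal self-attention only, and the second half additionally cross-attends to the image tokens.

```python
import torch
import torch.nn as nn

class DecoupledDecoderSketch(nn.Module):
    """Illustrative sketch of CoCa's decoupled text decoder (not the official code)."""

    def __init__(self, dim=768, n_layers=12, n_heads=12):
        super().__init__()
        half = n_layers // 2
        # Unimodal half: causal self-attention only (no image information).
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(half)]
        )
        # Multimodal half: causal self-attention + cross-attention to image tokens.
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
             for _ in range(half)]
        )

    def forward(self, text_tokens, image_tokens):
        # Causal mask: each text position attends only to earlier positions.
        T = text_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)

        h = text_tokens
        for layer in self.unimodal:
            h = layer(h, src_mask=causal)
        unimodal_text = h    # e.g. its [CLS]-style token feeds the contrastive loss

        for layer in self.multimodal:
            h = layer(h, memory=image_tokens, tgt_mask=causal)
        multimodal_text = h  # fed to the LM head for the captioning loss
        return unimodal_text, multimodal_text

# Example: 4 captions of 64 tokens with 256 image tokens (288/18 = 16 per side).
txt = torch.randn(4, 64, 768)
img = torch.randn(4, 256, 768)
uni, multi = DecoupledDecoderSketch()(txt, img)
```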

1.5. Model Variants

CoCa Variants
  • The largest CoCa model (“CoCa” in short) follows the ViT-giant setup in ViT-G, with 1B parameters in the image encoder and 2.1B parameters in total including the text decoder.
  • Two smaller variants, “CoCa-Base” and “CoCa-Large”, are also explored.

2. Results

Overall Results
  • The core tasks of three categories are examined: (1) visual recognition, (2) crossmodal alignment, and (3) image captioning and multimodal understanding capabilities.
  • The above figure summarizes the performance on key benchmarks of CoCa compared to other dual-encoder and encoder-decoder foundation models and state-of-the-art task-specialized methods.

CoCa sets new state-of-the-art results on tasks of all three categories with a single pretrained checkpoint.

  • (If interested, please read the paper directly for more details.)

