Brief Review — CoCa: Contrastive Captioners are Image-Text Foundation Models
2.1B Model Pretrained Using Contrastive Captioner (CoCa)
3 min read · Feb 8, 2024
CoCa: Contrastive Captioners are Image-Text Foundation Models
Contrastive Captioner (CoCa), by Google Research
2023 TMLR, Over 720 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2023 [GPT-4] [GPT-4V(ision)] [MultiModal-CoT]
==== My Other Paper Readings Are Also Over Here ====
- Contrastive Captioner (CoCa) is proposed: a minimalist design that jointly pretrains an image-text encoder-decoder foundation model with a contrastive loss and a captioning loss.
- CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations.
Outline
- Contrastive Captioner (CoCa) Pretraining
- Results
1. Contrastive Captioner (CoCa) Pretraining
1.1. Pretraining Dataset
- The JFT-3B dataset (as used in ViT-G), with label names treated as the paired texts, and the ALIGN dataset with noisy alt-texts are used for pretraining.
1.2. Dual-Encoder Contrastive Learning
- Compared to pretraining with single-encoder classification, the dual-encoder approach exploits noisy web-scale text descriptions and introduces a learnable text tower to encode free-form texts. The two encoders are jointly optimized by contrasting the paired text against others in the sampled batch:
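- The symmetric image-to-text and text-to-image contrastive loss is:

$$
\mathcal{L}_{\mathrm{Con}} = -\frac{1}{N}\Bigg(\underbrace{\sum_{i=1}^{N}\log\frac{\exp(x_i^{\top} y_i/\sigma)}{\sum_{j=1}^{N}\exp(x_i^{\top} y_j/\sigma)}}_{\text{image-to-text}} + \underbrace{\sum_{i=1}^{N}\log\frac{\exp(y_i^{\top} x_i/\sigma)}{\sum_{j=1}^{N}\exp(y_i^{\top} x_j/\sigma)}}_{\text{text-to-image}}\Bigg),
$$

where $x_i$ and $y_j$ are the normalized embeddings of the image in the i-th pair and the text in the j-th pair, $N$ is the batch size, and $\sigma$ is the temperature.

- A minimal PyTorch sketch of this symmetric contrastive objective (function and variable names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of paired embeddings."""
    # L2-normalize the pooled embeddings from the two towers.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature          # [N, N]
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions (image-to-text and text-to-image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```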
1.3. Encoder-Decoder Captioning
- While the dual-encoder approach encodes the text as a whole, the generative approach (a.k.a. captioner) aims for detailed granularity and requires the model to predict the exact tokenized text y autoregressively, maximizing the conditional likelihood of the paired text y under the forward autoregressive factorization:
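$$
\mathcal{L}_{\mathrm{Cap}} = -\sum_{t=1}^{T}\log P_{\theta}\big(y_t \mid y_{<t}, x\big).
$$

- A minimal PyTorch sketch of this teacher-forced captioning objective (the tensor shapes and the padding id are assumptions for illustration):

```python
import torch.nn.functional as F

def captioning_loss(token_logits, token_ids, pad_id=0):
    """Teacher-forced autoregressive captioning loss.

    token_logits: [N, T, V] decoder outputs; token_ids: [N, T] ground-truth caption tokens.
    """
    # Predict token t from all tokens before it: shift logits left, targets right.
    vocab_size = token_logits.size(-1)
    logits = token_logits[:, :-1, :].reshape(-1, vocab_size)
    targets = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(logits, targets, ignore_index=pad_id)
```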
1.4. Proposed Contrastive Captioners Pretraining
- CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the rest of the decoder layers, cross-attending to the image encoder for multimodal image-text representations.
- As a result, the CoCa decoder simultaneously produces both unimodal and multimodal text representations that allow us to apply both contrastive and generative objectives as:
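- The overall CoCa pretraining objective is a weighted sum of the two losses:

$$
\mathcal{L}_{\mathrm{CoCa}} = \lambda_{\mathrm{Con}}\cdot\mathcal{L}_{\mathrm{Con}} + \lambda_{\mathrm{Cap}}\cdot\mathcal{L}_{\mathrm{Cap}},
$$

where $\lambda_{\mathrm{Con}}$ and $\lambda_{\mathrm{Cap}}$ are loss-weighting hyperparameters.

- The decoupled decoder can be sketched in PyTorch as below; the layer counts, dimensions, and use of stock nn.Transformer* layers are illustrative assumptions, not the paper's configuration:

```python
import torch.nn as nn

class DecoupledTextDecoder(nn.Module):
    """Sketch of CoCa's decoupled text decoder: the first half of the layers use
    causal self-attention only (unimodal), the second half also cross-attend to
    the image tokens (multimodal)."""

    def __init__(self, vocab_size=64000, dim=768, n_uni=6, n_multi=6, heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Unimodal half: no cross-attention; its output is the text-side
        # representation used by the contrastive loss.
        self.unimodal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True)
             for _ in range(n_uni)])
        # Multimodal half: causal self-attention plus cross-attention to the
        # image encoder outputs; its output feeds the captioning loss.
        self.multimodal = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True)
             for _ in range(n_multi)])
        self.to_vocab = nn.Linear(dim, vocab_size)

    def forward(self, text_ids, image_tokens):
        T = text_ids.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(text_ids.device)
        h = self.embed(text_ids)
        for layer in self.unimodal:
            h = layer(h, src_mask=causal)
        unimodal_text = h                           # -> contrastive text embedding
        for layer in self.multimodal:
            h = layer(h, image_tokens, tgt_mask=causal)
        caption_logits = self.to_vocab(h)           # -> captioning loss
        return unimodal_text, caption_logits
```

- In the paper, a learnable [CLS]-style token appended to the input text provides the contrastive text embedding from the unimodal half, so both objectives are computed in a single forward pass.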
- A single pooled image embedding serves visual recognition tasks as a global representation, while a longer sequence of visual tokens (and thus more fine-grained features) benefits multimodal understanding tasks that require region-level information.
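- CoCa therefore uses task-specific attentional pooling: a pooler with a single learnable query produces the global embedding for the contrastive loss, while a pooler with 256 queries provides the fine-grained tokens that the multimodal decoder cross-attends to. A minimal PyTorch sketch of such a pooler (dimensions, head count, and initialization are assumptions):

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """n_query learnable queries attend over the image encoder's output tokens."""

    def __init__(self, dim=768, n_query=1, heads=12):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_query, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens):                 # image_tokens: [N, L, dim]
        q = self.query.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, image_tokens, image_tokens)
        return pooled                                # [N, n_query, dim]

# One query for the contrastive embedding, 256 queries for the captioning decoder.
contrastive_pooler = AttentionalPooler(n_query=1)
generative_pooler = AttentionalPooler(n_query=256)
```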
- The CoCa model is pretrained at an image resolution of 288×288 with a patch size of 18×18, resulting in a total of 256 image tokens.
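- As a check of the token count: 288 / 18 = 16 patches per side, so 16 × 16 = 256 image tokens.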
1.5. Model Variants
- The largest CoCa model (“CoCa” in short) follows the ViT-giant setup in ViT-G, with 1B parameters in the image encoder and 2.1B parameters in total including the text decoder.
- Two smaller variants of “CoCa-Base” and “CoCa-Large” are also explored.
2. Results
- Core tasks in three categories are examined: (1) visual recognition, (2) cross-modal alignment, and (3) image captioning and multimodal understanding.
- The summary figure in the paper compares CoCa on key benchmarks against other dual-encoder and encoder-decoder foundation models and against state-of-the-art task-specialized methods.
CoCa sets new state-of-the-art results on tasks in all three categories with a single pretrained checkpoint.
- (If interested, please read the paper directly for more details.)