Brief Review — Conditional Prompt Learning for Vision-Language Models

Conditional Context Optimization (CoCoOp) is Proposed

May 16, 2024

**CoCoOp Outperforms CoOp and CLIP**

Conditional Prompt Learning for Vision-Language Models
Conditional Context Optimization (CoCoOp), by Nanyang Technological University
2022 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)

Conditional Context Optimization (CoCoOp) is proposed, which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector).
Compared to CoOp’s static prompts, the dynamic prompts adapt to each instance and are thus less sensitive to class shift.

Outline

1. Conditional Context Optimization (CoCoOp)
2. Results

1. Conditional Context Optimization (CoCoOp)

1.1. CLIP

CLIP uses a text encoder to encode texts/prompts into text representations and an image encoder to encode images into image representations.
The whole model is pretrained with contrastive learning on the cosine similarity between the two representations, and the pretrained encoders are then used for downstream tasks.
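
To make the role of cosine similarity concrete, here is a minimal sketch of how a CLIP-style model can turn image and text representations into class probabilities. The feature values, the function names and the temperature of 0.01 are illustrative assumptions, not taken from CLIP's code; real features would come from the two encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_feature, text_features, temperature=0.01):
    """Class probabilities from temperature-scaled cosine similarities.

    temperature=0.01 is an assumed value for this sketch.
    """
    logits = [cosine_similarity(image_feature, t) / temperature
              for t in text_features]
    return softmax(logits)

# Toy vectors standing in for encoder outputs (hypothetical values).
image_feature = [0.9, 0.1, 0.0]
text_features = [
    [1.0, 0.0, 0.0],  # stands in for the encoded prompt of class 0
    [0.0, 1.0, 0.0],  # stands in for the encoded prompt of class 1
]
probs = zero_shot_probs(image_feature, text_features)
predicted_class = probs.index(max(probs))
```

The image is assigned to the class whose prompt representation is closest in direction to the image representation.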

1.2. CoOp

In CoOp, learnable context vectors are prepended to the class token in the prompt. These continuous vectors are learned end-to-end from data, so CoOp avoids manual prompt tuning.

CoOp is a data-efficient approach: the context vectors can be trained with only a few labeled images from a downstream dataset. However, the learned context does not generalize well to unseen classes within the same task.
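
As a rough illustration of the static prompt in CoOp, the sketch below builds one prompt per class by prepending the same M context vectors to each class embedding. The names `init_context` and `build_prompts`, the Gaussian initialization and the toy dimensions are assumptions made for this example; in practice the context vectors would be trained by backpropagation through the frozen text encoder.

```python
import random

def init_context(num_context_tokens, dim, seed=0):
    """Create M context vectors (randomly initialized; an assumption here)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_context_tokens)]

def build_prompts(context, class_embeddings):
    """Build t_i = {v_1, ..., v_M, c_i} for every class i.

    The same context vectors are shared by all classes and all images,
    which is what makes the CoOp prompt static.
    """
    return [context + [c] for c in class_embeddings]

M, dim = 4, 8
context = init_context(M, dim)
# Hypothetical class token embeddings c_i for three classes.
class_embeddings = [[float(k)] * dim for k in range(3)]
prompts = build_prompts(context, class_embeddings)
```

Each prompt has M + 1 token vectors, and only the last one differs between classes.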

1.3. CoCoOp

CoCoOp further learns a lightweight neural network, called Meta-Net, to generate for each input a conditional token (vector), which is then combined with the context vectors.

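In the paper's notation, x is the image feature, the Meta-Net output is π = h_θ(x), and each context vector is shifted as v_m(x) = v_m + π. With t_i(x) = {v_1(x), v_2(x), …, v_M(x), c_i}, the prediction probability is computed as:

$$p(y \mid x) = \frac{\exp\left(\mathrm{sim}(x, g(t_y(x))) / \tau\right)}{\sum_{i=1}^{K} \exp\left(\mathrm{sim}(x, g(t_i(x))) / \tau\right)}$$

where g(·) is the text encoder, sim(·, ·) is the cosine similarity, τ is the temperature and K is the number of classes.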

In this work, the Meta-Net is built with a two-layer bottleneck structure (Linear-ReLU-Linear), with the hidden layer reducing the input dimension by 16×.
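
The Meta-Net described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the class name `MetaNet`, the helper `condition_context`, the weight initialization and the toy dimension of 32 are all assumptions, and "combined" is assumed here to mean adding the conditional token to every context vector.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (an assumed initialization)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

class MetaNet:
    """Linear-ReLU-Linear bottleneck; hidden size is in_dim / 16."""

    def __init__(self, in_dim, out_dim, reduction=16, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = in_dim // reduction
        self.w1, self.b1 = init_linear(in_dim, self.hidden_dim, rng)
        self.w2, self.b2 = init_linear(self.hidden_dim, out_dim, rng)

    def __call__(self, image_feature):
        hidden = relu(linear(image_feature, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)

def condition_context(context, pi):
    """Shift every context vector by the conditional token pi (assumed additive)."""
    return [[v + p for v, p in zip(vec, pi)] for vec in context]

dim = 32  # toy feature size; real encoders use larger dimensions
meta_net = MetaNet(in_dim=dim, out_dim=dim)
image_feature = [0.1 * (i % 5) for i in range(dim)]  # hypothetical image feature
pi = meta_net(image_feature)
context = [[0.0] * dim for _ in range(4)]  # M = 4 context vectors
conditioned = condition_context(context, pi)
```

Because the conditional token depends on the image feature, every image gets its own set of context vectors, in contrast to the single shared set in CoOp.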

Instance-conditional context can generalize better because it shifts the focus away from a specific set of classes.

2. Results

**CLIP, CoOp and CoCoOp Comparisons**

Learning-based models, i.e., CoOp and CoCoOp, are trained using only the base classes while evaluation is conducted on the base and new classes separately to test generalizability.
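
The separate evaluation on base and new classes can be illustrated with a small sketch. The function names, labels and predictions below are made up for illustration, and the sketch only shows the bookkeeping of reporting two accuracies, not the paper's full protocol.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def split_accuracy(predictions, labels, base_classes):
    """Report accuracy separately on base-class and new-class samples."""
    base_idx = [i for i, y in enumerate(labels) if y in base_classes]
    new_idx = [i for i, y in enumerate(labels) if y not in base_classes]
    base_acc = accuracy([predictions[i] for i in base_idx],
                        [labels[i] for i in base_idx])
    new_acc = accuracy([predictions[i] for i in new_idx],
                       [labels[i] for i in new_idx])
    return base_acc, new_acc

# Hypothetical test set: classes 0 and 1 were seen in training, 2 and 3 were not.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
predictions = [0, 0, 1, 1, 2, 3, 3, 2]
base_acc, new_acc = split_accuracy(predictions, labels, base_classes={0, 1})
```

A model that overfits to the base classes shows exactly this pattern: high base accuracy and a noticeably lower accuracy on the new classes.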

Table 1(a): CoCoOp improves the accuracy on unseen classes from CoOp's 63.22% to 71.69%, which largely closes the gap with manual prompts.

**Comprehensive comparisons of CoCoOp and CoOp in the base-to-new generalization setting**

Figure 3(a): Compared with CoOp, CoCoOp increases accuracy on the new classes by more than 10% on 5 out of 11 datasets.
Figure 3(b): Compared with CoOp, CoCoOp's accuracy drops on the base classes on most datasets, but its gains in generalization far outweigh these losses.


Written by Sik-Ho Tsang