Brief Review — Conditional Prompt Learning for Vision-Language Models

Conditional Context Optimization (CoCoOp) is Proposed

Sik-Ho Tsang
May 16, 2024
CoCoOp, Outperforms CoOp and CLIP

Conditional Prompt Learning for Vision-Language Models
Conditional Context Optimization (CoCoOp), by Nanyang Technological University
2022 CVPR, Over 800 Citations


  • Conditional Context Optimization (CoCoOp) is proposed, which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector).
  • Compared to CoOp’s static prompts, the dynamic prompts adapt to each instance and are thus less sensitive to class shift.

Outline

  1. Conditional Context Optimization (CoCoOp)
  2. Results

1. Conditional Context Optimization (CoCoOp)

1.1. CLIP

CLIP
  • CLIP uses a text encoder to encode texts/prompts into text representations and an image encoder to encode images into image representations.
  • CLIP is pretrained with contrastive learning on the cosine similarity between image and text representations; the pretrained encoders are then used for downstream tasks.
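As a rough illustration of cosine-similarity contrastive pretraining, here is a minimal sketch in plain Python. It assumes the commonly described symmetric cross-entropy form, in which image i and text i of a batch form the positive pair and every other pairing is a negative. The function names, the toy 2-d features, and the temperature value tau=0.07 are assumptions made for this example, not details taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    top = max(row)
    lse = top + math.log(sum(math.exp(s - top) for s in row))
    return [s - lse for s in row]

def contrastive_loss(image_feats, text_feats, tau=0.07):
    """Symmetric cross-entropy over temperature-scaled cosine similarities.

    Assumes image_feats[i] and text_feats[i] are a matching pair.
    """
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    n = len(imgs)
    # logits[r][c] = cosine(image r, text c) / tau
    logits = [[sum(a * b for a, b in zip(i, t)) / tau for t in txts] for i in imgs]
    # Image-to-text direction: each row should pick its own column.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each column should pick its own row.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy features standing in for encoder outputs (made-up numbers).
images = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
```

With these toy inputs the loss is small when the texts line up with their images and large when the pairing is swapped, which is the behavior the contrastive objective rewards.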

1.2. CoOp

CoOp
  • In CoOp, learnable context vectors are prepended to the class name. This avoids manual prompt tuning: the context words are modeled as continuous vectors that are learned end-to-end from data.

CoOp is a data-efficient approach: the context vectors can be trained with only a few labeled images from a downstream dataset. However, CoOp generalizes poorly to unseen classes within the same task.
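To make the "static prompt" idea concrete, the sketch below builds CoOp-style prompts in plain Python: one shared set of M context vectors is placed in front of each class embedding. Everything here is illustrative. The helper names, the toy sizes, the made-up class embeddings, and the Gaussian initialization with standard deviation 0.02 are assumptions for the example; a real class name may also span several tokens, which is simplified to a single vector here.

```python
import random

def init_context(num_tokens, dim, seed=0):
    """Create M context vectors (these would be the learnable parameters)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]

def build_prompt(context, class_embedding):
    """Static prompt t_i = {v_1, ..., v_M, c_i}: shared context, then the class."""
    return [list(v) for v in context] + [list(class_embedding)]

M, DIM = 4, 8  # toy sizes chosen for readability
context = init_context(M, DIM)

# Hypothetical class-name embeddings (made-up values).
class_embeddings = {"cat": [1.0] * DIM, "dog": [-1.0] * DIM}
prompts = {name: build_prompt(context, c) for name, c in class_embeddings.items()}
```

The context part of every prompt is identical and does not depend on the input image, so after training it is fixed for all inputs. That is the property CoCoOp changes.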

1.3. CoCoOp

CoCoOp
  • CoCoOp further learns a lightweight neural network, called Meta-Net, to generate for each input a conditional token (vector), which is then combined with the context vectors.
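  • Each context token becomes vm(x) = vm + π, where π = hθ(x) is the Meta-Net output for the image feature x. With ti(x) = {v1(x), v2(x), …, vM(x), ci}, the prediction probability is computed as: p(y|x) = exp(sim(x, g(ty(x)))/τ) / Σi exp(sim(x, g(ti(x)))/τ), where g(·) is the text encoder, sim(·,·) is the cosine similarity, τ is the temperature, and the sum runs over the K classes.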
  • In this work, the Meta-Net is built with a two-layer bottleneck structure (Linear-ReLU-Linear), with the hidden layer reducing the input dimension by 16×.
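The pipeline above can be sketched end to end in plain Python: a Linear-ReLU-Linear Meta-Net with a 16× bottleneck maps the image feature to a shift vector, that shift is added to every context vector, and class probabilities come from a softmax over temperature-scaled cosine similarities. This is only a sketch under stated assumptions. The text encoder is replaced by a hypothetical mean-pooling stand-in (not CLIP's Transformer), the image feature and context vectors are assumed to share one dimension, and the weights, sizes, and tau=0.01 are made-up values.

```python
import math
import random

def make_linear(rng, out_dim, in_dim):
    """Random weight matrix (list of rows) and zero bias for a toy linear layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def linear(layer, x):
    """Compute W x + b."""
    weights, bias = layer
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]

def meta_net(layers, x):
    """Linear-ReLU-Linear bottleneck applied to one image feature x."""
    first, second = layers
    hidden = [max(0.0, h) for h in linear(first, x)]
    return linear(second, hidden)

def conditional_prompt(context, pi, class_embedding):
    """t_i(x) = {v_1 + pi, ..., v_M + pi, c_i}: every context token gets the same shift."""
    shifted = [[a + p for a, p in zip(v, pi)] for v in context]
    return shifted + [list(class_embedding)]

def toy_text_encoder(prompt):
    """Hypothetical stand-in for the text encoder g: mean-pools the prompt tokens."""
    return [sum(column) / len(prompt) for column in zip(*prompt)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cocoop_probs(x, context, class_embeddings, layers, tau=0.01):
    """Softmax over temperature-scaled cosine similarities, one prompt per class."""
    pi = meta_net(layers, x)  # input-conditional token for this image
    scores = [cosine(x, toy_text_encoder(conditional_prompt(context, pi, c))) / tau
              for c in class_embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
DIM, M, K = 32, 4, 3  # toy sizes: feature dim, context length, number of classes
layers = (make_linear(rng, DIM // 16, DIM), make_linear(rng, DIM, DIM // 16))
context = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(M)]
class_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(K)]
x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # stand-in for an image feature
probs = cocoop_probs(x, context, class_embeddings, layers)
```

Note that the text-side features must be recomputed for every image, because the prompt now depends on x. With a zero shift the prompt reduces to the static CoOp form.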

Instance-conditional context can generalize better because it shifts the focus away from a specific set of classes, which reduces overfitting, and toward each input instance.

2. Results

CLIP, CoOp and CoCoOp Comparisons
  • Learning-based models, i.e., CoOp and CoCoOp, are trained using only the base classes while evaluation is conducted on the base and new classes separately to test generalizability.
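The evaluation protocol can be sketched as follows. This assumes the classes are split into two halves, with the first half used as base classes and the rest as new classes, and that accuracy is then reported separately for each group. The split rule, the helper names, and the labels and predictions below are assumptions or made-up values for illustration, not results from the paper.

```python
def split_base_new(class_names):
    """Assumed split: first half (rounded up) is base, the remainder is new."""
    half = (len(class_names) + 1) // 2
    return class_names[:half], class_names[half:]

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

classes = ["a", "b", "c", "d", "e"]  # hypothetical class names
base, new = split_base_new(classes)

# Made-up test labels and predictions, evaluated separately per group.
base_acc = accuracy(["a", "b", "c", "a"], ["a", "b", "c", "b"])
new_acc = accuracy(["d", "d"], ["d", "e"])
```

Training would touch only samples whose label is in `base`; the `new` group is held out entirely until evaluation.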

Table 1(a): Averaged over the 11 datasets, CoCoOp improves accuracy on unseen classes from CoOp's 63.22% to 71.69%, which substantially narrows the gap with CLIP's manual prompts.

Comprehensive comparisons of CoCoOp and CoOp in the base-to-new generalization setting
  • Figure 3(a): CoCoOp improves accuracy on unseen classes by more than 10% over CoOp on 5 of the 11 datasets.
  • Figure 3(b): CoCoOp’s gains in generalization far outweigh its losses in base accuracy. Compared with CoOp, CoCoOp’s base-class accuracy drops on most datasets.
