Brief Review — Conditional Prompt Learning for Vision-Language Models
Conditional Context Optimization (CoCoOp) is Proposed
Conditional Prompt Learning for Vision-Language Models
Conditional Context Optimization (CoCoOp), by Nanyang Technological University
2022 CVPR, Over 800 Citations (Sik-Ho Tsang @ Medium)
- Conditional Context Optimization (CoCoOp) is proposed, which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector).
- Compared to CoOp’s static prompts, the dynamic prompts adapt to each instance and are thus less sensitive to class shift.
Outline
- Conditional Context Optimization (CoCoOp)
- Results
1. Conditional Context Optimization (CoCoOp)
1.1. CLIP
- CLIP uses a text encoder to encode texts/prompts into text representations and an image encoder to encode images into image representations.
- The whole CLIP model is pretrained with a cosine-similarity contrastive loss; the resulting encoders are then used for downstream tasks.
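- As a minimal sketch (assuming unit-normalized features and a fixed temperature; the function name is hypothetical, not CLIP's released API), zero-shot classification with CLIP reduces to a temperature-scaled softmax over cosine similarities:

```python
import torch
import torch.nn.functional as F

def zero_shot_probs(image_feat, text_feats, temperature=0.01):
    """Cosine-similarity classification in the style of CLIP.

    image_feat: (D,) image embedding from the image encoder.
    text_feats: (K, D) text embeddings, one per class prompt
                (e.g., "a photo of a {class}.") from the text encoder.
    """
    image_feat = F.normalize(image_feat, dim=-1)        # unit length
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feat @ text_feats.t() / temperature  # cosine / tau
    return logits.softmax(dim=-1)                       # (K,) class probabilities
```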
1.2. CoOp
- In CoOp, learnable context vectors are prepended to the class token, so that CoOp avoids manual prompt engineering by modeling the context words as continuous vectors learned end-to-end from data, as sketched below.
CoOp is a data-efficient approach, allowing the context vectors to be trained with only a few labeled images from a downstream dataset. However, CoOp does not generalize well to wider unseen classes within the same task.
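- A minimal sketch of CoOp's learnable prompt (class and variable names are hypothetical; only the M context vectors are trained, while CLIP's encoders stay frozen):

```python
import torch
import torch.nn as nn

class CoOpPrompt(nn.Module):
    """Learnable context [v1, ..., vM] prepended to each class-name embedding."""

    def __init__(self, n_ctx=16, ctx_dim=512):
        super().__init__()
        # The only trainable parameters; CLIP's encoders stay frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)

    def forward(self, class_embeds):
        # class_embeds: (K, L, D) token embeddings of the K class names.
        K = class_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(K, -1, -1)  # same context for every class
        return torch.cat([ctx, class_embeds], dim=1)   # prompts ti = {v1..vM, ci}
```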
1.3. CoCoOp
- CoCoOp further learns a lightweight neural network, called Meta-Net, to generate for each input a conditional token (vector), which is then combined with the context vectors.
- The Meta-Net output π = hθ(x) is added to each context vector, i.e., vm(x) = vm + π. With ti(x) = {v1(x), v2(x), …, vM(x), ci}, the prediction probability is computed as:

p(y|x) = exp(sim(x, g(ty(x))) / τ) / Σi exp(sim(x, g(ti(x))) / τ)

where g(·) is the text encoder, sim(·, ·) denotes cosine similarity, and τ is the temperature.
- In this work, the Meta-Net is built with a two-layer bottleneck structure (Linear-ReLU-Linear), with the hidden layer reducing the input dimension by 16×.
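- Below is a minimal PyTorch sketch of the Meta-Net and the instance-conditional shift vm(x) = vm + π; dimensions follow the paper's 16× bottleneck, but the names (MetaNet, conditional_context, vis_dim, ctx_dim) are hypothetical, not the paper's released code:

```python
import torch
import torch.nn as nn

class MetaNet(nn.Module):
    """Two-layer bottleneck (Linear-ReLU-Linear) mapping an image feature
    to a single conditional token pi = h(x)."""

    def __init__(self, vis_dim=512, ctx_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, vis_dim // 16),  # hidden layer: 16x reduction
            nn.ReLU(inplace=True),
            nn.Linear(vis_dim // 16, ctx_dim),
        )

    def forward(self, image_feat):
        return self.net(image_feat)  # (B, ctx_dim)

def conditional_context(ctx, image_feat, meta_net):
    # ctx: (M, D) learnable context vectors; image_feat: (B, vis_dim).
    pi = meta_net(image_feat)                  # (B, D) conditional token
    return ctx.unsqueeze(0) + pi.unsqueeze(1)  # vm(x) = vm + pi, shape (B, M, D)
```

- The same shift π is added to every context vector, so the prompt changes per image while the number of extra parameters stays small.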
Instance-conditional context can generalize better because it shifts the focus away from a specific set of classes to each input instance, reducing overfitting to the base classes.
2. Results
- Learning-based models, i.e., CoOp and CoCoOp, are trained using only the base classes while evaluation is conducted on the base and new classes separately to test generalizability.
Table 1(a): CoCoOp improves the accuracy on unseen classes from 63.22% (CoOp) to 71.69%, largely closing the gap with manual prompts.