Brief Review — Multimodal Chain-of-Thought Reasoning in Language Models
MultiModal-CoT for Multi-modal Text & Image Inputs
Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal-CoT, Anonymous Authors (Still Under Review at ICLR 2024)
2023 arXiv v3, Over 110 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] [VLMo] [BEiT-3] [GLIP] 2023 [GPT-4] [GPT-4V(ision)]
==== My Other Paper Readings Are Also Over Here ====
- Existing CoT studies have primarily focused on the language modality.
- In this paper, Multimodal-CoT is proposed, which incorporates the language (text) and vision (image) modalities into a two-stage framework that separates rationale generation from answer inference. In this way, answer inference can leverage better rationales that are generated from multimodal information.
- (This paper is currently under review at ICLR 2024.)
Outline
- Multimodal-CoT
- Results
1. Multimodal-CoT
- Multimodal-CoT consists of two operation stages: (i) rationale generation and (ii) answer inference. Both stages share the same model structure but differ in the input X and output Y.
- These two models (stages) are trained independently.
1.1. Rationale Generation Stage
- In the rationale generation stage, the model is fed with X = {X¹_language, X_vision}, where X¹_language represents the language input in the first stage and X_vision represents the vision input, i.e., the image.
- X can be instantiated as a concatenation of the question, context, and options of a multiple-choice reasoning problem, as sketched below.
- The goal is to learn a rationale generation model R = F(X), where R is the rationale.
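For concreteness, below is a minimal sketch of how the first-stage language input X¹_language could be assembled for a ScienceQA-style multiple-choice problem. The helper name `build_language_input` and the exact template (field labels, ordering, separators) are illustrative assumptions, not the authors' exact format.

```python
# Hypothetical helper for building the first-stage language input X1_language.
# The template (field labels, separators) is an assumption for illustration.
def build_language_input(question: str, context: str, options: list[str]) -> str:
    option_str = " ".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\nContext: {context}\nOptions: {option_str}"

# Example usage with a toy multiple-choice problem.
x1_language = build_language_input(
    question="Which property do these objects have in common?",
    context="Select the best answer.",
    options=["hard", "soft", "stretchy"],
)
print(x1_language)
```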
1.2. Answer Inference Stage
- In the answer inference stage, the rationale R is appended to the original language input X¹_language to form the language input of the second stage, X²_language = [X¹_language, R].
- Together with X_vision, this gives the updated input X′ = {X²_language, X_vision}.
- The answer inference model then infers the final answer A = F(X′).
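Putting the two stages together, the inference flow can be sketched as follows. `rationale_model`, `answer_model`, and their `generate(text, image)` interface are hypothetical placeholders for the two independently trained models; only the data flow mirrors the description above.

```python
# A minimal sketch of the two-stage Multimodal-CoT inference flow.
# `rationale_model` and `answer_model` are hypothetical objects exposing a
# `generate(text, image)` method; they stand in for the two trained models.
def multimodal_cot_inference(rationale_model, answer_model, x1_language, x_vision):
    # Stage 1: rationale generation, R = F(X) with X = {X1_language, X_vision}.
    rationale = rationale_model.generate(text=x1_language, image=x_vision)

    # Stage 2: append the rationale to the original language input to form X',
    # then infer the final answer A = F(X').
    # The "Solution:" separator is an assumption for illustration.
    x2_language = f"{x1_language}\nSolution: {rationale}"
    answer = answer_model.generate(text=x2_language, image=x_vision)
    return rationale, answer
```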
1.3. 1-Stage vs. 2-Stage
- The 1-stage reasoning and explanation variants, which obtain the answer directly from the input, achieve low accuracy, even lower than No-CoT.
- The proposed 2-stage framework boosts the accuracy back. With vision features added, the accuracy is further boosted to 85.31%.
1.4. Model Architecture
- The language encoder is a Transformer that encodes the language representation, whereas the vision extractor is a Vision Transformer (ViT) that encodes the image representation.
- After obtaining the language and vision representations, a single-head attention network is used to correlate text tokens with image patches, where the query Q is the language representation H_language and the key K and value V are the vision representation H_vision. The attention output is H^attn_vision = Softmax(QKᵀ/√d_k)·V.
- Then, the gated fusion mechanism is applied: λ = Sigmoid(W_l·H_language + W_v·H^attn_vision), and H_fuse = (1 − λ)·H_language + λ·H^attn_vision.
- Finally, the fused output H_fuse is fed into the Transformer decoder to predict the target Y.
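Below is a minimal PyTorch sketch of the cross-attention and gated fusion described above. The class name, linear-projection layout, and hidden size are illustrative assumptions; the Transformer language encoder/decoder and the ViT feature extractor are omitted.

```python
# A minimal sketch of single-head cross-attention (text queries, image keys/values)
# followed by gated fusion, assuming d_model-sized hidden states from both encoders.
import math
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)       # query projection (language)
        self.w_k = nn.Linear(d_model, d_model)       # key projection (vision)
        self.w_v = nn.Linear(d_model, d_model)       # value projection (vision)
        self.w_l = nn.Linear(d_model, d_model)       # gate projection for language
        self.w_v_gate = nn.Linear(d_model, d_model)  # gate projection for attended vision

    def forward(self, h_language: torch.Tensor, h_vision: torch.Tensor) -> torch.Tensor:
        # Single-head attention: Q from text tokens, K/V from image patches.
        q, k, v = self.w_q(h_language), self.w_k(h_vision), self.w_v(h_vision)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        h_attn_vision = scores.softmax(dim=-1) @ v

        # Gated fusion: lambda decides, per position, how much vision to mix in.
        lam = torch.sigmoid(self.w_l(h_language) + self.w_v_gate(h_attn_vision))
        h_fuse = (1 - lam) * h_language + lam * h_attn_vision
        return h_fuse  # fed into the Transformer decoder to predict Y

# Example shapes: batch of 2, 20 text tokens, 49 image patches, d_model = 768.
fusion = GatedCrossModalFusion(d_model=768)
h_fuse = fusion(torch.randn(2, 20, 768), torch.randn(2, 49, 768))
```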
2. Results
2.1. ScienceQA
Multimodal-CoT (Large) achieves substantial performance gains over the previously published best model (86.54% → 90.45%).
2.2. A-OKVQA
The efficacy of Multimodal-CoT is further supported by the results obtained from the A-OKVQA benchmark.
2.3. Ablation Study
The proposed approach is generally effective for the widely used backbone models.
Among the vision feature extractors compared, ViT achieves relatively better performance.