Brief Review — Multimodal Chain-of-Thought Reasoning in Language Models
MultiModal-CoT for Multi-modal Text & Image Inputs
Multimodal Chain-of-Thought Reasoning in Language Models
Multimodal-CoT, Anonymous Authors (Still Under Review at ICLR 2024)
2023 arXiv v3, Over 110 Citations (Sik-Ho Tsang @ Medium)
Visual/Vision/Video Language Model (VLM)
2017 … 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] [VLMo] [BEiT-3] [GLIP] 2023 [GPT-4] [GPT-4V(ision)]
==== My Other Paper Readings Are Also Over Here ====
- Existing CoT studies have primarily focused on the language modality.
- In this paper, Multimodal-CoT is proposed, which incorporates the language (text) and vision (image) modalities into a two-stage framework that separates rationale generation from answer inference. In this way, answer inference can leverage better rationales that are generated from multimodal information.
- (This paper is currently under review at ICLR 2024.)
Outline
- Multimodal-CoT
- Results
1. Multimodal-CoT
- Multimodal-CoT consists of two operation stages: (i) rationale generation and (ii) answer inference. Both stages share the same model structure but differ in the input X and output Y.
- These two models (stages) are trained independently.
1.1. Rationale Generation Stage
- In the rationale generation stage, the model is fed with X = {X¹_language, X_vision}, where X¹_language represents the language input in the first stage and X_vision represents the vision input, i.e., the image.
- X can be instantiated as a concatenation of the question, context, and options of a multiple-choice reasoning problem, as sketched below.
- The goal is to learn a rationale generation model R = F(X), where R is the rationale.
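For concreteness, below is a minimal sketch of how the first-stage language input X¹_language could be assembled for a ScienceQA-style multiple-choice problem. The helper name `build_language_input` and the exact template (field labels, ordering, separators) are illustrative assumptions, not the authors' exact format.

```python
# Hypothetical helper for building the first-stage language input X1_language.
# The template (field labels, separators) is an assumption for illustration.
def build_language_input(question: str, context: str, options: list[str]) -> str:
    option_str = " ".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\nContext: {context}\nOptions: {option_str}"

# Example usage with a toy multiple-choice problem.
x1_language = build_language_input(
    question="Which property do these objects have in common?",
    context="Select the best answer.",
    options=["hard", "soft", "stretchy"],
)
print(x1_language)
```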
1.2. Answer Inference Stage
- In the answer inference stage, the rationale R is appended to the original language input X¹_language to form the language input of the second stage, X²_language = [X¹_language, R].
- Together with X_vision, this gives the updated input X′ = {X²_language, X_vision}.
- The answer inference model then infers the final answer A = F(X′).
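Putting the two stages together, the inference flow can be sketched as follows. `rationale_model`, `answer_model`, and their `generate(text, image)` interface are hypothetical placeholders for the two independently trained models; only the data flow mirrors the description above.

```python
# A minimal sketch of the two-stage Multimodal-CoT inference flow.
# `rationale_model` and `answer_model` are hypothetical objects exposing a
# `generate(text, image)` method; they stand in for the two trained models.
def multimodal_cot_inference(rationale_model, answer_model, x1_language, x_vision):
    # Stage 1: rationale generation, R = F(X) with X = {X1_language, X_vision}.
    rationale = rationale_model.generate(text=x1_language, image=x_vision)

    # Stage 2: append the rationale to the original language input to form X',
    # then infer the final answer A = F(X').
    # The "Solution:" separator is an assumption for illustration.
    x2_language = f"{x1_language}\nSolution: {rationale}"
    answer = answer_model.generate(text=x2_language, image=x_vision)
    return rationale, answer
```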
1.3. 1-Stage vs. 2-Stage
- The 1-stage reasoning and explanation variants, which obtain the answer directly from the input, achieve low accuracy, even lower than No-CoT.
- The proposed 2-stage framework boosts the accuracy back. With vision features added, the accuracy is further boosted to 85.31%.
1.4. Model Architecture
- The language encoder is a Transformer that encodes the language representation, whereas the vision extractor is a Vision Transformer (ViT) that encodes the image representation.
- After obtaining the language and vision representations, a single-head attention network is used to correlate text tokens with image patches, where the query Q is the language representation H_language and the key K and value V are the vision representation H_vision. The attention output is H^attn_vision = Softmax(QKᵀ/√d_k)·V.
- Then, the gated fusion mechanism is applied: λ = Sigmoid(W_l·H_language + W_v·H^attn_vision), and H_fuse = (1 − λ)·H_language + λ·H^attn_vision.
- Finally, the fused output H_fuse is fed into the Transformer decoder to predict the target Y.
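Below is a minimal PyTorch sketch of the cross-attention and gated fusion described above. The class name, linear-projection layout, and hidden size are illustrative assumptions; the Transformer language encoder/decoder and the ViT feature extractor are omitted.

```python
# A minimal sketch of single-head cross-attention (text queries, image keys/values)
# followed by gated fusion, assuming d_model-sized hidden states from both encoders.
import math
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)       # query projection (language)
        self.w_k = nn.Linear(d_model, d_model)       # key projection (vision)
        self.w_v = nn.Linear(d_model, d_model)       # value projection (vision)
        self.w_l = nn.Linear(d_model, d_model)       # gate projection for language
        self.w_v_gate = nn.Linear(d_model, d_model)  # gate projection for attended vision

    def forward(self, h_language: torch.Tensor, h_vision: torch.Tensor) -> torch.Tensor:
        # Single-head attention: Q from text tokens, K/V from image patches.
        q, k, v = self.w_q(h_language), self.w_k(h_vision), self.w_v(h_vision)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        h_attn_vision = scores.softmax(dim=-1) @ v

        # Gated fusion: lambda decides, per position, how much vision to mix in.
        lam = torch.sigmoid(self.w_l(h_language) + self.w_v_gate(h_attn_vision))
        h_fuse = (1 - lam) * h_language + lam * h_attn_vision
        return h_fuse  # fed into the Transformer decoder to predict Y

# Example shapes: batch of 2, 20 text tokens, 49 image patches, d_model = 768.
fusion = GatedCrossModalFusion(d_model=768)
h_fuse = fusion(torch.randn(2, 20, 768), torch.randn(2, 49, 768))
```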
2. Results
2.1. ScienceQA
Multimodal-CoT (Large) achieves substantial performance gains over the previously published best model (86.54% → 90.45%).
2.2. A-OKVQA
The efficacy of Multimodal-CoT is further supported by the results obtained from the A-OKVQA benchmark.
2.3. Ablation Study
The proposed approach is generally effective for the widely used backbone models.
Among the vision feature extractors compared, ViT achieves relatively better performance.