Brief Review — Multimodal Chain-of-Thought Reasoning in Language Models

MultiModal-CoT for Multi-modal Text & Image Inputs

Sik-Ho Tsang
Dec 22, 2023
Example of the multimodal CoT task.

Multimodal Chain-of-Thought Reasoning in Language Models, Anonymous Authors (still under review at ICLR 2024)
2023 arXiv v3, Over 110 Citations (Sik-Ho Tsang @ Medium)

Visual/Vision/Video Language Model (VLM)
2017 … 2022 [FILIP] [Wukong] [LiT] [Flamingo] [FLAVA] [SimVLM] [VLMo] [BEiT-3] [GLIP] 2023 [GPT-4] [GPT-4V(ision)]

  • Existing CoT studies have primarily focused on the language modality.
  • In this paper, Multimodal-CoT is proposed, which incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information.
  • (This paper is currently under review at ICLR 2024.)


  1. Multimodal-CoT
  2. Results

1. Multimodal-CoT

Multimodal-CoT Framework
  • Multimodal-CoT consists of two operation stages: (i) rationale generation and (ii) answer inference. Both stages share the same model structure but differ in the input X and output Y.
  • These 2 models (stages) are trained independently.

1.1. Rationale Generation Stage

  • In the rationale generation stage, the model is fed with:

$X = \{X^{1}_{\text{language}}, X_{\text{vision}}\}$

where $X^{1}_{\text{language}}$ represents the language input in the first stage and $X_{\text{vision}}$ represents the vision input, i.e., the image.

  • X can be instantiated as a concatenation of question, context, and options of a multiple choice reasoning problem.
  • The goal is to learn a rationale generation model R = F(X), where R is the rationale.
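To make the input construction concrete, here is a minimal sketch of how the language part of X might be assembled from a multiple-choice problem. The function name `build_stage1_input`, the field prefixes ("Question:", "Context:", "Options:"), and the (A)/(B) option labels are assumptions chosen for illustration; they are not taken from the paper's code.

```python
from typing import Sequence


def build_stage1_input(question: str, context: str, options: Sequence[str]) -> str:
    """Concatenate question, context, and options into one text input.

    The prefixes and the (A), (B), ... labels are illustrative assumptions.
    """
    # Label each option with a capital letter: (A) ..., (B) ..., and so on.
    labelled = " ".join(
        f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)
    )
    return f"Question: {question}\nContext: {context}\nOptions: {labelled}"


example = build_stage1_input(
    "Which property do these two objects have in common?",
    "Select the better answer.",
    ["soft", "salty"],
)
print(example)
```

The image is not part of this string; it is passed to the model separately as the vision input.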

1.2. Answer Inference Stage

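In the answer inference stage, the rationale R is appended to the original language input $X^{1}_{\text{language}}$ to form the second-stage language input $X^{2}_{\text{language}} = X^{1}_{\text{language}} \circ R$, where $\circ$ denotes concatenation.

  • Together with $X_{\text{vision}}$, this gives the updated input $X' = \{X^{2}_{\text{language}}, X_{\text{vision}}\}$.

The answer inference model then infers the final answer A = F(X′).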
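The two stages can be chained as in the rough sketch below. The two models are stand-in stubs rather than trained networks, and the `Model` signature, the `TwoStageResult` container, and the "Solution:" separator used to append the rationale are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical model interface: (language input, vision input) -> generated text.
Model = Callable[[str, Any], str]


@dataclass
class TwoStageResult:
    rationale: str
    answer: str


def multimodal_cot(
    x1_language: str,
    x_vision: Any,
    rationale_model: Model,
    answer_model: Model,
) -> TwoStageResult:
    # Stage 1: generate the rationale R from the language and vision inputs.
    rationale = rationale_model(x1_language, x_vision)
    # Stage 2: append R to the original language input (separator is an assumption).
    x2_language = f"{x1_language}\nSolution: {rationale}"
    # The same vision input is reused for answer inference.
    answer = answer_model(x2_language, x_vision)
    return TwoStageResult(rationale=rationale, answer=answer)


# Stub models standing in for the two independently trained networks.
def stub_rationale_model(text: str, image: Any) -> str:
    return "Both objects are soft."


def stub_answer_model(text: str, image: Any) -> str:
    # Only inspects the appended rationale, to show that stage 2 depends on it.
    rationale_part = text.split("Solution:")[-1]
    return "The answer is (A)." if "soft" in rationale_part else "The answer is (B)."


result = multimodal_cot(
    "Question: Which property do these two objects have in common? "
    "Options: (A) soft (B) salty",
    None,  # placeholder for the image features
    stub_rationale_model,
    stub_answer_model,
)
print(result.rationale)
print(result.answer)
```

Keeping the two models behind the same interface mirrors the statement that both stages share one model structure and differ only in their inputs and outputs.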

1.3. 1-Stage v.s. 2-Stage

1-Stage v.s. 2-Stage
  • 1-stage reasoning and explanation, which obtain the answer directly from the input in a single pass, yield low accuracy, even lower than No-CoT.

The proposed 2-stage framework recovers the accuracy. With vision features added, the accuracy is further boosted to 85.31%.

1.4. Model Architecture

The language encoder is a Transformer that encodes the language representation, whereas the vision extractor is a Vision Transformer (ViT) that encodes the image representation.

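  • After obtaining language and vision representations, a single-head attention network is used to correlate text tokens with image patches. The attention output $H^{\text{attn}}_{\text{vision}}$ is:

$H^{\text{attn}}_{\text{vision}} = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$

where Q, K, and V correspond to $H_{\text{language}}$, $H_{\text{vision}}$, and $H_{\text{vision}}$, respectively.

  • Then, the gated fusion mechanism is applied:

$\lambda = \text{Sigmoid}(W_l H_{\text{language}} + W_v H^{\text{attn}}_{\text{vision}})$

$H_{\text{fuse}} = (1 - \lambda) \cdot H_{\text{language}} + \lambda \cdot H^{\text{attn}}_{\text{vision}}$

  • Finally, the fused output $H_{\text{fuse}}$ is fed into the Transformer decoder to predict the target Y.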
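The attention and gated fusion steps can be sketched in plain Python as below. This is a toy illustration, not the paper's implementation: it uses scaled dot-product attention with text tokens as queries and image patches as keys and values, followed by a sigmoid gate that mixes the language and attended vision features. The helper names, the row-vector convention (H times W), the tiny shapes, and the zero-valued weight matrices are all assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain matrix product: rows of a against columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(h_language: Matrix, h_vision: Matrix) -> Matrix:
    """Single-head attention with Q = H_language and K = V = H_vision."""
    d_k = len(h_language[0])
    scores = matmul(h_language, transpose(h_vision))  # (tokens, patches)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, h_vision)  # (tokens, hidden)


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(h_language: Matrix, h_attn: Matrix, w_l: Matrix, w_v: Matrix) -> Matrix:
    """H_fuse = (1 - gate) * H_language + gate * H_attn, gate from a sigmoid."""
    a = matmul(h_language, w_l)
    b = matmul(h_attn, w_v)
    fused = []
    for lang_row, attn_row, a_row, b_row in zip(h_language, h_attn, a, b):
        gate = [sigmoid(x + y) for x, y in zip(a_row, b_row)]
        fused.append(
            [(1 - g) * l + g * v for g, l, v in zip(gate, lang_row, attn_row)]
        )
    return fused


# Toy inputs: 2 text tokens and 3 image patches, hidden size 2.
h_language = [[1.0, 0.0], [0.0, 1.0]]
h_vision = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]  # untrained placeholder weights

h_attn = attend(h_language, h_vision)
h_fuse = gated_fusion(h_language, h_attn, zero, zero)
print(h_fuse)
```

With zero weights the gate is 0.5 everywhere, so the fused output is simply the average of the language and attended vision features; learned weights would let the model decide per dimension how much visual information to admit.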

2. Results

2.1. ScienceQA


Multimodal-CoT (Large) achieves substantial performance gains over the prior best published model (86.54%→90.45%).

2.2. A-OKVQA


The efficacy of Multimodal-CoT is further supported by the results obtained from the A-OKVQA benchmark.

2.3. Ablation Study

Different backbone LMs.

The proposed approach is generally effective for the widely used backbone models.

Different vision features

ViT achieves relatively better performance.


