Review — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain-of-Thought Prompting Improves Upon Standard Prompting

Sik-Ho Tsang
3 min read · Aug 12, 2023

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting, by Google Research, Brain Team
2022 NeurIPS, Over 1000 Citations (Sik-Ho Tsang @ Medium)


  • Chain-of-thought prompting is proposed: a series of intermediate reasoning steps that significantly improves the ability of large language models to perform complex reasoning.
  • Indeed, it is simple but very effective.

Outline

  1. Chain-of-Thought Prompting
  2. Results

1. Chain-of-Thought Prompting

Chain-of-Thought Prompting
  • In standard prompting, a series of {Q A Q A … Q} is given as the prompt, and the model outputs the answer to the last question.

In Chain-of-Thought Prompting, a series of {Q T A Q T A … Q} is given as the prompt, where T is the chain of thought, and the model outputs both the chain of thought and the answer to the last question, as sketched below.
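
To make the two formats concrete, below is a minimal Python sketch of how the few-shot prompts can be assembled. The exemplar is the tennis-ball example from the paper's Figure 1; the test question, the build_prompt helper, and the variable names are written here only for illustration and are not part of the paper.

# Minimal sketch of the two prompt formats.
# Only the exemplar text comes from the paper's Figure 1; the helper is an assumption.

exemplar_q = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
              "Each can has 3 tennis balls. How many tennis balls does he have now?")

standard_a = "The answer is 11."                      # A only
cot_a = ("Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
         "5 + 6 = 11. The answer is 11.")             # T followed by A

def build_prompt(exemplars, question):
    # Concatenate the few-shot exemplars and append the final, unanswered question.
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

test_q = ("The cafeteria had 23 apples. If they used 20 to make lunch "
          "and bought 6 more, how many apples do they have?")

standard_prompt = build_prompt([(exemplar_q, standard_a)], test_q)  # {Q A … Q}
cot_prompt = build_prompt([(exemplar_q, cot_a)], test_q)            # {Q T A … Q}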

Chain-of-Thought Examples
  • Green is for maths, orange is for commonsense reasoning, and blue is for symbolic reasoning.

2. Results

2.1. Maths

Maths Problem
  • There are 3 takeaways:
  1. Chain-of-thought prompting does not positively impact performance for small models, and only yields performance gains when used with models of roughly 100B parameters or more.
  2. Chain-of-thought prompting has larger performance gains for more-complicated problems.
  3. Chain-of-thought prompting via GPT-3 175B and PaLM 540B compares favorably to prior state of the art, which typically finetunes a task-specific model on a labeled training dataset.
Robustness

Prompts written by different annotators still yield better results than standard prompting.

Using three sets of eight exemplars randomly sampled from the GSM8K training set as prompts also yields better results.

2.2. Commonsense Reasoning

Commonsense Reasoning

For all tasks, scaling up model size improved the performance of standard prompting; chain-of-thought prompting led to further gains.

2.3. Symbolic Reasoning

Symbolic Reasoning
  • Two tasks are used for evaluation (a small sketch of both rules follows this list):
  1. Last letter concatenation. This task asks the model to concatenate the last letters of words in a name (e.g., “Amy Brown” → “yn”).
  2. Coin flip. This task asks the model to answer whether a coin is still heads up after people either flip or don’t flip the coin (e.g., “A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?” → “no”).
  • In-domain test set: examples had the same number of steps as the training/few-shot exemplars.
  • Out-of-domain (OOD) test set: examples had more steps than those in the exemplars.
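
Both tasks can be generated programmatically. Below is a minimal Python sketch of their ground-truth rules as described above; the function names and the example calls are my own assumptions.

# Illustrative ground-truth rules for the two symbolic-reasoning tasks
# (a sketch of the rules described above; names are assumptions, not from the paper).

def last_letter_concatenation(name: str) -> str:
    # "Amy Brown" -> "yn": take the last letter of each word and join them.
    return "".join(word[-1] for word in name.split())

def coin_still_heads_up(flips: list) -> str:
    # flips[i] is True if person i flips the coin, False otherwise.
    # The coin starts heads up; an even number of flips leaves it heads up.
    return "yes" if sum(flips) % 2 == 0 else "no"

print(last_letter_concatenation("Amy Brown"))  # yn
print(coin_still_heads_up([True, False]))      # no (Phoebe flips, Osvaldo does not)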

With PaLM 540B, chain-of-thought prompting leads to almost 100% solve rates (note that standard prompting already solves coin flip with PaLM 540B, though not with LaMDA 137B).

  • Small models, however, still fail.

As for the OOD evaluations, standard prompting fails for both tasks. With chain-of-thought prompting, language models achieve upward scaling curves.
