Brief Review — Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self Consistency is Proposed, Outperforms CoT Prompting

Sik-Ho Tsang
3 min read · Dec 6, 2023

Self-Consistency Improves Chain of Thought Reasoning in Language Models
Self Consistency, by Google Research
2023 ICLR, Over 350 Citations (Sik-Ho Tsang @ Medium)

LM Tuning / Prompting
2020 [Human Feedback Model] … 2023 [LIMA] [SELF-INSTRUCT] [LLaMA-Adapter]
==== My Other Paper Readings Are Also Over Here ====

  • Self-consistency is proposed to replace the naive greedy decoding used in chain-of-thought (CoT) prompting.
  • It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths.
  • Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.


  1. Self Consistency
  2. Results

1. Self Consistency

Self Consistency
  • (It is assumed that CoT is already known.)

1) First, a language model is prompted with a set of manually written chain-of-thought exemplars.

2) Next, a set of candidate outputs is sampled from the language model’s decoder, generating a diverse set of candidate reasoning paths.

3) Finally, the answers are aggregated by marginalizing out the sampled reasoning paths and choosing the answer that is the most consistent among the generated answers.
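The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical sampler that draws one chain-of-thought completion from the language model and returns the parsed final answer.

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=10, temperature=0.7):
    """Minimal sketch of self-consistency decoding.

    `generate` is a hypothetical function assumed to sample one
    chain-of-thought completion at the given temperature and return
    the final answer parsed from that completion.
    """
    # Step 2: sample a diverse set of candidate reasoning paths.
    answers = [generate(prompt, temperature=temperature)
               for _ in range(n_samples)]
    # Step 3: marginalize out the reasoning paths by taking the
    # most consistent (most frequent) final answer.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Toy usage with a stubbed sampler: three reasoning paths reach "18",
# one reaches "20", so the majority answer "18" is returned.
paths = iter(["18", "18", "20", "18"])
def fake_generate(prompt, temperature=0.7):
    return next(paths)

print(self_consistency(fake_generate, "Q: ...", n_samples=4))  # prints 18
```

Majority vote over the final answers is the simplest aggregation; the paper also considers weighting paths by model probability, but finds the unweighted vote works just as well.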

2. Results

Majority Vote is the best

Majority vote is the best aggregation strategy.

CoT vs Self Consistency

Self-consistency significantly improves arithmetic reasoning performance over chain-of-thought prompting across all four language models.

Similarly, self-consistency yields large gains across all four language models and obtains SoTA results on 5 out of 6 tasks.

Sampling a larger number of reasoning paths (e.g., 40) leads to consistently better performance.

For some tasks (e.g., ANLI-R1, e-SNLI, RTE), adding chain-of-thought hurts performance compared to standard prompting (GPT-3), but self-consistency robustly boosts performance and outperforms standard prompting.


