# Brief Review — Chinchilla: Training Compute-Optimal Large Language Models

## Chinchilla, 70B Model, Much Smaller Than GPT-3 & MT-NLG 530B

Training Compute-Optimal Large Language Models (Chinchilla), by DeepMind, 2022 NeurIPS, Over 170 Citations (Sik-Ho Tsang @ Medium)


- The optimal **model size** and **number of training tokens** for a Transformer language model under a **given compute budget** are investigated, by training over **400 language models** ranging **from 70 million to over 16 billion parameters** on **5 to 500 billion tokens**.
- It is found that for compute-optimal training, **the model size and the number of training tokens should be scaled equally**: for every doubling of model size, the number of training tokens should also be doubled.
- (It is quite interesting that some model names are drawn from animal species, such as Jurassic-1, Gopher, and Flamingo.)

# Outline

1. **Estimating the Optimal Parameter/Training Tokens Allocation & Chinchilla**
2. **Chinchilla Results**

# 1. Estimating the Optimal Parameter/Training Tokens Allocation & Chinchilla

## 1.1. Problem

Given a fixed FLOPs budget, how should one trade off model size against the number of training tokens?

- To answer this question, the **final pre-training loss 𝐿(𝑁, 𝐷)** is modelled as a function of the **number of model parameters 𝑁** and the **number of training tokens 𝐷**.
- Since the **computational budget 𝐶** is a **deterministic function FLOPs(𝑁, 𝐷)** of the number of seen training tokens and model parameters, we are interested in **minimizing 𝐿 under the constraint FLOPs(𝑁, 𝐷) = 𝐶**:

    𝑁𝑜𝑝𝑡(𝐶), 𝐷𝑜𝑝𝑡(𝐶) = argmin over {𝑁, 𝐷 : FLOPs(𝑁, 𝐷) = 𝐶} of 𝐿(𝑁, 𝐷)

- where the functions **𝑁𝑜𝑝𝑡(𝐶)** and **𝐷𝑜𝑝𝑡(𝐶)** describe the **optimal allocation of a computational budget 𝐶**.

These functions are **empirically estimated** based on the losses of over **400 models**, ranging from **under 70M to over 16B parameters**, trained on **5B to over 400B tokens**.
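The equal-scaling rule can be made concrete with the common approximation that training compute is C ≈ 6·N·D FLOPs. The sketch below is illustrative, not the paper's fitted law: the function names are mine, the exponents are the paper's roughly-equal ≈0.5/≈0.5 split, and the curve is anchored at Chinchilla's own operating point (70B parameters, 1.4T tokens).

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the common rule C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

def compute_optimal(c_budget: float) -> tuple[float, float]:
    """Split a FLOP budget so N_opt ~ C^0.5 and D_opt ~ C^0.5
    (the paper's roughly equal exponents), anchored at Chinchilla's
    70B parameters / 1.4T tokens (an illustrative anchor, not a fit)."""
    n_ref, d_ref = 70e9, 1.4e12           # Chinchilla's size and token count
    c_ref = train_flops(n_ref, d_ref)     # its approximate budget, ~5.9e23
    scale = (c_budget / c_ref) ** 0.5
    return n_ref * scale, d_ref * scale

# Quadrupling the budget doubles BOTH model size and token count,
# i.e. every doubling of model size pairs with a doubling of tokens.
n, d = compute_optimal(4 * train_flops(70e9, 1.4e12))
print(n, d)  # ~140e9 parameters, ~2.8e12 tokens
```

Note the design consequence: along this frontier the tokens-per-parameter ratio stays constant (~20 for Chinchilla), rather than growing model size faster than data as earlier scaling laws suggested.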

## 1.2. Parameter/Tokens Allocation Results

**Three approaches** are used to **estimate the optimal parameter/training tokens allocation**:

- **Approach 1**: **Fix model sizes** and **vary the number of training tokens**.
- **Approach 2: IsoFLOP profiles**: The **model size is varied** for **a fixed set of 9 different training FLOP counts**.
- **Approach 3: Fitting a parametric loss function**: All final losses from the experiments in Approaches 1 & 2 are **modelled as a parametric function** of the model parameter count and the number of seen tokens.
- (Please feel free to read the paper directly for more details about the 3 approaches.)

- The predictions from the **three different approaches** are overlaid, along with projections from Kaplan et al. (2020), as above.
- It is found that **all three methods predict that current large models should be substantially smaller** and therefore trained much longer than is currently done.
- Based on the estimated compute-optimal frontier, for the compute budget used to **train Gopher**, **an optimal model should be 4 times smaller**, while being trained on **4 times more tokens**.
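The Gopher comparison can be sanity-checked with a few lines of arithmetic under the same C ≈ 6·N·D approximation (Gopher's 280B parameters and 300B training tokens are from the Gopher paper; the 6ND rule is an approximation, not the paper's exact accounting):

```python
# Approximate training compute: C ~ 6 * N * D
gopher_params, gopher_tokens = 280e9, 300e9
gopher_budget = 6 * gopher_params * gopher_tokens   # ~5.04e23 FLOPs

# A 4x-smaller model at the same budget can afford 4x the tokens:
n_opt = gopher_params / 4                  # 70B parameters
d_opt = gopher_budget / (6 * n_opt)        # 1.2e12 tokens = 4 * 300B
print(n_opt, d_opt)
```

This is exactly the regime Chinchilla occupies: roughly Gopher's budget spent on a 4x-smaller model fed 4x more data.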

## 1.3. Chinchilla

A more compute-optimal **70B model**, called **Chinchilla**, is trained on **1.4 trillion tokens**. Not only does **Chinchilla outperform its much larger counterpart, Gopher**, but its **reduced model size** also **reduces inference cost considerably**.

# 2. Results

## 2.1. Evaluation Tasks

## 2.2. Performance of Some Evaluation Tasks

On all subsets, **Chinchilla** outperforms **Gopher**.

On the MMLU benchmark, **Chinchilla** significantly outperforms **Gopher** despite being much smaller, with an **average accuracy of 67.6% (improving upon Gopher by 7.6%)**. Chinchilla performs better on 51 of the 57 individual tasks, the same on 2, and worse on only 4.

On RACE-h and RACE-m, **Chinchilla** considerably improves performance over **Gopher**. On LAMBADA, **Chinchilla** outperforms both **Gopher** and **MT-NLG 530B**.

**Chinchilla** outperforms **Gopher** on all but four of the **BIG-bench** tasks considered.

(For other tasks, please feel free to read the paper directly.)