# Brief Review — Chinchilla: Training Compute-Optimal Large Language Models

## Chinchilla, 70B Model, Much Smaller Than GPT-3 & MT-NLG 530B

---

Training Compute-Optimal Large Language Models,
Chinchilla, by DeepMind,
2022 NeurIPS, Over 170 Citations (Sik-Ho Tsang @ Medium)
Large Language Model, LLM, Foundation Model


• The optimal model size and number of training tokens for a Transformer language model under a given compute budget are investigated by training over 400 language models, ranging from 70 million to over 16 billion parameters, on 5 to 500 billion tokens.
• It is found that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
• (It is quite interesting that some model namings are going to animal species, such as Jurassic-1, Gopher, and Flamingo.)
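The equal-scaling rule can be sketched numerically. The 6·N·D FLOP estimate and the roughly 20-tokens-per-parameter ratio used below are common rules of thumb (the latter implied by Chinchilla's 70B parameters / 1.4T tokens), not the paper's exact fitted frontier:

```python
def compute_optimal(C, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Allocate a FLOP budget C between parameters N and training tokens D.

    Assumes two rules of thumb (not the paper's exact accounting):
    C ≈ 6 * N * D, and D ≈ 20 * N — the tokens-per-parameter ratio
    implied by Chinchilla's 70B parameters / 1.4T tokens.
    """
    N = (C / (flops_per_param_token * tokens_per_param)) ** 0.5
    D = tokens_per_param * N
    return N, D

# 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs recovers Chinchilla's own allocation:
N, D = compute_optimal(5.88e23)   # N ≈ 70e9 params, D ≈ 1.4e12 tokens
```

Because N and D both scale as C^0.5 here, quadrupling the budget doubles both the model size and the token count, matching the paper's "scale equally" finding.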

# Outline

1. Estimating the optimal parameter/training tokens allocation & Chinchilla
2. Chinchilla Results

# 1. Estimating the optimal parameter/training tokens allocation & Chinchilla

## 1.1. Problem

Given a fixed FLOPs budget, how should one trade off model size against the number of training tokens?

• To answer this question, the final pre-training loss 𝐿(𝑁,𝐷) is modelled as a function of the number of model parameters 𝑁, and the number of training tokens 𝐷.
• Since the computational budget 𝐶 is a deterministic function FLOPs(𝑁,𝐷) of the number of seen training tokens and model parameters, we are interested in minimizing 𝐿 under the constraint FLOPs(𝑁,𝐷)=𝐶:

𝑁𝑜𝑝𝑡(𝐶), 𝐷𝑜𝑝𝑡(𝐶) = argmin_{𝑁,𝐷 : FLOPs(𝑁,𝐷)=𝐶} 𝐿(𝑁,𝐷)

• where the functions 𝑁𝑜𝑝𝑡(𝐶) and 𝐷𝑜𝑝𝑡(𝐶) describe the optimal allocation of a computational budget 𝐶.

These functions are empirically estimated based on the losses of over 400 models, ranging from under 70M to over 16B parameters, and trained on 5B to over 400B tokens.
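A minimal numerical sketch of this constrained minimisation, assuming the parametric loss form 𝐿 = 𝐸 + 𝐴/𝑁^α + 𝐵/𝐷^β (the constants are the paper's reported fit, used here only for illustration) and the common C ≈ 6·N·D approximation:

```python
import numpy as np

# Parametric loss surface; constants are the paper's reported fit,
# treated here as illustrative rather than authoritative.
E, A, alpha, B, beta = 1.69, 406.4, 0.34, 410.7, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C):
    """Minimise loss(N, D) subject to 6 * N * D = C by sweeping N."""
    N = np.logspace(7, 12, 20000)   # candidate model sizes, 1e7..1e12
    D = C / (6 * N)                 # tokens implied by the budget
    i = int(np.argmin(loss(N, D)))
    return N[i], D[i]

N1, D1 = optimal_allocation(1e21)
N2, D2 = optimal_allocation(4e21)
# With alpha ≈ beta, quadrupling C roughly doubles both N_opt and D_opt.
```

The sweep traces the budget constraint directly, so every candidate (N, D) pair costs exactly C FLOPs under the 6·N·D estimate.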

## 1.2. Parameter/Tokens Allocation Results

• Three approaches are used to estimate the optimal parameter/training-token allocation.
• Approach 1: Fix model sizes and vary number of training tokens.
• Approach 2: IsoFLOP profiles: The model size is varied for a fixed set of 9 different training FLOP counts.
• Approach 3: Fitting a parametric loss function: All final losses from experiments in Approach 1 & 2 are modelled as a parametric function of model parameter count and the number of seen tokens.
• The predictions from the three approaches are overlaid, along with projections from Kaplan et al. (2020).
• It is found that all three methods predict that current large models should be substantially smaller and therefore trained much longer than is currently done.
• Based on the estimated compute-optimal frontier, for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being trained on 4 times more tokens.
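Approach 3 can be illustrated on synthetic data: fit the parametric form 𝐿(𝑁,𝐷) = 𝐸 + 𝐴/𝑁^α + 𝐵/𝐷^β to (N, D, loss) points. SciPy's `curve_fit` is used here as a stand-in for the paper's Huber-loss fitting procedure, and the data are generated (with the paper's reported constants), not the actual training runs:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss(ND, E, A, alpha, B, beta):
    """Parametric loss form of Approach 3: L = E + A/N^alpha + B/D^beta."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) points standing in for the 400+ training runs;
# the generating constants are the paper's reported fit.
rng = np.random.default_rng(0)
N = rng.uniform(1e7, 1.6e10, 200)
D = rng.uniform(5e9, 4e11, 200)
L = loss((N, D), 1.69, 406.4, 0.34, 410.7, 0.28)

params, _ = curve_fit(loss, (N, D), L,
                      p0=[2.0, 100.0, 0.3, 100.0, 0.3], maxfev=20000)
# With noiseless data and a reasonable p0, the fitted parameters
# reproduce the generated losses closely.
```

Once (𝐸, 𝐴, α, 𝐵, β) are fitted, the compute-optimal frontier follows analytically by minimising the fitted loss under the FLOPs constraint.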

## 1.3. Chinchilla

A more compute-optimal 70B model, called Chinchilla, is trained on 1.4 trillion tokens. Not only does Chinchilla outperform its much larger counterpart, Gopher, but its reduced model size reduces inference cost considerably.
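The two training runs sit at a comparable FLOP budget, which is what makes the comparison a test of allocation rather than raw compute. A back-of-the-envelope check using the common C ≈ 6·N·D estimate (Gopher's 300B-token run is as reported in its own paper):

```python
def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

gopher = approx_train_flops(280e9, 300e9)      # 280B params, 300B tokens
chinchilla = approx_train_flops(70e9, 1.4e12)  # 70B params, 1.4T tokens
# Both land around 5e23 FLOPs; Chinchilla spends ~17% more compute,
# on a model 4x smaller trained on ~4.7x more tokens.
ratio = chinchilla / gopher   # ≈ 1.17
```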

# 2. Chinchilla Results

## 2.1. Performance on Evaluation Tasks

On all language modelling subsets evaluated, Chinchilla outperforms Gopher.

On the MMLU benchmark, Chinchilla significantly outperforms Gopher despite being much smaller, with an average accuracy of 67.6% (improving upon Gopher by 7.6%).

Chinchilla outperforms Gopher by 7.6% on average, performing better on 51/57 individual tasks, the same on 2/57, and worse on only 4/57 tasks.

On RACE-h and RACE-m, Chinchilla considerably improves performance over Gopher. On LAMBADA, Chinchilla outperforms both Gopher and MT-NLG 530B.

Chinchilla outperforms Gopher on all but four BIG-bench tasks considered.