Brief Review — Chinchilla: Training Compute-Optimal Large Language Models
- The optimal model size and number of training tokens for a Transformer language model under a given compute budget are investigated by training over 400 language models, ranging from 70 million to over 16 billion parameters, on 5 to 500 billion tokens.
- It is found that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
- (It is quite interesting that some model names refer to animal species, such as Jurassic-1, Gopher, and Flamingo.)
- Estimating the optimal parameter/training tokens allocation & Chinchilla
- Chinchilla Results
1. Estimating the optimal parameter/training tokens allocation & Chinchilla
Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?
- To answer this question, the final pre-training loss 𝐿(𝑁,𝐷) is modelled as a function of the number of model parameters 𝑁, and the number of training tokens 𝐷.
- Since the computational budget 𝐶 is a deterministic function FLOPs(𝑁,𝐷) of the number of seen training tokens and model parameters, we are interested in minimizing 𝐿 under the constraint FLOPs(𝑁,𝐷)=𝐶:
- 𝑁𝑜𝑝𝑡(𝐶), 𝐷𝑜𝑝𝑡(𝐶) = argmin over {𝑁,𝐷 : FLOPs(𝑁,𝐷)=𝐶} of 𝐿(𝑁,𝐷)
- where the functions 𝑁𝑜𝑝𝑡(𝐶) and 𝐷𝑜𝑝𝑡(𝐶) describe the optimal allocation of a computational budget 𝐶.
These functions are empirically estimated based on the losses of over 400 models, ranging from 70M to over 16B parameters, trained on 5B to over 400B tokens.
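To make the objective concrete, here is a minimal numerical sketch of this constrained minimization. It assumes the parametric loss form and fitted constants reported in the paper (𝐸 ≈ 1.69, 𝐴 ≈ 406.4, 𝐵 ≈ 410.7, α ≈ 0.34, β ≈ 0.28) together with the common FLOPs(𝑁,𝐷) ≈ 6𝑁𝐷 approximation; the grid search is purely illustrative, not the paper's fitting procedure:

```python
# Parametric loss from the Chinchilla paper's Approach 3
# (fitted constants as reported there; treat them as approximate).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / N**alpha + B / D**beta

def optimal_allocation(C, n_grid=20000):
    """Grid-search the N that minimizes the loss subject to 6*N*D = C."""
    best = None
    for i in range(1, n_grid):
        # Log-spaced model sizes from 1e6 to 1e13 parameters.
        N = 10 ** (6 + 7 * i / n_grid)
        D = C / (6 * N)  # tokens implied by the fixed budget
        l = loss(N, D)
        if best is None or l < best[0]:
            best = (l, N, D)
    return best

# Gopher-scale budget: roughly 6 * 280e9 params * 300e9 tokens FLOPs.
l, N, D = optimal_allocation(5.76e23)
```

For the Gopher-scale budget this lands on tens of billions of parameters with far more training tokens than Gopher actually used, in line with the paper's conclusion that tokens were the under-scaled quantity.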
1.2. Parameter/Tokens Allocation Results
- Three approaches are used to estimate the optimal parameter/training tokens allocation.
- Approach 1: Fix model sizes and vary the number of training tokens.
- Approach 2: IsoFLOP profiles: the model size is varied for a fixed set of 9 different training FLOP counts.
- Approach 3: Fitting a parametric loss function: all final losses from the experiments in Approaches 1 & 2 are modelled as a parametric function of the model parameter count and the number of seen tokens.
- (Please feel free to read the paper directly for more details about the 3 approaches.)
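Assuming Approach 3's parametric form 𝐿(𝑁,𝐷) = 𝐸 + 𝐴/𝑁^α + 𝐵/𝐷^β and the 6𝑁𝐷 cost approximation, substituting 𝐷 = 𝐶/(6𝑁) and setting the derivative with respect to 𝑁 to zero gives closed-form power laws for the optimal allocation. A quick sketch of the resulting exponents (using the paper's approximate fitted values of α and β):

```python
# Closed-form optimum of L(N, D) = E + A/N^alpha + B/D^beta subject to
# 6*N*D = C: substituting D = C/(6N) and setting dL/dN = 0 yields
# N_opt ∝ C^(beta/(alpha+beta)) and D_opt ∝ C^(alpha/(alpha+beta)).
alpha, beta = 0.34, 0.28  # exponents fitted in the paper (approximate)

a = beta / (alpha + beta)   # exponent of N_opt(C)
b = alpha / (alpha + beta)  # exponent of D_opt(C)

# How much model size and token count grow for a 10x larger budget:
growth_N = 10 ** a
growth_D = 10 ** b
```

Both exponents come out close to 0.5, which is the source of the "scale parameters and tokens equally" conclusion: a 10x compute budget buys roughly 3x more parameters and 3x more tokens.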
- The predictions from the three approaches are overlaid, along with the projections from Kaplan et al. (2020), as above.
- It is found that all three methods predict that current large models should be substantially smaller and therefore trained much longer than is currently done.
- Based on the estimated compute-optimal frontier, for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being trained on 4 times more tokens.
A more compute-optimal 70B model, called Chinchilla, is trained on 1.4 trillion tokens. Not only does Chinchilla outperform its much larger counterpart, Gopher, but its reduced model size reduces inference cost considerably.
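A quick sanity check of this reallocation under the common 𝐶 ≈ 6𝑁𝐷 training-cost approximation, using the model and token counts stated in the paper (Gopher: 280B parameters on 300B tokens; Chinchilla: 70B parameters on 1.4T tokens):

```python
def flops(n_params, n_tokens):
    """Standard approximate training cost for a dense Transformer: C ~ 6*N*D."""
    return 6 * n_params * n_tokens

gopher = flops(280e9, 300e9)      # 280B params, 300B tokens
chinchilla = flops(70e9, 1.4e12)  # 70B params, 1.4T tokens

# The two budgets are the same order of magnitude: Chinchilla reallocates
# compute from parameters to tokens rather than spending more overall.
ratio = chinchilla / gopher
```

The two budgets agree to within roughly 17%: Chinchilla spends about the same compute as Gopher, just shifted from parameters to data, which is why its inference cost drops while its quality improves.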
2.1. Evaluation Tasks
2.2. Performance of Some Evaluation Tasks
On all evaluation subsets of The Pile, Chinchilla outperforms Gopher.
On MMLU, Chinchilla outperforms Gopher by 7.6% on average, performing better on 51 of 57 individual tasks, the same on 2, and worse on only 4.
Chinchilla outperforms Gopher on all but four BIG-bench tasks considered.
(For other tasks, please feel free to read the paper directly.)