Review — BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model,
BLOOM, by BigScience,
2022 arXiv v2, Over 70 Citations (Sik-Ho Tsang @ Medium)
Large Language Model, LLM, Neural Machine Translation, NMT
1991 … 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] 2023 [GPT-4]
2013 … 2021 [ResMLP] [GPKD] [Roformer] [DeLighT] 2022 [DeepNet] [PaLM]
- BLOOM is proposed, which is a 176B-parameter open-access language model designed and built thanks to BigScience, a collaboration of hundreds of researchers.
- It is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).
- It has 62 pages in total.
- BLOOM ROOTS Training & xP3 Prompted Datasets
- BLOOM Model & Training
1. BLOOM ROOTS Training & xP3 Prompted Datasets
- BLOOM’s development was coordinated by BigScience, an open research collaboration whose goal was the public release of an LLM.
- Over 1200 people were registered as participants in BigScience.
1.2. ROOTS: Training Dataset
- The motivation behind the ROOTS corpus is to build a language model that is accessible to as many people as possible around the world, while keeping the dataset size comparable to previous efforts.
- Left: A treemap plot of the language families of all 46 natural languages where surface is proportional to the number of bytes. Indo-European and Sino-Tibetan families overwhelm the plot with a combined total of 1321.89 GB. The thin orange surface represents 18GB of Indonesian data and the green rectangle 0.4GB constituting the Niger-Congo language family subset.
- Right: A waffle plot of the distribution of the 13 programming languages by number of files, where one square represents approximately 30,000 files.
BLOOM is trained using ROOTS.
1.3. xP3: Prompted Dataset
- Multitask prompted finetuning (also referred to as instruction tuning) involves finetuning a pretrained language model on a training mixture composed of a large set of different tasks specified through natural language prompts.
- The original P3 dataset is extended to include new datasets in languages other than English and new tasks, such as translation. This resulted in xP3, a collection of prompts for 83 datasets covering 46 languages and 16 tasks.
After pretraining BLOOM, the massively multitask finetuning recipe is applied to equip BLOOM with multilingual zero-shot task generalization abilities. The resulting models are called BLOOMZ.
2. BLOOM Model & Training
2.1. Model Architecture
- A causal decoder-only Transformer model is used with two architectural deviations.
- ALiBi Positional Embeddings are used, which directly attenuate the attention scores based on how far apart the keys and queries are. This leads to smoother training and better downstream performance compared with the original Transformer and Rotary embeddings.
- Embedding Layer Norm is used immediately after the first embedding layer to avoid training instabilities.
- A vocabulary of 250k tokens is used. Byte-level BPE is used. This way, tokenization never results in unknown tokens.
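To make the ALiBi idea concrete, here is a minimal sketch in plain Python. It is not BLOOM's actual implementation: the function names `alibi_slopes` and `alibi_bias` are hypothetical, the slope recipe (a geometric sequence starting at 2^(-8/n) for n heads) follows the basic ALiBi formulation and assumes the number of heads is a power of two, and folding the causal mask into the same matrix is my own simplification.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: geometric sequence starting at 2^(-8/num_heads).

    Assumption: num_heads is a power of two (the basic ALiBi recipe).
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Build bias[h][i][j], to be added to the query-key score before softmax.

    The penalty grows linearly with the query-key distance (i - j).
    Future positions (j > i) get -inf, which acts as the causal mask.
    """
    bias = []
    for slope in alibi_slopes(num_heads):
        head = []
        for i in range(seq_len):        # query position
            row = []
            for j in range(seq_len):    # key position
                if j <= i:
                    row.append(-slope * (i - j))
                else:
                    row.append(float("-inf"))
            head.append(row)
        bias.append(head)
    return bias


if __name__ == "__main__":
    for row in alibi_bias(num_heads=8, seq_len=4)[0]:
        print(row)
```

Each head uses a different slope, so some heads focus on nearby tokens while others attend more evenly over the context. No position vectors are added to the token embeddings at all.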
- BLOOM was trained using Megatron-DeepSpeed, which consists of two parts: Megatron-LM provides the Transformer implementation, tensor parallelism, and data loading primitives, whereas DeepSpeed provides the ZeRO optimizer, model pipelining, and general distributed training components.
- Data parallelism (DP) replicates the model multiple times, with each replica placed on a different device and fed a slice of the data. The processing is done in parallel and all model replicas are synchronized at the end of each training step.
- Tensor parallelism (TP) partitions individual layers of the model across multiple devices. This way, instead of having the whole activation or gradient tensor reside on a single GPU, shards of this tensor are placed on separate GPUs.
- Pipeline parallelism (PP) splits up the model’s layers across multiple GPUs, so that only a fraction of the layers of the model are placed on each GPU.
- bfloat16 mixed precision is used. Fused CUDA kernels are used.
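As a toy illustration of the tensor parallelism described above, the sketch below splits the weight matrix of one linear layer column-wise across simulated "devices" and checks that concatenating the partial outputs reproduces the full result. This is only a single-process sketch under my own assumptions: the helper names (`matmul`, `split_columns`, `tensor_parallel_linear`) are hypothetical, nested lists stand in for GPU tensors, and it is not how Megatron-LM is actually written.

```python
def matmul(x, w):
    """Plain matrix product of x (n x k) and w (k x m), as nested lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))]
            for i in range(len(x))]


def split_columns(w, num_devices):
    """Split w column-wise into num_devices equally sized shards."""
    cols = len(w[0])
    assert cols % num_devices == 0, "columns must divide evenly"
    step = cols // num_devices
    return [[row[d * step:(d + 1) * step] for row in w]
            for d in range(num_devices)]


def tensor_parallel_linear(x, w, num_devices):
    """Each simulated device multiplies x by its own shard of w.

    The final concatenation along the feature dimension plays the role
    of the gather step that real devices would perform over the network.
    """
    shards = split_columns(w, num_devices)
    partial = [matmul(x, shard) for shard in shards]  # one result per device
    return [sum((p[i] for p in partial), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(tensor_parallel_linear(x, w, num_devices=2))
    print(matmul(x, w))
```

Each device only ever holds its own shard of the weights and of the output, which is what lets a layer that is too large for one GPU still be computed.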
2.3. Model Variants
- 6 model variants are created, as shown above.
- The energy consumption of BLOOM is slightly higher than OPT, yet BLOOM’s emissions are approximately 2/3 less (25 tons versus 70 tons). This is thanks to the low carbon intensity of the energy grid used for training BLOOM, which emits 57 gCO2eq/kWh.
- Both BLOOM and OPT incurred significantly less carbon emissions than GPT-3, which can be attributed to several factors including more efficient hardware as well as less carbon-intensive energy sources.
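The emission figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers from this review (25 tons, 70 tons, 57 gCO2eq/kWh). The helper names are hypothetical, and the back-calculated energy assumes emissions equal energy times grid carbon intensity, so treat the result as a rough estimate rather than the paper's reported value.

```python
def relative_reduction(new_value, old_value):
    """Fraction by which new_value is lower than old_value."""
    return 1.0 - new_value / old_value


def implied_energy_mwh(emissions_tonnes, intensity_g_per_kwh):
    """Energy implied by emissions = energy * carbon intensity.

    Assumption: emissions come only from grid electricity at a fixed intensity.
    """
    grams = emissions_tonnes * 1_000_000    # tonnes -> grams
    kwh = grams / intensity_g_per_kwh       # grams / (grams per kWh)
    return kwh / 1000.0                     # kWh -> MWh


if __name__ == "__main__":
    print(f"BLOOM vs OPT reduction: {relative_reduction(25.0, 70.0):.1%}")
    print(f"Implied BLOOM energy: {implied_energy_mwh(25.0, 57.0):.0f} MWh")
```

The reduction comes out at about 64%, which matches the "approximately 2/3 less" statement, and the implied training energy is on the order of a few hundred MWh.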
- Prompts were developed prior to BLOOM’s release, and did not undergo any a priori refinement.
- Some prompt examples for machine translation (MT) are illustrated.
3.1. Zero-Shot Performance
The average performance across prompts always hovers around chance.
- The exception is the T0 model, which shows strong performance. However, this model is fine-tuned in the multitask setting, so it cannot be directly compared with the other models.
In the zero-shot setting, MT results are generally very poor. The two major problems observed are (i) over-generation and (ii) not producing the correct language.
3.2. One-Shot Performance
- One-shot performance variability on SuperGLUE is reduced across all prompts and models.
- Overall, there is no notable improvement associated with the one-shot setting: the models' average accuracy is still nearly always at chance.
Both OPT and BLOOM model families improve slightly with scale, and there is no consistent difference between families across all tasks. BLOOM-176B is ahead of OPT-175B on Ax-b, CB and WiC.
The translation quality for many of the low-resource languages is good: comparable to, or even slightly better than, the supervised M2M model.
BLOOM attains higher performance on multilingual summarization than OPT, and performance increases as the parameter count of the model increases.
3.4. Multitask Finetuning
Multilingual multitask finetuning, i.e. BLOOMZ, is used to improve the zero-shot performance of the BLOOM model.
3.5. Code Generation
- The performance of pretrained BLOOM models is similar to that of similar-sized GPT models trained on the Pile.
Yet, the Codex models, which have solely been finetuned on code, are significantly stronger than other models.
BLOOM’s overall prompt accuracy was close to .50, which suggests an overall absence of bias.
- There are many details/settings/experiments that I haven’t mentioned yet; please feel free to read the paper directly if interested.
- As this paper is very long, I read it part by part over a couple of days. During my reading, arXiv v3 was also published.