Brief Review — GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
GLaM, a 1.2T-Parameter Sparse Model Using Mixture-of-Experts (MoE)
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
GLaM, by Google
2022 ICML, Over 260 Citations (Sik-Ho Tsang @ Medium)
Large Language Model (LLM)
- GLaM (Generalist Language Model) is proposed, which uses a sparsely activated mixture-of-experts (MoE) architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
Outline
- GLaM
- Results
1. GLaM
1.1. Model Architecture: Experts
- Sparsely activated Mixture-of-Experts (MoE) is leveraged in GLaM.
The feed-forward component of every other Transformer layer is replaced with an MoE layer, as shown in Figure 2.
- Each MoE layer consists of a collection of independent feed-forward networks as the ‘experts’.
- A gating function then uses a softmax activation function to model a probability distribution over these experts.
In this paper, only the top-2 experts are activated for each input token (a minimal sketch of this top-2 gating is given below).
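Below is a minimal NumPy sketch of this mechanism under simplifying assumptions: a single MoE layer, a softmax gating network, and a per-token loop with no expert capacity limits or gate renormalization (details that GShard-style implementations do include). All names and dimensions are illustrative, not taken from the GLaM code.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the Gaussian Error Linear Unit.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Sparsely activated MoE layer: E independent FFN 'experts', top-2 gating."""
    def __init__(self, num_experts=8, d_model=16, d_hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w_gate = rng.normal(0.0, 0.02, (d_model, num_experts))          # gating network
        self.w_in = rng.normal(0.0, 0.02, (num_experts, d_model, d_hidden))  # expert FFN, first projection
        self.w_out = rng.normal(0.0, 0.02, (num_experts, d_hidden, d_model)) # expert FFN, second projection

    def __call__(self, x):
        # x: [num_tokens, d_model]
        gate_probs = softmax(x @ self.w_gate)           # distribution over experts, [tokens, E]
        top2 = np.argsort(gate_probs, axis=-1)[:, -2:]  # indices of the 2 best experts per token
        y = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in top2[t]:
                h = gelu(x[t] @ self.w_in[e])           # only the 2 selected experts do any work
                y[t] += gate_probs[t, e] * (h @ self.w_out[e])
        return y, top2

# Toy usage: 4 tokens, d_model=16.
tokens = np.random.default_rng(1).normal(size=(4, 16))
out, chosen = MoELayer()(tokens)
print(out.shape, chosen)  # (4, 16) and the two experts chosen for each token
```

Because only 2 of the E experts run for each token, the per-token compute stays close to that of a single dense feed-forward layer even as E (and hence the total parameter count) grows.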
1.2. Additional Modifications
- The standard positional embedding is replaced with the per-layer relative positional bias from Transformer-XL.
- In the non-MoE Transformer feed-forward sub-layers, the first linear projection and the activation function are replaced with the Gated Linear Unit (GLU), which computes the component-wise product of two linear transformations of the input, followed by a Gaussian Error Linear Unit (GELU) activation function (see the sketch after this list).
- The weights and computation of large GLaM models are partitioned using the 2D sharding algorithm as described in Xu et al. (2021).
- On top of the standard cross-entropy loss, the MoE auxiliary loss is added as described in GShard (Lepikhin et al., 2021) with a 0.01 coefficient to encourage expert load balancing (a sketch is also given after this list).
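As referenced above, the GLU-based feed-forward sub-layer can be sketched as follows. This uses a common GEGLU-style formulation consistent with the description above, not the paper's code; the weight names and shapes are illustrative.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def glu_ffn(x, w_a, w_b, w_out):
    """Dense feed-forward with a Gated Linear Unit: the component-wise product of two
    linear projections of x (one branch passed through GELU), then an output projection."""
    return (gelu(x @ w_a) * (x @ w_b)) @ w_out

# Toy shapes: d_model=16, d_hidden=64.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
w_a = rng.normal(0.0, 0.02, (16, 64))
w_b = rng.normal(0.0, 0.02, (16, 64))
w_out = rng.normal(0.0, 0.02, (64, 16))
print(glu_ffn(x, w_a, w_b, w_out).shape)  # (4, 16)
```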
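The auxiliary load-balancing term can likewise be sketched in a few lines. This follows the common GShard/Switch-style form (the fraction of tokens routed to each expert times that expert's mean gate probability, summed over experts); GShard's exact normalization differs in details, so treat this as an approximation with illustrative names.

```python
import numpy as np

def load_balancing_loss(gate_probs, expert_choice, num_experts, coeff=0.01):
    """Auxiliary loss that is small only when tokens are spread evenly across experts.
    gate_probs: [tokens, E] softmax gate outputs; expert_choice: [tokens] routed expert ids."""
    tokens = gate_probs.shape[0]
    frac_routed = np.bincount(expert_choice, minlength=num_experts) / tokens  # f_e
    mean_prob = gate_probs.mean(axis=0)                                       # p_e
    return coeff * num_experts * float(np.sum(frac_routed * mean_prob))

# Toy usage: 32 tokens, 8 experts, routing decisions taken from the gate itself.
rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(load_balancing_loss(probs, probs.argmax(axis=-1), num_experts=8))
# The resulting scalar is added on top of the standard cross-entropy loss.
```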
1.3. Training Data
A high-quality dataset of 1.6 trillion tokens is built that is representative of a wide range of natural language use cases.
1.4. Model Size Comparisons
GLaM models are designed at different scales, ranging from 130 million to 1.2 trillion parameters.
- E is the number of experts in the MoE layer, B is the mini-batch size, S is the input sequence length, M is the model and embedding dimension, H is the hidden dimension of the feed-forward network, L is the number of layers and N is the number of total devices.
- Additionally, n_params is the total number of trainable model parameters, and n_act-params is the number of activated model parameters per input token (a rough counting example is given below).
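As a rough illustration of the gap between n_params and n_act-params: with top-2 routing, only 2 of the E experts in each MoE layer run for a given token, so the activated parameter count grows far more slowly than the total. The configuration and counting below are hypothetical, cover only feed-forward weights (attention, embedding, and gating parameters are ignored), and do not correspond to any of the paper's model sizes.

```python
# Hypothetical configuration (illustrative only, not a GLaM model size).
E = 64       # experts per MoE layer
L = 32       # Transformer layers; every other layer is an MoE layer
M = 4096     # model / embedding dimension
H = 16384    # FFN hidden dimension

ffn_params = 2 * M * H                 # one FFN (or one expert): input + output projections
moe_layers = L // 2
dense_layers = L - moe_layers

total_ffn = moe_layers * E * ffn_params + dense_layers * ffn_params
activated_ffn = moe_layers * 2 * ffn_params + dense_layers * ffn_params  # top-2 of E experts

print(f"total FFN params:     {total_ffn / 1e9:.1f}B")      # ~139.6B
print(f"activated FFN params: {activated_ffn / 1e9:.1f}B")  # ~6.4B
```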
2. Results
2.1. GLaM vs GPT-3
On zero, one and few-shot learning, GLaM compares favorably to GPT-3 (175B), with significantly improved learning efficiency across 29 public NLP benchmarks.
- Thanks to the sparsely activated architecture and the efficient implementation of the model parallelism algorithm, the total energy consumption during training is only one third of GPT-3’s.
2.2. Sparse vs Dense
GLaM MoE models perform consistently better than GLaM dense models for similar effective FLOPs per token.
2.3. Data Efficiency & Computational Efficiency
GLaM MoE models require significantly less data than dense models of comparable FLOPs to achieve similar zero, one, and few-shot performance.
Training sparsely activated models requires much less computation than training dense models.