Brief Review — GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

GLaM, A 1.2T-Model-Size Sparse Model, Using Mixture-of-Experts (MoE)

Sik-Ho Tsang
4 min read · Jan 23, 2024
GLaM, Outperforms GPT-3

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
GLaM, by Google
2022 ICML, Over 260 Citations (Sik-Ho Tsang @ Medium)

Large Language Model (LLM)
2020 … 2023
[GPT-4] [LLaMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2] [Llama 2] [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [Flan 2022, Flan-T5]

  • GLaM (Generalist Language Model) is proposed, which uses a sparsely activated mixture-of-experts (MoE) architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.

Outline

  1. GLaM
  2. Results

1. GLaM

1.1. Model Architecture

GLaM: Model Architecture

1.1. Experts

The feed-forward component of every other Transformer layer is replaced with an MoE layer, as shown in Figure 2.

  • Each MoE layer consists of a collection of independent feed-forward networks as the ‘experts’.
  • A gating function then uses a softmax activation function to model a probability distribution over these experts.

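In this paper, only the top 2 experts, ranked by gating probability, are activated for each input token, and the token's output is the weighted combination of those two experts' outputs.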
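To make the routing concrete, here is a minimal sketch of top-2 gating for a single token, in plain Python. It is only an illustration under stated assumptions, not the paper's implementation: the names `top2_moe`, `gate_weights`, and `experts` are hypothetical, the toy experts simply scale their input, and renormalizing the two selected gate values so they sum to 1 is my assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(token, gate_weights, experts):
    """Route one token through its 2 highest-probability experts.

    token:        list of floats (the token representation)
    gate_weights: one row of floats per expert (hypothetical gating matrix)
    experts:      list of callables, each mapping a vector to a vector
    """
    # Gating: one logit per expert, then a softmax over the experts.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)

    # Keep only the two experts with the highest gate probability.
    top2 = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:2]

    # Assumption: renormalize the two selected gate values to sum to 1.
    norm = sum(probs[e] for e in top2)

    # Weighted combination of the two selected experts' outputs.
    output = [0.0] * len(token)
    for e in top2:
        weight = probs[e] / norm
        expert_out = experts[e](token)
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return top2, output


# Toy setup: 4 experts that scale the input by 1, 2, 3 and 4.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
chosen, out = top2_moe([2.0, 1.0], gate_weights, experts)
```

With this toy input the gate logits are 2, 1, -2 and -1, so experts 0 and 1 are selected and the other two experts are never evaluated, which is where the compute saving comes from.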

1.2. Additional Modifications

  • The standard positional embedding is replaced with the per-layer relative positional bias from Transformer-XL.
  • In the non-MoE Transformer feed-forward sub-layers, the first linear projection and the activation function are replaced with the Gated Linear Unit (GLU), which computes the component-wise product of two linear transformations of the input, followed by a Gaussian Error Linear Unit (GELU) activation function.
  • The weights and computation of large GLaM models are partitioned using the 2D sharding algorithm as described in Xu et al. (2021).
  • On top of the standard cross-entropy loss, the MoE auxiliary loss is added as described in GShard (Lepikhin et al., 2021) with a 0.01 coefficient to encourage expert load balancing.
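The load-balancing term in the last bullet can be sketched as follows. This is a rough illustration of a GShard-style auxiliary loss, not the authors' code: the exact normalization is my assumption, and `load_balancing_loss` and `total_loss` are hypothetical names. Only the 0.01 coefficient comes from the text above.

```python
AUX_COEFF = 0.01  # coefficient stated in the text


def load_balancing_loss(gate_probs, num_experts):
    """Assumed GShard-style auxiliary loss over a group of tokens.

    gate_probs: one softmax distribution over experts per token.
    For each expert e, multiply the fraction of tokens whose top-1
    expert is e by the mean gate probability given to e, then
    average over experts.
    """
    num_tokens = len(gate_probs)
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in gate_probs:
        top1 = max(range(num_experts), key=lambda e: probs[e])
        counts[top1] += 1
        for e in range(num_experts):
            mean_prob[e] += probs[e] / num_tokens
    per_expert = [
        (counts[e] / num_tokens) * mean_prob[e] for e in range(num_experts)
    ]
    return sum(per_expert) / num_experts


def total_loss(cross_entropy, gate_probs, num_experts):
    """Cross-entropy plus the weighted auxiliary term."""
    return cross_entropy + AUX_COEFF * load_balancing_loss(gate_probs, num_experts)


# Two tokens, two experts: balanced routing vs. everything sent to expert 0.
balanced = [[0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1], [0.9, 0.1]]
loss_balanced = load_balancing_loss(balanced, 2)
loss_collapsed = load_balancing_loss(collapsed, 2)
```

In this toy case the collapsed routing gets a larger auxiliary loss than the balanced one, so minimizing the term pushes the gate to spread tokens across experts.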

1.3. Training Data

Training Data

A high-quality dataset of 1.6 trillion tokens is built that is representative of a wide range of natural language use cases.

1.4. Model Size Comparisons

Model Size Comparisons
Different Scale GLaM Models

Different scale GLaM models are designed ranging from 130 million parameters to 1.2 trillion parameters.

  • E is the number of experts in the MoE layer, B is the mini-batch size, S is the input sequence length, M is the model and embedding dimension, H is the hidden dimension of the feed-forward network, L is the number of layers and N is the number of total devices.
  • Additionally, n_params is the total number of trainable model parameters, and n_act-params is the number of activated model parameters per input token.
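The gap between the total and the activated parameter count can be illustrated with a back-of-the-envelope calculation for the feed-forward part of one MoE layer. This is a simplified sketch under my own assumptions: each expert is treated as two weight matrices (M×H and H×M), and biases, the gating network, attention and embeddings are ignored. The function name and the example values of E, M and H are hypothetical and are not taken from the paper's table.

```python
def moe_ffn_param_counts(E, M, H, k=2):
    """Rough FFN parameter counts for a single MoE layer.

    E: number of experts, M: model dimension, H: FFN hidden dimension,
    k: number of experts activated per token (2 in GLaM).
    Returns (total parameters, parameters activated per token).
    """
    per_expert = 2 * M * H      # M x H projection plus H x M projection
    total = E * per_expert      # every expert is stored
    activated = k * per_expert  # only k experts run for a given token
    return total, activated


# Hypothetical configuration, chosen only for illustration.
total, activated = moe_ffn_param_counts(E=64, M=1024, H=4096)
ratio = total // activated
```

Under these assumptions the layer stores E/k times more feed-forward parameters than it uses for any single token, which is why the total size can grow with E while the per-token compute stays roughly fixed.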

2. Results

2.1. GLaM vs GPT-3

On zero-, one-, and few-shot learning, GLaM compares favorably to GPT-3 (175B), with significantly improved learning efficiency across 29 public NLP benchmarks.

  • Thanks to the sparsely activated architecture and the efficient implementation of the model parallelism algorithm, the total energy consumption during training is only one third of GPT-3’s.

2.2. Sparse vs Dense

Sparse vs Dense

GLaM MoE models perform consistently better than GLaM dense models for similar effective FLOPs per token.

2.3. Data Efficiency & Computational Efficiency

Data Efficiency & Computational Efficiency

GLaM MoE models require significantly less data than dense models of comparable FLOPs to achieve similar zero-, one-, and few-shot performance.

Training sparsely activated models takes far fewer computational resources than training dense models.
