# Brief Review — GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

## GLaM, a 1.2T-Parameter Sparse Model Using Mixture-of-Experts (MoE)

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, by Google

GLaM, 2022 ICML, Over 260 Citations (Sik-Ho Tsang @ Medium)

Large Language Model (LLM): [GPT-4] [LLaMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2] [Llama 2] [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [Flan 2022, Flan-T5]


==== My Other Paper Readings Are Also Over Here ====

**GLaM (Generalist Language Model)** is proposed, which **uses a sparsely activated mixture-of-experts (MoE) architecture** to scale the model capacity while also incurring **substantially less training cost compared to dense variants.**

# Outline

1. GLaM
2. Results

# 1. GLaM

## 1.1. Model Architecture: Experts

**Sparsely Activated Mixture-of-Experts (MoE)**

The feed-forward component of every other Transformer layer is replaced with an MoE layer, as shown in Figure 2.

**Each MoE layer** consists of **a collection of independent feed-forward networks as the ‘experts’.** **A gating function** then uses a **softmax** activation function to **model a probability distribution over these experts.**

In this paper, **only the best 2 experts are activated** per input token.
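The routing mechanism can be illustrated with a minimal sketch. This is not the paper's implementation (GLaM builds on GShard/GSPMD with expert capacity limits and expert parallelism); it is a hedged NumPy illustration of softmax gating with top-2 expert selection, where all shapes, weight names, and the ReLU expert activation are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, gate_w, expert_w1, expert_w2, top_k=2):
    """Minimal top-2 gated MoE feed-forward layer (illustrative sketch only).

    x:         [tokens, M]   token representations
    gate_w:    [M, E]        gating projection
    expert_w1: [E, M, H]     first linear layer of each expert FFN
    expert_w2: [E, H, M]     second linear layer of each expert FFN
    """
    # Gating: softmax probability distribution over the E experts for every token.
    gate_probs = softmax(x @ gate_w)                        # [tokens, E]
    # Keep only the best `top_k` experts per token (top-2 in GLaM).
    top_idx = np.argsort(-gate_probs, axis=-1)[:, :top_k]   # [tokens, top_k]

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Renormalize the selected gate weights so they sum to 1.
        w = gate_probs[t, top_idx[t]]
        w = w / w.sum()
        for k, e in enumerate(top_idx[t]):
            h = np.maximum(x[t] @ expert_w1[e], 0.0)        # expert FFN (ReLU used here for brevity)
            out[t] += w[k] * (h @ expert_w2[e])
    return out, gate_probs
```

Because each token only ever touches two experts, the per-token compute stays roughly constant as the number of experts (and hence total parameter count) grows.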

## 1.2. Additional Modifications

- The **standard positional embedding is replaced with the per-layer relative positional bias** from **Transformer-XL**.
- In the non-MoE Transformer feed-forward sub-layers, **the first linear projection and the activation function are replaced with the Gated Linear Unit (GLU)**, which computes the component-wise product of two linear transformations of the input, **followed by a Gaussian Error Linear Unit (GELU)** activation (a sketch follows this list).
- **The weights and computation** of large GLaM models are **partitioned** using the **2D sharding algorithm** as described in Xu et al. (2021).
- On top of the standard **cross-entropy loss**, the **MoE auxiliary loss** is added as described in GShard (Lepikhin et al., 2021) with a 0.01 coefficient to **encourage expert load balancing** (also sketched below).
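Two of these modifications can be made concrete. Below is a hedged NumPy sketch, not the paper's code: (a) a GLU-style feed-forward sub-layer that takes the component-wise product of two linear projections, one of them passed through GELU, and (b) a GShard/Switch-style load-balancing auxiliary loss scaled by a 0.01 coefficient. All tensor names and shapes are assumptions.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the Gaussian Error Linear Unit.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_value, w_out):
    """Feed-forward sub-layer with a GELU-gated linear unit (sketch).

    x:       [tokens, M]
    w_gate:  [M, H]  first linear projection, passed through GELU
    w_value: [M, H]  second linear projection
    w_out:   [H, M]  output projection
    """
    return (gelu(x @ w_gate) * (x @ w_value)) @ w_out

def load_balancing_aux_loss(gate_probs, top1_idx, num_experts, coeff=0.01):
    """Load-balancing auxiliary loss sketch (in the spirit of GShard).

    gate_probs: [tokens, E] softmax gate probabilities
    top1_idx:   [tokens]    index of the top-ranked expert per token
    """
    # Fraction of tokens dispatched to each expert.
    frac_tokens = np.bincount(top1_idx, minlength=num_experts) / len(top1_idx)
    # Mean gate probability assigned to each expert.
    mean_probs = gate_probs.mean(axis=0)
    # The product is smallest when routing is spread evenly across experts.
    return coeff * num_experts * float(frac_tokens @ mean_probs)
```

Paired with the earlier routing sketch, `top1_idx` would simply be `np.argsort(-gate_probs, axis=-1)[:, 0]`, and the resulting scalar is added to the standard cross-entropy loss.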

## 1.3. Training Data

A high-quality dataset of 1.6 trillion tokens is built that is representative of a wide range of natural language use cases.

## 1.4. Model Size Comparisons

GLaM models of different scales are designed, ranging from 130 million parameters to 1.2 trillion parameters.

- *E* is the number of experts in the MoE layer, *B* is the mini-batch size, *S* is the input sequence length, *M* is the model and embedding dimension, *H* is the hidden dimension of the feed-forward network, *L* is the number of layers, and *N* is the number of total devices.
- Additionally, *nparams* is **the total number of trainable model parameters**, and *nact-params* is the **number of activated model parameters per input token.**
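To make the distinction concrete, here is a hedged back-of-the-envelope sketch with made-up dimensions (not GLaM's actual configurations): every expert in an MoE layer counts toward *nparams*, but with top-2 gating only two experts count toward *nact-params* per token.

```python
# Illustrative only: hypothetical dimensions, not the paper's model configurations.
E, M, H, top_k = 64, 8192, 32768, 2

params_per_expert = 2 * M * H                            # two linear layers per expert FFN
moe_layer_total_params = E * params_per_expert           # contributes to nparams
moe_layer_activated_params = top_k * params_per_expert   # contributes to nact-params

print(f"{moe_layer_total_params:,}")      # 34,359,738,368 parameters across all experts
print(f"{moe_layer_activated_params:,}")  # 1,073,741,824 parameters touched per token
```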

# 2. Results

## 2.1. GLaM vs GPT-3

On zero-shot, one-shot, and few-shot learning, **GLaM compares favorably to GPT-3 (175B)**, with significantly improved learning efficiency across 29 public NLP benchmarks.

- Thanks to the sparsely activated architecture and the efficient implementation of the model parallelism algorithm, the **total energy consumption during training is only one third of GPT-3’s.**

## 2.2. Sparse vs Dense

GLaM MoE models perform consistently better than GLaM dense models for similar effective FLOPs per token.

## 2.3. Data Efficiency & Computational Efficiency

GLaM MoE models require significantly less data than dense models of comparable FLOPs to achieve similar zero-shot, one-shot, and few-shot performance.

Training sparsely activated models requires far fewer computational resources than training dense models.