# Brief Review — DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

**DeepSpeed-MoE for Fast Inference & Training**

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, by Microsoft

DeepSpeed-MoE, 2022 ICML, Over 60 Citations (Sik-Ho Tsang @ Medium)



- MoE can reduce training cost. Yet, how to provide **fast MoE model inference** remains **challenging** and **unsolved**.
- **DeepSpeed-MoE** is proposed, which is an end-to-end MoE training and inference solution. It includes novel **MoE architecture designs and model compression techniques** that **reduce MoE model size by up to 3.7×**, and a highly optimized inference system that provides **7.3× better latency and cost** compared to existing MoE inference solutions, and also **up to 4.5× faster and 9× cheaper inference** compared to quality-equivalent dense models.

# Outline

1. **MoE**
2. **Proposed Pyramid-Residual-MoE (PR-MoE)**
3. **Proposed Mixture-of-Student (MoS)**
4. **DeepSpeed-MoE**

**1. MoE**

## 1.1. Model

- A **GPT-3 like** NLG model is used as the base of each **MoE** model, at three sizes: **350M/1.3B/6.7B** (24/24/32 layers, 1024/2048/4096 hidden size, 16/16/32 attention heads).
- e.g.: **"350M+MoE-128"** means a MoE model that uses the **350M dense model** as the base model with **128 experts**.
- **Top-1 gating** is used, i.e. single expert activation.
- **128 NVIDIA Ampere A100 GPUs** are used.
- **300B tokens** are used.
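To make top-1 gating concrete, here is a minimal pure-Python sketch. It assumes a softmax router over per-expert scores; the function name `top1_gate` and the toy scores are illustrative and are not taken from the paper or from the DeepSpeed code base.

```python
import math

def top1_gate(logits):
    """Return (expert_index, gate_probability) for one token.

    `logits` holds one router score per expert (hypothetical values).
    """
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

# Toy router scores for 3 tokens over 4 experts.
router_logits = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [0.0, 0.0, 3.0, 0.0],
]
assignments = [top1_gate(row) for row in router_logits]
for token_id, (expert, prob) in enumerate(assignments):
    print(f"token {token_id} -> expert {expert} (gate weight {prob:.2f})")
```

Each token activates exactly one expert, so the per-token compute stays close to that of the dense base model no matter how many experts exist.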

The validation loss of the MoE models is significantly better than their dense counterparts (e.g., 1.3B+MoE-128 versus 1.3B dense). In addition, MoE models are on par with the validation loss of the dense models with 4–5× larger base (e.g., 1.3B+MoE-128 versus 6.7B dense).

The model quality is also on par in terms of the zero-shot evaluation.

- A **1.3B+MoE-128 model** requires **roughly the same amount of training compute as the 1.3B dense model**, while offering much better model quality.
- Furthermore, the 1.3B+MoE-128 model can achieve the model quality of the 6.7B dense model at the training cost of the 1.3B dense model, resulting in a **5× training compute reduction**.
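The 5× figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the common rule of thumb that training costs about 6 FLOPs per active parameter per token, and that a top-1 MoE activates roughly as many parameters per token as its dense base (gating and communication overhead are ignored). Both are simplifying assumptions, not numbers from the paper.

```python
def train_flops(active_params, tokens):
    # Rule-of-thumb estimate: ~6 FLOPs per active parameter per token.
    return 6 * active_params * tokens

tokens = 300e9                                  # 300B training tokens (from the text)
dense_1_3b = train_flops(1.3e9, tokens)
dense_6_7b = train_flops(6.7e9, tokens)
# Assumption: with top-1 gating, active parameters are about those of the base.
moe_1_3b_128 = train_flops(1.3e9, tokens)

reduction = dense_6_7b / moe_1_3b_128
print(f"compute reduction vs. 6.7B dense: {reduction:.1f}x")
```

Under these assumptions the ratio is simply 6.7 / 1.3, which is where the "about 5×" comes from.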

**2. Proposed Pyramid-Residual-MoE (PR-MoE)**

## 2.1. Do All the Layers Learn the Same Representation?

- This question has been well-studied in **Computer Vision (CV)**: **shallow layers** (close to inputs) learn **general representations** and **deep layers** (close to outputs) learn **more objective-specific representations**.

- **Two different Half-MoE** architectures are studied.
- **First-Half-MoE**: MoE layers are placed in the first half of the model, and the second half of the layers is left identical to the dense model.
- **Second-Half-MoE**: MoE layers are placed in the second half, and the first half is dense.

As can be seen, Second-Half-MoE has significantly better performance than its counterpart. This confirms that not all MoE layers learn the same level of representations. Deeper layers benefit more from a large number of experts.

## 2.2. Is There a Way to Keep the Training/Inference Efficiency While Getting Generalization Performance Gain?

- To improve the generalization performance of MoE models, there are two common methods: (1) **increasing the number of experts, with increased memory**; (2) **using Top-2 expert** selection at the expense of **slightly more computation (33%)**.
- **Two different MoEs** are studied. **Top2-MoE**: **doubling the capacity using the Top-2 experts**, and **Residual-MoE**: **fixing one expert** and varying the second expert across different experts.
- The main intuition is to **treat the expert from the MoE module as an error correction term.**

The generalization performance of these two MoEs is on par with each other. Residual-MoE is more than 10% faster than Top2-MoE due to the communication volume reduction.
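The difference between the two variants can be sketched in a few lines. This is a toy illustration only: the "experts" are hypothetical scaling functions, and combining the fixed MLP and the routed expert by a gate-weighted sum is an assumption, not necessarily how the paper or DeepSpeed combines them.

```python
def make_expert(scale):
    """Hypothetical stand-in for an expert MLP: scales every feature."""
    return lambda x: [scale * v for v in x]

experts = [make_expert(s) for s in (0.5, 1.0, 2.0, 4.0)]
fixed_mlp = make_expert(1.0)          # the always-on dense MLP branch

def top2_moe(x, gate_probs):
    """Route the token to its two highest-scoring experts."""
    ranked = sorted(range(len(gate_probs)), key=gate_probs.__getitem__, reverse=True)
    chosen = ranked[:2]
    out = [0.0] * len(x)
    for e in chosen:
        y = experts[e](x)
        out = [o + gate_probs[e] * v for o, v in zip(out, y)]
    return out, len(chosen)           # two routed experts per token

def residual_moe(x, gate_probs):
    """Fixed MLP plus one routed expert acting as a correction term."""
    e = max(range(len(gate_probs)), key=gate_probs.__getitem__)
    dense = fixed_mlp(x)              # local, needs no routing
    correction = experts[e](x)
    out = [d + gate_probs[e] * c for d, c in zip(dense, correction)]
    return out, 1                     # one routed expert per token

x = [1.0, -2.0]
gate_probs = [0.1, 0.2, 0.6, 0.1]
top2_out, top2_routed = top2_moe(x, gate_probs)
res_out, res_routed = residual_moe(x, gate_probs)
print("routed experts per token:", top2_routed, "vs", res_routed)
```

Both variants run two MLPs per token, but Residual-MoE sends each token to only one remote expert, which is the source of the communication saving.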

## 2.3. PR-MoE

- Thus, the new architecture utilizes **more experts in the last few layers** as compared to previous layers.
- The Residual-MoE architecture is used, where **each token separately passes one fixed MLP module and one chosen expert.**

PR-MoE uses far fewer parameters but achieves accuracy comparable to Standard-MoE models.
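A rough parameter count shows where the saving comes from. The sketch below uses the 1.3B base (24 layers, hidden size 2048) and rests on several assumptions that are not stated in this review: each expert is a standard FFN with about 8·h² weights, MoE replaces the FFN in every other layer, and the pyramid uses 64 experts in the early MoE layers and 128 in the last two. Treat the layout as hypothetical.

```python
HIDDEN = 2048          # hidden size of the 1.3B base model (from the text)
NUM_LAYERS = 24        # layers of the 1.3B base model (from the text)
BASE_PARAMS = 1.3e9

# Assumption: an expert is a standard Transformer FFN (h -> 4h -> h),
# so it holds about 2 * h * 4h = 8 * h^2 weights (biases ignored).
EXPERT_PARAMS = 8 * HIDDEN ** 2

# Assumption: MoE replaces the FFN in every other layer.
NUM_MOE_LAYERS = NUM_LAYERS // 2

def moe_params(experts_per_layer, residual=False):
    """Approximate total parameters for a list of per-layer expert counts."""
    total = BASE_PARAMS
    for n in experts_per_layer:
        # The dense FFN already counted in BASE_PARAMS is replaced by n experts;
        # in the residual variant it is kept as the fixed MLP branch.
        extra = n if residual else n - 1
        total += extra * EXPERT_PARAMS
    return total

standard = moe_params([128] * NUM_MOE_LAYERS)
# Hypothetical pyramid: 64 experts in early MoE layers, 128 in the last two.
pyramid = moe_params([64] * (NUM_MOE_LAYERS - 2) + [128] * 2, residual=True)
print(f"standard: {standard / 1e9:.1f}B, pyramid-residual: {pyramid / 1e9:.1f}B")
```

With these assumptions the standard model lands near the 52B quoted for 1.3B+MoE-128, while the pyramid layout needs substantially fewer parameters.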

**3. Proposed Mixture-of-Student (MoS)**

- A general formulation of the **KD loss** is used to force the MoS to imitate the outputs from the teacher **MoE**.
- The loss measures a **weighted sum** of the **cross-entropy loss between predictions and the given hard label** and the **KL divergence loss between the predictions and the teacher's soft label.**
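Written out, a loss of this shape looks as follows. This is a generic reconstruction from the description, not a copy of the paper's equation: the weight α, the student and teacher parameters θ_S and θ_T, and the direction of the KL term are assumptions chosen for illustration.

$$
\mathcal{L}_{\mathrm{KD}}(x, y)
  = \alpha \, \mathrm{CE}\big(p_{\theta_S}(x),\, y\big)
  + (1 - \alpha)\, \mathrm{KL}\big(p_{\theta_T}(x) \,\|\, p_{\theta_S}(x)\big)
$$

Here $p_{\theta_S}(x)$ is the student's predicted distribution, $p_{\theta_T}(x)$ is the teacher's soft label, and $y$ is the hard label.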

Left: However, while the KD loss improves validation accuracy initially, it begins to hurt accuracy towards the end of training. Because PR-MoE already reduces the capacity compared with the standard MoE, further reducing the depth of the model causes the student to have insufficient capacity, making it fall into the underfitting regime.

- The authors propose to gradually decay the impact of KD, or to stop KD early in the training process.

Right: Stopping KD at 400K steps gives the promised benefit of knowledge distillation: the student model now has a similar validation curve as the teacher.
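The staged schedule can be sketched as a loss that drops its distillation term after a cutoff step. This is a minimal illustration under stated assumptions: the weight `alpha=0.5` and the toy probability vectors are hypothetical, and only the 400K cutoff comes from the text.

```python
import math

def cross_entropy(student_probs, label):
    """Cross-entropy against the hard label."""
    return -math.log(student_probs[label])

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student) over one token's output distribution."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

def staged_kd_loss(student_probs, teacher_probs, label, step,
                   alpha=0.5, kd_stop_step=400_000):
    """Weighted CE + KL before `kd_stop_step`, plain CE afterwards."""
    ce = cross_entropy(student_probs, label)
    if step >= kd_stop_step:
        return ce                     # KD switched off: train on hard labels only
    kl = kl_divergence(teacher_probs, student_probs)
    return alpha * ce + (1 - alpha) * kl

student = [0.7, 0.2, 0.1]             # toy student prediction
teacher = [0.6, 0.3, 0.1]             # toy teacher soft label
early = staged_kd_loss(student, teacher, label=0, step=1_000)
late = staged_kd_loss(student, teacher, label=0, step=500_000)
print(f"early (with KD): {early:.4f}, late (CE only): {late:.4f}")
```

A gradual decay would replace the hard cutoff with a weight on the KL term that shrinks over training steps.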

- As shown in Table 2 above, **MoS via staged KD** achieves an average accuracy of 42.87 and 47.96, **retaining 99.5% and 99.1% performance of the 350M (43.08) and 1.3B (48.37) teacher models despite having 12.5% fewer layers.**

**4. DeepSpeed-MoE**

- DeepSpeed had mainly been studied on dense Transformers such as MT-NLG 530B, but not on MoE models before.
- MoE inference performance depends on **two main factors**: the overall **model size** and the overall achievable **memory bandwidth**.
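Why these two factors matter can be seen from a simple lower bound: if reading the weights dominates, one forward pass cannot be faster than the bytes read divided by the aggregate bandwidth. The helper below is a rough sketch; the function name, the 2 bytes per parameter, and the 1500 GB/s per-GPU bandwidth are illustrative assumptions, not measurements from the paper.

```python
def min_latency_ms(params_read, bytes_per_param, num_gpus, gpu_bandwidth_gb_per_s):
    """Lower bound on one forward pass when weight reads dominate."""
    total_bytes = params_read * bytes_per_param
    aggregate_bandwidth = num_gpus * gpu_bandwidth_gb_per_s * 1e9   # bytes/s
    return 1e3 * total_bytes / aggregate_bandwidth

# Illustrative numbers only: 52B parameters in 16-bit precision.
lat_8 = min_latency_ms(52e9, 2, num_gpus=8, gpu_bandwidth_gb_per_s=1500)
lat_16 = min_latency_ms(52e9, 2, num_gpus=16, gpu_bandwidth_gb_per_s=1500)
lat_small = min_latency_ms(31e9, 2, num_gpus=8, gpu_bandwidth_gb_per_s=1500)
print(f"8 GPUs: {lat_8:.2f} ms, 16 GPUs: {lat_16:.2f} ms, smaller model: {lat_small:.2f} ms")
```

Shrinking the model lowers the numerator and spreading it over more GPUs raises the denominator, which are the two levers the system design targets.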

## 4.1. Expert, Tensor and Data Parallelism

Tensor parallelism, in the form of tensor-slicing (for non-expert parameters) and expert-slicing (for expert parameters), splits individual parameters across multiple GPUs to leverage the aggregate memory bandwidth across GPUs.

To scale the non-expert computation to the same number of GPUs, data parallelism is used at no communication overhead.

## 4.2. Hierarchical All-to-All

Hierarchical tree-based algorithms are often used with communication collectives like all-reduce, broadcast, etc. to reduce the number of communication hops.

This reduces the communication hops from O(p) to O(G + p/G), where G is the number of GPUs in a node and p is the total number of GPU devices.

- The above Figure 6 shows the design overview of this implementation. Despite the **2× increase in communication volume**, this hierarchical implementation allows for **better scaling for small batch sizes.**
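The hop counts are easy to tabulate. The sketch below just evaluates the two expressions from the text for a few cluster sizes; the function names and the choice of 8 GPUs per node are illustrative assumptions.

```python
def flat_hops(p):
    """Flat all-to-all: on the order of p hops."""
    return p

def hierarchical_hops(p, gpus_per_node):
    """Two-step all-to-all: G hops inside the node, p / G hops across nodes."""
    return gpus_per_node + p // gpus_per_node

G = 8   # GPUs per node (assumed)
for p in (16, 64, 128, 256):
    print(f"p={p:3d}: flat={flat_hops(p):3d}, hierarchical={hierarchical_hops(p, G):3d}")
```

The gap widens as p grows, which is why the extra communication volume can still pay off when messages are small and latency-bound.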

## 4.3. Parallelism Coordinated Communication

The data redundancy created by tensor-slicing is leveraged to limit the GPUs that participate in all-to-all. Since each GPU in tensor parallelism contains the same data, the all-to-all communication can be limited within GPUs with the same tensor-parallelism rank.
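As a small illustration of the grouping, the sketch below partitions global ranks by tensor-parallel rank. The rank layout (consecutive global ranks forming one tensor-parallel group) and the function name are assumptions made for the example, not DeepSpeed's actual process-group code.

```python
def all_to_all_groups(world_size, tensor_parallel_degree):
    """Group global ranks by tensor-parallel rank.

    Assumption: consecutive global ranks form one tensor-parallel group,
    so tp_rank = global_rank % tensor_parallel_degree.
    """
    groups = {tp_rank: [] for tp_rank in range(tensor_parallel_degree)}
    for rank in range(world_size):
        groups[rank % tensor_parallel_degree].append(rank)
    return groups

groups = all_to_all_groups(world_size=8, tensor_parallel_degree=2)
for tp_rank, members in groups.items():
    print(f"tensor-parallel rank {tp_rank}: all-to-all among {members}")
```

Each all-to-all now involves `world_size / tensor_parallel_degree` participants instead of all GPUs.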

## 4.4. Kernel Optimizations

For MoE gating functions, sparse data structures are used instead of the commonly used dense representations, which contain a cubic number of zeros and a quadratic number of non-zeros. Thus, the approach reduces the compute complexity from cubic to quadratic.

- e.g.: The gating function is fused into a single kernel, and **a dense token-to-expert mapping table** is used to represent **the assignment from tokens to experts**, greatly **reducing the kernel launch overhead.**
- (If interested, please feel free to read the paper directly.)
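The idea behind the mapping table can be shown on toy data. The sketch compares a one-hot dispatch mask with a compact table of token indices per expert; it is a plain-Python illustration with hypothetical function names, not the fused GPU kernel itself.

```python
def one_hot_dispatch(assignments, num_experts):
    """Dense mask of shape [tokens][experts]: mostly zeros."""
    return [[1 if e == chosen else 0 for e in range(num_experts)]
            for chosen in assignments]

def mapping_table(assignments, num_experts):
    """Compact table: for each expert, the token indices routed to it."""
    table = [[] for _ in range(num_experts)]
    for token, expert in enumerate(assignments):
        table[expert].append(token)
    return table

assignments = [2, 0, 2, 1, 0, 2]      # top-1 expert per token (toy data)
mask = one_hot_dispatch(assignments, num_experts=4)
table = mapping_table(assignments, num_experts=4)

dense_cells = sum(len(row) for row in mask)     # tokens * experts entries
table_cells = sum(len(row) for row in table)    # one entry per token
print(f"one-hot cells: {dense_cells}, mapping-table entries: {table_cells}")
```

The table stores one entry per token regardless of the number of experts, so the work no longer scales with the zeros in the mask.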

## 4.5. Results

Both DeepSpeed-MoE and PyTorch reduce the inference latency as the number of GPUs is increased, as expected, although PyTorch is much slower compared to DeepSpeed-MoE.

DeepSpeed-MoE achieves up to 7.3× reduction in latency while achieving up to 7.3× higher throughput.

DeepSpeed-MoE can (1) reduce the minimum number of GPUs required to perform inference and (2) further improve both latency and throughput.

A 52 billion-parameter DeepSpeed-MoE model (1.3B+MoE-128) is quality-equivalent to a 6.7 billion-parameter dense model.

A 1.5 trillion-parameter MoE model is quality-equivalent to a 175 billion-parameter dense model.