Brief Review — DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
DeepSpeed-MoE for Fast Inference & Training
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
DeepSpeed-MoE, by Microsoft
2022 ICML, Over 60 Citations (Sik-Ho Tsang @ Medium)
Language Model
1991 … 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] [AlexaTM 20B] [OPT] [Switch Transformers] [LaMDA] [LoRA] [Galactica] [WideNet] [MoEBERT] [X-MoE] 2023 [GPT-4] [LLaMA] [LIMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2]
==== My Other Paper Readings Are Also Over Here ====
- MoE can reduce training cost, yet how to provide fast MoE model inference remains challenging and unsolved.
- DeepSpeed-MoE is proposed as an end-to-end MoE training and inference solution. It includes novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7×, and a highly optimized inference system that provides 7.3× better latency and cost, as well as up to 4.5× faster and 9× cheaper inference compared to quality-equivalent dense models.
Outline
- MoE
- Proposed Pyramid-Residual-MoE (PR-MoE)
- Proposed Mixture-of-Students (MoS)
- DeepSpeed-MoE
1. MoE
1.1. Model
- The MoE architecture is applied to GPT-3-like NLG models.
- 350M/1.3B/6.7B dense base models are used (24/24/32 layers, 1024/2048/4096 hidden size, 16/16/32 attention heads), e.g., “350M+MoE-128” denotes an MoE model that uses the 350M dense model as the base with 128 experts.
- Top-1 gating is used, i.e. only a single expert is activated per token (a minimal sketch of such a layer follows this list).
- 128 NVIDIA Ampere A100 GPUs are used.
- 300B tokens are used.
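A minimal, single-GPU sketch of a top-1 gated MoE feed-forward layer is given below, just to make the routing idea concrete. The class names, the absence of a capacity factor and load-balancing loss, and all hyperparameters are my simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One feed-forward expert (a standard Transformer FFN block)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class Top1MoE(nn.Module):
    """Top-1 gated MoE layer: each token is routed to a single expert."""
    def __init__(self, d_model=1024, d_ff=4096, num_experts=128):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # router
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])

    def forward(self, x):                             # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)       # routing probabilities
        top_p, top_e = probs.max(dim=-1)              # top-1 expert id per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):     # python loop for clarity only
            mask = top_e == e
            if mask.any():
                # scale by the gate probability so the router stays differentiable
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```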
The validation loss of the MoE models is significantly better than that of their dense counterparts (e.g., 1.3B+MoE-128 versus 1.3B dense).
In addition, MoE models are on par with the validation loss of the dense models with 4–5× larger base (e.g., 1.3B+MoE-128 versus 6.7B dense).
The model quality is also on par in terms of the zero-shot evaluation.
- A 1.3B+MoE-128 model requires roughly the same amount of training compute as a 1.3B dense model, while offering much better model quality.
- Furthermore, the 1.3B+MoE-128 model can achieve the model quality of the 6.7B dense model at the training cost of a 1.3B-parameter dense model, resulting in a 5× training compute reduction.
2. Proposed Pyramid-Residual-MoE (PR-MoE)
2.1. Do All the Layers Learn the Same Representation?
- This question has been well-studied in Computer Vision (CV): shallow layers (close to inputs) learn general representations and deep layers (close to outputs) learn more objective specific representations.
- Two different Half-MoE architectures are studied.
- First-Half-MoE: MoE layers are placed in the first half of the model, and the second half of the layers is kept identical to the dense model.
- Second-Half-MoE: MoE layers are placed in the second half, with dense layers in the first half.
As can be seen, Second-Half-MoE performs significantly better than its counterpart. This confirms that not all MoE layers learn the same level of representations: deeper layers benefit more from a large number of experts.
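As a toy illustration of these two ablations, the helper below (my own, hypothetical, and ignoring how often MoE layers are interleaved within each half) builds the per-layer pattern of dense vs. MoE feed-forward blocks:

```python
# Toy helper (my own) showing which blocks get an MoE FFN in each ablation.
def half_moe_pattern(num_layers, variant):
    half = num_layers // 2
    if variant == "first_half":     # First-Half-MoE
        return ["moe"] * half + ["dense"] * (num_layers - half)
    if variant == "second_half":    # Second-Half-MoE (the better-performing one)
        return ["dense"] * (num_layers - half) + ["moe"] * half
    raise ValueError(f"unknown variant: {variant}")

print(half_moe_pattern(24, "second_half"))
# ['dense', 'dense', ..., 'moe', 'moe'] -> MoE only in the deeper half
```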
2.2. Is There a Way to Keep the Training/Inference Efficiency While Getting Generalization Performance Gain?
- To improve the generalization performance of MoE models, there are two common methods: (1) increasing the number of experts, which increases the memory requirement; and (2) using Top-2 expert selection, at the expense of more computation (about 33%).
- Two different MoEs are studied.
- Top2-MoE: doubling the capacity by using Top-2 expert selection; and
- Residual-MoE: fixing one dense MLP that every token passes through and varying the second expert across different experts (sketched below). The main intuition is to treat the expert from the MoE module as an error-correction term.
The generalization performance of these two MoEs is on par with each other, while Residual-MoE is more than 10% faster in training than Top2-MoE due to the reduced communication volume.
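A hedged sketch of a Residual-MoE block is given below, reusing the Expert and Top1MoE classes (and imports) from the earlier sketch; the module name and defaults are mine.

```python
class ResidualMoE(nn.Module):
    """Every token passes a fixed dense MLP; a single top-1 expert adds a
    gated correction on top of it (the 'error-correction' intuition)."""
    def __init__(self, d_model=1024, d_ff=4096, num_experts=128):
        super().__init__()
        self.fixed_mlp = Expert(d_model, d_ff)          # shared dense path, computed locally
        self.moe = Top1MoE(d_model, d_ff, num_experts)  # correction path, routed to one expert

    def forward(self, x):                               # x: [num_tokens, d_model]
        return self.fixed_mlp(x) + self.moe(x)
```

Compared with Top2-MoE, only one expert output per token has to travel through the expert all-to-all (the fixed MLP is computed locally), which is where the communication saving comes from.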
2.3. PR-MoE
- Thus, the new pyramid architecture uses more experts in the last few layers than in earlier layers (a toy expert-count schedule is sketched at the end of this subsection).
- The Residual-MoE architecture is used, where each token separately passes through one fixed MLP module and one chosen expert.
PR-MoE uses much fewer parameters but achieves accuracy comparable to Standard-MoE models.
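The "pyramid" part can be pictured as an expert-count schedule over the layers; the split and counts below are purely illustrative assumptions, not the paper's configuration.

```python
# Toy expert-count schedule (illustrative numbers only).
def pyramid_expert_counts(num_layers, small=64, large=128, large_fraction=0.25):
    cutoff = int(num_layers * (1 - large_fraction))
    return [small if i < cutoff else large for i in range(num_layers)]

print(pyramid_expert_counts(24))
# 64 experts in the earlier layers, 128 in the last quarter; each MoE layer
# itself is a Residual-MoE block as sketched above.
```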
3. Proposed Mixture-of-Students (MoS)
- A general formulation of the KD loss is used to force the MoS to imitate the outputs of the teacher MoE.
- The loss measures a weighted sum of the cross-entropy loss between the student's predictions and the given hard labels, and the KL-divergence loss between the student's predictions and the teacher's soft labels.
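- The equation itself appears as an image in the original post; a hedged reconstruction from the description above, with α as my notation for the mixing weight, is:

$$\mathcal{L}_{\text{MoS}} = (1-\alpha)\,\mathcal{L}_{\text{CE}}\big(y,\ p_{\text{student}}\big) + \alpha\, D_{\text{KL}}\big(p_{\text{teacher}} \,\|\, p_{\text{student}}\big)$$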
Left: However, while KD loss improves validation accuracy initially, it begins to hurt accuracy towards the end of training. Because the PR-MoE already reduces the capacity compared with the standard MoE, further reducing the depth of the model causes the student to have insufficient capacity, making it fall into the underfitting regime.
- Authors propose to gradually decay the impact of KD or stop KD early in the training process (a minimal schedule is sketched at the end of this section).
Right: Stopping KD at 400K steps gives the promised benefit of knowledge distillation: the student model now has a similar validation curve as the teacher.
- As shown in Table 2 above, MoS via staged KD achieves an average accuracy of 42.87 and 47.96, retaining 99.5% and 99.1% performance of the 350M (43.08) and 1.3B teacher model (48.37) despite having 12.5% fewer layers.
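A minimal sketch of the staged-KD schedule described above follows; the function name, α, and the exact mixing form are my assumptions (only the 400K stop step comes from the post).

```python
def mos_training_loss(step, ce_loss, kd_loss, kd_stop_step=400_000, alpha=0.5):
    """Mix CE and KD early in training, then drop the KD term entirely."""
    if step < kd_stop_step:
        return (1 - alpha) * ce_loss + alpha * kd_loss
    return ce_loss  # after the KD stage, train on the hard-label loss only
```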
4. DeepSpeed-MoE
- DeepSpeed has previously been applied mainly to dense Transformer models such as MT-NLG 530B, not to MoE models.
- MoE inference performance depends on two main factors: the overall model size and the overall achievable memory bandwidth.
4.1. Expert, Tensor and Data Parallelism
Tensor-slicing (for non-expert parameters) and expert-slicing (for expert parameters) split individual parameters across multiple GPUs to leverage the aggregate memory bandwidth across GPUs, while expert parallelism places different experts on different GPUs.
To scale the non-expert computation to the same number of GPUs, data parallelism is used, which incurs no communication overhead.
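The back-of-the-envelope sketch below shows how the three parallelism dimensions could compose at inference time; every number and the contiguous-rank layout are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative layout: 64 experts on 16 GPUs, tensor-parallel degree 4.
NUM_GPUS, NUM_EXPERTS, TP = 16, 64, 4

experts_per_gpu = NUM_EXPERTS // NUM_GPUS   # expert parallelism: experts spread over all GPUs
data_parallel = NUM_GPUS // TP              # non-expert params: tensor-sliced, then replicated

for gpu in range(NUM_GPUS):
    local_experts = list(range(gpu * experts_per_gpu, (gpu + 1) * experts_per_gpu))
    tp_rank, dp_rank = gpu % TP, gpu // TP  # which slice of non-expert params / which replica
    print(f"GPU {gpu:2d}: experts {local_experts}, tp_rank={tp_rank}, dp_rank={dp_rank}")
```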
4.2. Hierarchical All-to-All
Hierarchical tree-based algorithms are often used with communication collectives like all-reduce and broadcast to reduce the number of communication hops.
This reduces the communication hops from O(p) to O(G + p/G), where G is the number of GPUs in a node and p is the total number of GPU devices.
- The above Figure 6 shows the design overview of this implementation. Despite the 2× increase in communication volume, this hierarchical implementation allows for better scaling for small batch sizes.
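The hop-count argument can be checked with a couple of lines of pure arithmetic (my own illustration):

```python
def flat_hops(p):                    # single flat all-to-all among p GPUs
    return p

def hierarchical_hops(p, gpus_per_node=8):
    # intra-node all-to-all among G local GPUs, a data-layout transpose,
    # then an inter-node all-to-all among p / G groups
    return gpus_per_node + p // gpus_per_node

for p in (16, 128, 1024):
    print(p, flat_hops(p), hierarchical_hops(p))
# at p = 1024 with 8 GPUs per node: 1024 hops flat vs 8 + 128 = 136 hierarchical
```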
4.3. Parallelism Coordinated Communication
The data redundancy created by tensor-slicing is leveraged to limit the GPUs that participate in the all-to-all: since GPUs within a tensor-parallel group hold the same data, the all-to-all communication can be restricted to GPUs with the same tensor-parallelism rank.
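A small sketch of the group construction this implies; the contiguous tensor-parallel-rank layout is my assumption.

```python
# GPUs sharing the same tensor-parallel rank hold the same activations,
# so only they need to exchange tokens in the expert all-to-all.
NUM_GPUS, TP = 16, 4
all_to_all_groups = [[g for g in range(NUM_GPUS) if g % TP == tp_rank]
                     for tp_rank in range(TP)]
print(all_to_all_groups)
# [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
```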
4.4. Kernel Optimizations
For the MoE gating function, sparse data structures are used instead of the commonly used dense representations, which contain a cubic number of zeros and only a quadratic number of non-zeros; this reduces the compute complexity from cubic to quadratic.
- e.g., the gating function is fused into a single kernel, and a dense token-to-expert mapping table is used to represent the assignment from tokens to experts, greatly reducing the kernel launch overhead (a toy illustration follows below).
- (If interested, please feel free to read the paper directly.)
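As a toy illustration of the dense token-to-expert mapping table (mirroring the idea only, not the fused CUDA kernel; shapes are my assumptions):

```python
import torch
import torch.nn.functional as F

tokens, num_experts = 8, 4
gate_logits = torch.randn(tokens, num_experts)
token_to_expert = gate_logits.argmax(dim=-1)          # dense table: one expert id per token
one_hot = F.one_hot(token_to_expert, num_experts)     # the mostly-zero matrix it replaces
print(token_to_expert.tolist())                       # e.g. [2, 0, 3, 3, 1, 0, 2, 1]
```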
4.5. Results
Both DeepSpeed-MoE and PyTorch reduce the inference latency as the number of GPUs is increased, as expected, although PyTorch is much slower than DeepSpeed-MoE.
DeepSpeed-MoE achieves up to 7.3× reduction in latency while achieving up to 7.3× higher throughput.
DeepSpeed-MoE can (1) reduce the minimum number of GPUs required to perform inference and (2) further improve both latency and throughput.
A 52 billion-parameter DeepSpeed-MoE model (1.3B+MoE-128) is quality-equivalent to a 6.7 billion-parameter dense model.
A 1.5 trillion-parameter MoE model is quality-equivalent to a 175 billion-parameter dense model.