Review — X-MoE: On the Representation Collapse of Sparse Mixture of Experts
X-MoE: Analyzing & Solving the Collapse Problem of Sparse MoE
6 min read · Jul 16, 2023
On the Representation Collapse of Sparse Mixture of Experts,
X-MoE, by Beijing Institute of Technology, Microsoft Corporation, and Peking University,
2022 NeurIPS (Sik-Ho Tsang @ Medium)
Language Model
1991 … 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] [AlexaTM 20B] [OPT] [Switch Transformers] [LaMDA] [LoRA] [Galactica] [WideNet] [MoEBERT] 2023 [GPT-4] [LLaMA] [LIMA] [Koala] [BloombergGPT] [GLM-130B] [UL2]
==== My Other Paper Readings Are Also Over Here ====
- Learning a routing mechanism in Sparse MoE encourages token clustering around expert centroids, implying a trend toward representation collapse.
- In this work, X-MoE estimates the routing scores between tokens and experts on a low-dimensional hypersphere, which alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods (a sketch of this routing computation is shown below).
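To make the hypersphere routing concrete, here is a minimal PyTorch sketch of the idea, not the authors' released code: the class name `HypersphereRouter`, the routing dimension `d_route`, and the temperature initialization are my own illustrative assumptions. Each token representation is projected into a low-dimensional space, both the projected tokens and the expert embeddings are L2-normalized, and the resulting cosine similarities are rescaled by a learnable temperature.

```python
# Minimal sketch (illustrative, not the paper's code) of hypersphere routing:
# low-dimensional projection + L2 normalization + scaled cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereRouter(nn.Module):
    def __init__(self, d_model: int, d_route: int, num_experts: int, init_temp: float = 0.07):
        super().__init__()
        # Project the token representation h into a low-dimensional routing space.
        self.dim_reduce = nn.Linear(d_model, d_route, bias=False)
        # Learnable expert embeddings, one per expert, in the same low-dimensional space.
        self.expert_emb = nn.Parameter(torch.randn(num_experts, d_route))
        # Learnable temperature that rescales the cosine similarities.
        self.log_temp = nn.Parameter(torch.tensor(init_temp).log())

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> low-dimensional, L2-normalized token vectors.
        t = F.normalize(self.dim_reduce(h), dim=-1)
        # L2-normalize the expert embeddings so the scores live on the hypersphere.
        e = F.normalize(self.expert_emb, dim=-1)
        # Cosine similarity between each token and each expert, scaled by temperature.
        scores = torch.einsum("bsd,nd->bsn", t, e) / self.log_temp.exp()
        return scores  # top-1 / top-k dispatch happens outside this sketch

# Usage sketch:
# router = HypersphereRouter(d_model=768, d_route=8, num_experts=32)
# scores = router(torch.randn(2, 16, 768))   # (2, 16, 32)
# top1 = scores.argmax(dim=-1)               # chosen expert index per token
```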
Outline
- Collapse Issue in MoE
- X-MoE
- Results
1. Collapse Issue in MoE
1.1. Sparse MoE (SMoE)
- For the input token x with its hidden representation h, the router computes the routing score between h and the i-th expert as a dot product with a learnable expert embedding e_i, i.e. s_i = h · e_i, and dispatches the token to the top-scoring expert(s). A minimal sketch of this baseline router follows.
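For reference, below is a minimal PyTorch sketch of this standard dot-product router; the class name `DotProductRouter` and all sizes are illustrative assumptions, not any specific implementation.

```python
# Minimal sketch (illustrative) of the standard SMoE router: dot product
# between the hidden state h and each learnable expert embedding e_i,
# followed by softmax gating and top-1 dispatch.
import torch
import torch.nn as nn

class DotProductRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        # One learnable embedding e_i per expert, same dimensionality as h.
        self.expert_emb = nn.Parameter(torch.randn(num_experts, d_model))

    def forward(self, h: torch.Tensor):
        # Routing score s_i = h . e_i for every expert i.
        scores = h @ self.expert_emb.t()   # (batch, seq, num_experts)
        gates = scores.softmax(dim=-1)     # normalized routing probabilities
        top1 = gates.argmax(dim=-1)        # index of the chosen expert per token
        return gates, top1

# Usage sketch:
# router = DotProductRouter(d_model=768, num_experts=32)
# gates, top1 = router(torch.randn(2, 16, 768))
```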