# Review — X-MoE: On the Representation Collapse of Sparse Mixture of Experts

## X-MoE: Analyzing and Solving the Collapse Problem of Sparse MoE

6 min read · Jul 16


On the Representation Collapse of Sparse Mixture of Experts (X-MoE), by Beijing Institute of Technology, Microsoft Corporation, and Peking University, 2022 NeurIPS (Sik-Ho Tsang @ Medium)

Language Model: [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] [AlexaTM 20B] [OPT] [Switch Transformers] [LaMDA] [LoRA] [Galactica] [WideNet] [MoEBERT]

1991 … 2022 … 2023: [GPT-4] [LLaMA] [LIMA] [Koala] [BloombergGPT] [GLM-130B] [UL2]

==== My Other Paper Readings Are Also Over Here ====

- Learning a **routing mechanism** in **Sparse MoE** encourages **token clustering around expert centroids**, implying **a trend toward representation collapse.**
- In this work, **X-MoE** proposes to **estimate the routing scores between tokens and experts on a low-dimensional hypersphere**, which **alleviates the representation collapse issue** and achieves more consistent routing than the baseline mixture-of-experts methods.
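The core idea can be sketched in a few lines of NumPy: project tokens to a low-dimensional space, L2-normalize both token and expert embeddings so routing happens on the unit hypersphere, and scale the cosine scores by a temperature. This is a minimal illustration under stated assumptions, not the paper's implementation; `W_proj`, `d_low`, and the fixed `tau` value are placeholder choices (the temperature is learnable in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_low, N = 8, 2, 4            # hidden size, low routing dim, number of experts
W_proj = rng.normal(size=(d, d_low))    # dimension-reduction projection (assumed shape)
expert_emb = rng.normal(size=(N, d_low))  # low-dimensional expert embeddings
tau = 0.07                              # temperature; learnable in the paper, fixed here

def xmoe_scores(h):
    """Cosine routing scores between a token and all experts on the hypersphere."""
    z = h @ W_proj                                              # project to low dim
    z = z / np.linalg.norm(z)                                   # unit-norm token
    e = expert_emb / np.linalg.norm(expert_emb, axis=1, keepdims=True)  # unit-norm experts
    return (e @ z) / tau                                        # cosine similarity / tau

s = xmoe_scores(rng.normal(size=d))     # one score per expert
```

Because every score is a cosine of unit vectors, `s * tau` always lies in [-1, 1], which bounds the routing logits regardless of how large the hidden representations grow.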

# Outline

1. **Collapse Issue in MoE**
2. **X-MoE**
3. **Results**

**1. Collapse Issue in MoE**

## 1.1. Sparse MoE (SMoE)

- For the **input token** *x* with its **hidden representation** *h*, the router computes the **routing score between** *h* **and the** *i*-th **expert** by a dot-product similarity metric *s_i* = *h*ᵀ*e_i*, where *e_i* ∈ ℝᵈ is a learnable expert embedding and *d* is the hidden size of the model.
- Then, the router utilizes **a sparse gating function** *g*(·) to **make the expert networks conditionally activated.** In this paper, the authors focus on **top-1 routing.** Formally, considering **an SMoE layer with** *N* **experts**:

  *f*^SMoE(*h*) = *h* + *g*(*s_k*) · *f_k*^FFN(*h*),  where *k* = argmaxᵢ *s_i*

- where *f_k*^FFN(·) stands for the *k*-th expert network.
- For the gating function *g*(*s_k*), it can be **softmax or sigmoid gating**:

  softmax: *g*(*s_k*) = exp(*s_k*) / Σⱼ exp(*s_j*);  sigmoid: *g*(*s_k*) = σ(*s_k*)
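The top-1 routing described above can be sketched as follows. This is a toy NumPy illustration with softmax gating; the random expert "FFNs" are placeholders standing in for real two-layer feed-forward networks, not the paper's code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, N = 4, 3                                   # hidden size, number of experts
expert_emb = rng.normal(size=(N, d))          # learnable expert embeddings e_i
# Placeholder expert networks f_k^FFN (each a random tanh layer for illustration)
experts = [lambda h, W=rng.normal(size=(d, d)): np.tanh(h @ W) for _ in range(N)]

def smoe_layer(h):
    """Top-1 SMoE: route h to the expert with the highest score s_i = h . e_i."""
    s = expert_emb @ h                        # routing scores, shape (N,)
    k = int(np.argmax(s))                     # top-1 expert index
    g = softmax(s)[k]                         # softmax gate value for the chosen expert
    return h + g * experts[k](h)              # residual connection + gated expert output

out = smoe_layer(rng.normal(size=d))
```

Only the single selected expert is evaluated, which is what keeps the compute cost constant as *N* grows; the gate value *g*(*s_k*) scales that expert's contribution.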