
Review — X-MoE: On the Representation Collapse of Sparse Mixture of Experts

X-MoE: Analyzing & Solving the Representation Collapse Problem of Sparse MoE

Sik-Ho Tsang
6 min read · Jul 16, 2023

On the Representation Collapse of Sparse Mixture of Experts,
X-MoE, by Beijing Institute of Technology, Microsoft Corporation, and Peking University
2022 NeurIPS (Sik-Ho Tsang @ Medium)

Language Model
1991 … 2022
[GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] [AlexaTM 20B] [OPT] [Switch Transformers] [LaMDA] [LoRA] [Galactica] [WideNet] [MoEBERT] 2023 [GPT-4] [LLaMA] [LIMA] [Koala] [BloombergGPT] [GLM-130B] [UL2]
==== My Other Paper Readings Are Also Over Here ====

  • Learning a routing mechanism in Sparse MoE encourages token clustering around expert centroids, implying a trend toward representation collapse.
  • In this work, X-MoE proposes to estimate the routing scores between tokens and experts on a low-dimensional hypersphere, which alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods (a sketch of this routing idea is shown right below).
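
To make this concrete, here is a minimal PyTorch sketch (not the authors' released code) of routing on a low-dimensional hypersphere: the token representation is projected to a small routing dimension, both the projected token and the expert embeddings are L2-normalized, and the cosine scores are scaled by a learnable temperature before gating. All names and sizes (routing_dim, num_experts, the temperature initialization) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereRouter(nn.Module):
    """Sketch: route tokens by cosine similarity on a low-dimensional hypersphere."""

    def __init__(self, dim=768, routing_dim=8, num_experts=32):
        super().__init__()
        self.proj = nn.Linear(dim, routing_dim, bias=False)       # dimension reduction
        self.expert_emb = nn.Parameter(torch.randn(num_experts, routing_dim))
        self.log_tau = nn.Parameter(torch.zeros(()))              # learnable temperature (illustrative)

    def forward(self, h):
        # h: (num_tokens, dim)
        z = F.normalize(self.proj(h), dim=-1)                     # project the token, then L2-normalize
        e = F.normalize(self.expert_emb, dim=-1)                  # L2-normalize the expert embeddings
        scores = (z @ e.t()) / self.log_tau.exp()                 # cosine similarity scaled by temperature
        top1 = scores.argmax(dim=-1)                              # top-1 expert index per token
        gate = scores.softmax(dim=-1).gather(-1, top1.unsqueeze(-1)).squeeze(-1)
        return top1, gate                                         # chosen expert and its gate value


# Usage: route a batch of 4 token representations of width 768.
router = HypersphereRouter()
expert_idx, gate = router(torch.randn(4, 768))
```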

Outline

  1. Collapse Issue in MoE
  2. X-MoE
  3. Results

1. Collapse Issue in MoE

1.1. Sparse MoE (SMoE)

  • For an input token x with hidden representation h ∈ R^d, where d is the hidden size of the model, the router computes the routing score between h and the i-th expert with a dot-product similarity metric, s_i = h·e_i, where e_i ∈ R^d is a learnable expert embedding.
  • The router then uses a sparse gating function g(·) to activate the expert networks conditionally. In this paper, the authors focus on top-1 routing. Formally, for an SMoE layer with N experts, the output is f_SMoE(h) = h + g(s_k)·f_FFN_k(h) with k = argmax_i s_i,
  • where f_FFN_k(·) stands for the k-th expert network, implemented as a stacked feed-forward network.
  • The gating function g(s_k) can be softmax gating, g(s_k) = exp(s_k) / Σ_j exp(s_j), or sigmoid gating, g(s_k) = σ(s_k). A minimal sketch of this top-1 routing is given right after this list.
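
For reference, a minimal PyTorch sketch of the top-1 routing described above, assuming softmax gating and a residual connection around the selected expert; the expert networks and all sizes here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class Top1SMoE(nn.Module):
    """Sketch of a standard top-1 SMoE layer: s_i = h·e_i, softmax gating, one expert per token."""

    def __init__(self, dim=768, num_experts=32):
        super().__init__()
        self.expert_emb = nn.Parameter(torch.randn(num_experts, dim))   # e_i, one embedding per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, h):
        # h: (num_tokens, dim)
        s = h @ self.expert_emb.t()                  # routing scores s_i = h·e_i
        gate = s.softmax(dim=-1)                     # softmax gating g(s)
        k = s.argmax(dim=-1)                         # top-1 expert index per token
        out = torch.zeros_like(h)
        for idx, expert in enumerate(self.experts):  # dispatch each token to its chosen expert
            mask = k == idx
            if mask.any():
                out[mask] = expert(h[mask])
        gk = gate.gather(-1, k.unsqueeze(-1))        # g(s_k) of the selected expert
        return h + gk * out                          # residual + gated expert output
```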

1.2. Representation Collapse of SMoE

  • For convenience, h′ = f_SMoE(h) denotes the output of the SMoE layer, S_k = g(s_k) is the k-th output of the softmax gating function, and h_FFN = f_FFN_k(h) denotes the output of the k-th expert network.
  • The Jacobian matrix of h′ with respect to h is given just below, where δ_kj denotes the Kronecker delta.
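
The Jacobian itself appears only as an image in the original post; reconstructed from the definitions above (using the residual form h′ = h + S_k·h_FFN), it reads:

```latex
J \;=\; \frac{\partial h'}{\partial h}
  \;=\; \underbrace{\Big( I + S_k \frac{\partial f^{\mathrm{FFN}}_k(h)}{\partial h} \Big)}_{J_1}
  \;+\; \underbrace{\sum_{j=1}^{N} S_k \,(\delta_{kj} - S_j)\; h^{\mathrm{FFN}} e_j^{\top}}_{J_2}
```

J_1 updates the token representation through the expert network, while J_2 is a sum of rank-one terms along the expert embeddings: back-propagating through it moves h along linear combinations of e_1, …, e_N, which span at most an N-dimensional subspace of the d-dimensional hidden space with N ≪ d, hence the trend toward representation collapse.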


Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
