
Review — DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Using Enhanced MoE in DeepSeek

Jul 27, 2025


DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE, by Peking University, DeepSeek-AI, Tsinghua University, and Nanjing University
2024 ACL, Over 410 Citations (Sik-Ho Tsang @ Medium)

Large Language Model (LLM)
2020 …
2023 [GPT-4] [LLaMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2] [Llama 2] [MultiMedQA, HealthSearchQA, Med-PaLM] [Med-PaLM 2] [Flan 2022, Flan-T5] [AlphaCode 2] [Mistral 7B] [Alpaca] [Inflection-1] 2024 [Nemotron-4 15B] [DeepSeek-v1]
==== My Other Paper Readings Are Also Over Here ====

Outline

  1. DeepSeekMoE
  2. Results

1. DeepSeekMoE

Fig. 1: DeepSeekMoE architecture

1.1. Transformer
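A standard Transformer layer applies self-attention followed by a position-wise FFN, each with a residual connection. A rough sketch of those two steps in the paper's notation (my reconstruction, omitting layer normalization; u denotes the attention output and h the hidden state of token t in layer l):

```latex
u^{l}_{1:T} = \mathrm{Self\text{-}Att}\!\left(h^{l-1}_{1:T}\right) + h^{l-1}_{1:T},
\qquad
h^{l}_{t} = \mathrm{FFN}\!\left(u^{l}_{t}\right) + u^{l}_{t}
```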

1.2. MoE

  • A typical practice for constructing an MoE language model is to substitute the Feed-Forward Networks (FFNs) in a Transformer with MoE layers at specified intervals.
  • An MoE layer is composed of multiple experts, each structurally identical to a standard FFN, and each token is assigned to only a few of these experts. If the l-th FFN is substituted with an MoE layer, its computation can be expressed as:
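A sketch of that computation in the paper's notation (my reconstruction of the standard top-K formulation: N routed experts, gating scores derived from the token-to-expert affinity, with e denoting the learnable centroid of expert i in layer l):

```latex
h^{l}_{t} = \sum_{i=1}^{N} g_{i,t}\,\mathrm{FFN}_{i}\!\left(u^{l}_{t}\right) + u^{l}_{t},
\qquad
s_{i,t} = \mathrm{Softmax}_{i}\!\left({u^{l}_{t}}^{\!\top} e^{l}_{i}\right),
\qquad
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \mathrm{TopK}\!\left(\{\, s_{j,t} \mid 1 \le j \le N \,\},\, K\right), \\
0, & \text{otherwise}.
\end{cases}
```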

1.3. Fine-Grained Expert Segmentation

  • With only a few large experts, each token is routed to a small number of them, which limits how finely knowledge can be separated. However, if each token can be routed to more experts, diverse knowledge gains the potential to be decomposed and learned in different experts, while each expert can still remain specialized and focused.

Thus, as in Fig. 1(b), each expert FFN is segmented into m smaller experts by reducing the FFN intermediate hidden dimension to 1/m times its original size. The output of an MoE layer can be expressed as:
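A sketch of the segmented formulation (same notation as above, my reconstruction): each of the N original experts is split into m fine-grained experts, and the number of activated experts grows from K to mK so the activated computation stays the same:

```latex
h^{l}_{t} = \sum_{i=1}^{mN} g_{i,t}\,\mathrm{FFN}_{i}\!\left(u^{l}_{t}\right) + u^{l}_{t},
\qquad
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \mathrm{TopK}\!\left(\{\, s_{j,t} \mid 1 \le j \le mN \,\},\, mK\right), \\
0, & \text{otherwise}.
\end{cases}
```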

  • where the total number of expert parameters is equal to N times that of a standard FFN, and mN denotes the total number of fine-grained experts. Also, the number of nonzero gates increases to mK.
  • Consider the case where N = 16: a typical top-2 routing strategy yields C(16, 2) = 120 possible combinations. By contrast, if each expert is split into m = 4 smaller experts, top-8 routing over the resulting 64 experts yields C(64, 8) = 4,426,165,368 potential combinations (see the quick check after this list).
  • The surge in combinatorial flexibility enhances the potential for achieving more accurate and targeted knowledge acquisition.
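A quick sanity check of those combination counts (plain Python; the variable names are mine):

```python
from math import comb

# Conventional setting: 16 experts, top-2 routing.
N, K = 16, 2
print(comb(N, K))           # 120

# Fine-grained segmentation with m = 4: 64 smaller experts, top-8 routing
# (the activated computation stays the same).
m = 4
print(comb(m * N, m * K))   # 4426165368
```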

1.4. Shared Expert Isolation

  • If there are shared experts that capture and consolidate common knowledge across varying contexts, the parameter redundancy among other routed experts will be alleviated.
  • Thus, as in Fig. 1(c), Ks experts are further isolated as shared experts: every token is deterministically routed to them, and the number of activated routed experts is reduced by Ks so that the total activated computation stays constant (see the sketch after this list).
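A sketch of the complete DeepSeekMoE layer (my reconstruction, same notation as above): the Ks shared experts bypass the gate and process every token, while top-(mK − Ks) routing is applied over the remaining mN − Ks fine-grained experts:

```latex
h^{l}_{t} = \sum_{i=1}^{K_s} \mathrm{FFN}_{i}\!\left(u^{l}_{t}\right)
          + \sum_{i=K_s+1}^{mN} g_{i,t}\,\mathrm{FFN}_{i}\!\left(u^{l}_{t}\right)
          + u^{l}_{t},
\qquad
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \mathrm{TopK}\!\left(\{\, s_{j,t} \mid K_s + 1 \le j \le mN \,\},\, mK - K_s\right), \\
0, & \text{otherwise}.
\end{cases}
```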

2. Results

2.1. DeepSeekMoE 2B

  1. With more total parameters, Hash Layer and Switch Transformer achieve significantly stronger performance than the dense baseline with the same number of activated parameters.
  2. Compared with Hash Layer and Switch Transformer, GShard has more activated parameters and achieves slightly better performance.
  3. With the same number of total and activated parameters, DeepSeekMoE demonstrates overwhelming advantages over GShard.

2.2. DeepSeekMoE 16B

  1. With only about 40% of the computation, DeepSeekMoE 16B achieves performance comparable to LLaMA2 7B and DeepSeek 7B.
  2. DeepSeekMoE 16B exhibits notable strengths in language modeling and knowledge-intensive tasks such as Pile, HellaSwag, and TriviaQA.
  3. Compared with its excellent performance on other tasks, DeepSeekMoE 16B exhibits limitations on multiple-choice tasks, which may stem from its limited number of attention parameters.
  4. Compared with LLaMA2 7B, DeepSeek 7B and DeepSeekMoE 16B have much stronger performance on math, coding, and Chinese benchmarks.



Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
