MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation

MoEBERT, Distilling BERT to Narrower BERT with MoE

Sik-Ho Tsang
4 min read · Jun 24


MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation,
MoEBERT, by Georgia Institute of Technology, and Microsoft,
2022 NAACL (Sik-Ho Tsang @ Medium)


  • MoEBERT is proposed, which uses a Mixture-of-Experts (MoE) structure (originally by G. E. Hinton) to increase model capacity and inference speed.
  • MoEBERT is initialized by adapting the feed-forward neural networks in a pre-trained model into multiple experts.
  • As such, the representation power of the pre-trained model is largely retained. During inference, only one of the experts is activated, so inference is faster.
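To make "only one of the experts is activated" concrete, here is a minimal NumPy sketch of a top-1 MoE feed-forward layer. It is not the authors' code: the function name `moe_ffn_top1`, the toy sizes, and in particular the random token-to-expert `assignment` are assumptions made for illustration, since the routing rule is not described in this summary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def moe_ffn_top1(X, experts, assignment):
    """Apply exactly one expert FFN to each token.

    X:          (n_tokens, d) token representations
    experts:    list of (W1, b1, W2, b2), with W1: (d, d_expert), W2: (d_expert, d)
    assignment: (n_tokens,) integer expert index per token (hypothetical routing)
    """
    out = np.zeros_like(X)
    for e, (W1, b1, W2, b2) in enumerate(experts):
        mask = assignment == e
        if not mask.any():
            continue
        # Only the tokens routed to expert e pass through its (narrow) FFN.
        hidden = gelu(X[mask] @ W1 + b1)
        out[mask] = hidden @ W2 + b2
    return out

rng = np.random.default_rng(0)
d, d_expert, n_experts, n_tokens = 8, 4, 4, 10   # toy sizes, not the paper's
experts = [
    (rng.normal(size=(d, d_expert)), np.zeros(d_expert),
     rng.normal(size=(d_expert, d)), np.zeros(d))
    for _ in range(n_experts)
]
X = rng.normal(size=(n_tokens, d))
assignment = rng.integers(0, n_experts, size=n_tokens)  # assumed routing rule
Y = moe_ffn_top1(X, experts, assignment)
print(Y.shape)
```

Each token costs one narrow FFN rather than one wide FFN, which is where the speed-up comes from, while the full set of experts still holds more parameters than any single one of them.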


  1. MoEBERT
  2. Results


1. MoEBERT

1.1. Importance Score

  • Some columns in W1 (and the corresponding rows in W2) contribute more than others to model performance.
  • The importance score, originally introduced in the model pruning literature, measures this parameter importance.
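  • For a dataset D with sample pairs {(x, y)}, the score is defined as:

    $$I_j = \sum_{(x, y) \in D} \left| (w_j^1)^\top \nabla_{w_j^1} \mathcal{L}(x, y) + (w_j^2)^\top \nabla_{w_j^2} \mathcal{L}(x, y) \right|$$

  • where $w_j^1$ is the j-th column of W1, $w_j^2$ is the j-th row of W2, and $\mathcal{L}(x, y)$ is the loss.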
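As a rough illustration of how such a score could be computed, the sketch below uses a toy two-layer FFN with hand-written gradients. The ReLU activation, the squared-error loss, the absence of biases, and the function name `importance_scores` are all assumptions of this example rather than details from the paper. For each hidden neuron j, it accumulates the absolute value of "weights times gradient" over the j-th column of W1 and the j-th row of W2.

```python
import numpy as np

def importance_scores(W1, W2, X, Y):
    """Per-neuron importance for a toy FFN: out = relu(x @ W1) @ W2.

    W1: (d, d_h), column j feeds hidden neuron j
    W2: (d_h, d), row j carries hidden neuron j's output
    Loss (assumed for this sketch): 0.5 * ||out - y||^2
    """
    d_h = W1.shape[1]
    scores = np.zeros(d_h)
    for x, y in zip(X, Y):
        pre = x @ W1                      # (d_h,) pre-activations
        h = np.maximum(pre, 0.0)          # ReLU
        out = h @ W2                      # (d,)
        g_out = out - y                   # dLoss/dout
        g_W2 = np.outer(h, g_out)         # dLoss/dW2, shape (d_h, d)
        g_pre = (W2 @ g_out) * (pre > 0)  # back through ReLU
        g_W1 = np.outer(x, g_pre)         # dLoss/dW1, shape (d, d_h)
        # |w1_j . grad(w1_j) + w2_j . grad(w2_j)| for every neuron j
        scores += np.abs((W1 * g_W1).sum(axis=0) + (W2 * g_W2).sum(axis=1))
    return scores

rng = np.random.default_rng(0)
d, d_h, n = 4, 6, 5                       # toy sizes
W1 = rng.normal(size=(d, d_h))
W2 = rng.normal(size=(d_h, d))
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d))
scores = importance_scores(W1, W2, X, Y)
ranking = np.argsort(-scores)             # most important neuron first
print(ranking)
```

In practice the gradients would come from automatic differentiation on the real task loss; the manual backward pass here only keeps the example dependency-free.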

Figure: Adapting Two Layers into Two Experts

Empirically, sharing the most important columns among the experts benefits model performance. Based on this finding, the top-s columns are shared and the FFN is adapted into N experts.

  • e.g., in the figure, the blue neuron is important and is shared among the experts.
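The adaptation step can be sketched as follows. Given per-neuron importance scores, every expert receives the top-s neurons, and each expert is then filled up with its own non-shared neurons. How the non-shared neurons are assigned is not described in this summary, so the round-robin assignment by rank (with the least important neurons left unused) is an assumption of this sketch, as are the function name `adapt_ffn_to_experts` and the random scores.

```python
import numpy as np

def adapt_ffn_to_experts(W1, W2, scores, n_experts, n_shared, expert_dim):
    """Slice one FFN (W1: (d, d_h), W2: (d_h, d)) into several narrower experts."""
    order = np.argsort(-scores)           # neuron indices, most important first
    shared = order[:n_shared]             # top-s neurons go to every expert
    rest = order[n_shared:]
    n_private = expert_dim - n_shared
    experts = []
    for e in range(n_experts):
        # Assumption: remaining neurons are dealt out round-robin by rank.
        private = rest[e::n_experts][:n_private]
        idx = np.concatenate([shared, private])
        # Column j of W1 and row j of W2 belong to the same hidden neuron.
        experts.append((W1[:, idx].copy(), W2[idx, :].copy(), idx))
    return experts

rng = np.random.default_rng(0)
d, d_h = 4, 12                            # toy sizes
W1 = rng.normal(size=(d, d_h))
W2 = rng.normal(size=(d_h, d))
scores = rng.random(d_h)                  # hypothetical importance scores
experts = adapt_ffn_to_experts(W1, W2, scores, n_experts=2, n_shared=2, expert_dim=4)
for W1_e, W2_e, idx in experts:
    print(idx, W1_e.shape, W2_e.shape)
```

Because the experts are slices of the pre-trained weights rather than freshly initialized matrices, each expert starts from a function close to part of the original FFN.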

1.2. Distillation

  • BERT is used as the teacher, and the MoE model adapted from BERT is the student.
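  • For the Transformer layers, the distillation loss is the mean squared error between the teacher’s layer output $X^l_{tea}$ and the student’s layer output $X^l$. Concretely, for an input x, the Transformer layer distillation loss is:

    $$\mathcal{L}_{trm}(x) = \sum_{l=0}^{L} \mathrm{MSE}\left(X^l, X^l_{tea}\right)$$

  • where L is the total number of layers (l = 0 denotes the embedding layer output).
  • For the prediction layer, KL divergence between the student prediction f(x) and the teacher prediction $f_{tea}(x)$ is used as the distillation loss:

    $$\mathcal{L}_{pred}(x) = \frac{1}{2}\Big(\mathrm{KL}\big(f(x) \,\|\, f_{tea}(x)\big) + \mathrm{KL}\big(f_{tea}(x) \,\|\, f(x)\big)\Big)$$

  • The layer-wise distillation loss is the sum of the above losses:

    $$\mathcal{L}_{distill}(x) = \mathcal{L}_{trm}(x) + \mathcal{L}_{pred}(x)$$

  • Given the training dataset D and samples {(x, y)}, the training objective is:

    $$\min_{\Theta} \; \mathbb{E}_{(x, y) \sim D} \Big[ \mathrm{CE}\big(f(x), y\big) + \lambda_{distill} \, \mathcal{L}_{distill}(x) \Big]$$

  • where CE is the cross-entropy loss, $\Theta$ denotes the student's parameters, and $\lambda_{distill}$ weights the distillation term.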
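The losses described above can be combined as in the following NumPy sketch for a single example. The symmetrized form of the KL term, the weight `lam`, and all function names are assumptions of this sketch, not confirmed details of the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def transformer_layer_loss(student_layers, teacher_layers):
    # Sum over layers of the MSE between student and teacher layer outputs
    return float(sum(np.mean((s - t) ** 2)
                     for s, t in zip(student_layers, teacher_layers)))

def prediction_loss(student_logits, teacher_logits):
    # Assumption: KL is symmetrized by averaging both directions
    p, q = softmax(student_logits), softmax(teacher_logits)
    return 0.5 * (kl(p, q) + kl(q, p))

def training_loss(student_logits, label, student_layers,
                  teacher_logits, teacher_layers, lam=1.0):
    # Cross entropy on the true label plus the weighted distillation loss
    ce = -np.log(softmax(student_logits)[label] + 1e-12)
    distill = (transformer_layer_loss(student_layers, teacher_layers)
               + prediction_loss(student_logits, teacher_logits))
    return float(ce + lam * distill)

rng = np.random.default_rng(0)
teacher_layers = [rng.normal(size=(3, 4)) for _ in range(3)]   # toy layer outputs
student_layers = [t + 0.1 * rng.normal(size=t.shape) for t in teacher_layers]
teacher_logits = np.array([2.0, -1.0])
student_logits = np.array([1.5, -0.5])
loss = training_loss(student_logits, 0, student_layers,
                     teacher_logits, teacher_layers, lam=1.0)
print(loss)
```

When the student reproduces the teacher exactly, both distillation terms vanish and only the cross-entropy term remains.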

1.3. Model

  • The number of experts in the MoE model is set to 4, and the hidden dimension of each expert is set to 768.
  • The top-512 important neurons are shared among the experts.
  • The number of effective parameters of the MoE model is 66M (vs. 110M for BERT-base).
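A back-of-the-envelope check of the effective parameter count can be done as below. It assumes the standard BERT-base shape (12 layers, model dimension 768, FFN hidden dimension 3072), counts biases, ignores any gating parameters, and starts from the rounded 110M figure; how the 66M figure is actually counted is not spelled out here, so treat the result as approximate.

```python
def ffn_params(d_model, d_hidden):
    # Two linear maps with biases: d_model -> d_hidden -> d_model
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model

d_model, n_layers = 768, 12                # assumed BERT-base shape
dense_ffn = ffn_params(d_model, 3072)      # original FFN, per layer
one_expert = ffn_params(d_model, 768)      # one active expert, per layer
saved = n_layers * (dense_ffn - one_expert)

bert_base_total = 110_000_000              # rounded figure quoted in the text
effective = bert_base_total - saved
print(f"FFN parameters no longer used per token: {saved / 1e6:.1f}M")
print(f"Approximate effective parameters: {effective / 1e6:.1f}M")
```

The estimate lands in the same range as the 66M quoted above; the small difference is expected given the rounded starting point and the unstated counting conventions.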

2. Results

2.1. GLUE


MoEBERT outperforms all of the baseline methods on 6 of the 7 tasks.

2.2. Question Answering


MoEBERT significantly outperforms all of the baseline methods in terms of both evaluation metrics: exact match (EM) and F1.

2.3. Inference Speed

  • DistilBERT is a shallower model, i.e., it has only 6 layers instead of 12, whereas MoEBERT is a narrower model, i.e., the FFN hidden dimension of each expert is 768 instead of 3072.

MoEBERT is slightly slower than DistilBERT, but significantly faster than BERT.


