# MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation

MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation, MoEBERT, by Georgia Institute of Technology and Microsoft, 2022 NAACL (Sik-Ho Tsang @ Medium)


**MoEBERT** is proposed, which uses a **Mixture-of-Experts (MoE)** structure to **increase model capacity and inference speed**.

- MoEBERT is initialized by **adapting the feed-forward neural networks in a pre-trained model into multiple experts**. As such, the representation power of the pre-trained model is largely retained.
- **During inference, only one of the experts is activated**, such that **speed can be improved**.
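As a minimal sketch of this idea (the class name, sizes, and the simple one-expert router below are illustrative assumptions, not the paper's exact implementation), an MoE feed-forward layer that activates a single expert per token could look like:

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Sketch of an MoE feed-forward layer where each token activates
    exactly one expert (names and sizes are illustrative)."""

    def __init__(self, d_model=768, d_expert=768, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x, token_ids):
        # x: (seq_len, d_model); token_ids: (seq_len,)
        # Route each token to one expert by hashing its id (an assumption;
        # any deterministic one-expert-per-token router fits the idea above).
        expert_idx = token_ids % self.num_experts
        out = torch.empty_like(x)
        for e in range(self.num_experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = self.experts[e](x[mask])
        return out
```

Because each token passes through only one expert's FFN, the per-token compute matches a single (possibly narrower) FFN, even though total parameter capacity grows with the number of experts.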

# Outline

1. **MoEBERT**
2. **Results**

# 1. MoEBERT

## 1.1. Importance Score

- There are **some columns in *W*1** (correspondingly, some rows in *W*2) that **contribute more** than the others to model performance.
- The **importance score**, originally introduced in the model pruning literature, **measures such parameter importance**.
- For a **dataset *D*** with **sample pairs {(*x*, *y*)}**, the score is defined as:

$$I_j = \sum_{(x,y) \in D} \left| (w_1^j)^\top \nabla_{w_1^j} \mathcal{L}(x, y) + (w_2^j)^\top \nabla_{w_2^j} \mathcal{L}(x, y) \right|$$

- where *w*1*j* is the *j*-th column of *W*1, *w*2*j* is the *j*-th row of *W*2, and *L*(*x*, *y*) is the loss.

The idea is that **sharing the most important columns benefits model performance**. Based on this finding, the **top-*s* columns are shared**, and the **FFN is adapted into *N* experts**.

- e.g.: Blue neuron: Important and shared among experts.
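The score above can be sketched with autograd (a toy example; the tensor shapes, the stand-in loss, and the value of *s* are illustrative, and a real run would accumulate scores over the whole dataset):

```python
import torch

d_model, d_ff = 8, 16
w1 = torch.randn(d_model, d_ff, requires_grad=True)  # columns = FFN hidden units
w2 = torch.randn(d_ff, d_model, requires_grad=True)  # rows = FFN hidden units
x = torch.randn(4, d_model)

# Toy FFN forward pass and loss, standing in for L(x, y).
h = torch.relu(x @ w1)
y_hat = h @ w2
loss = y_hat.pow(2).mean()

g1, g2 = torch.autograd.grad(loss, [w1, w2])
# Importance of hidden unit j: |w1_j . dL/dw1_j + w2_j . dL/dw2_j|
scores = ((w1 * g1).sum(dim=0) + (w2 * g2).sum(dim=1)).abs()

s = 4  # top-s columns shared among all experts (value illustrative)
shared = scores.topk(s).indices
```

Each remaining (non-shared) column would then be assigned to one of the *N* experts, so every expert ends up narrower than the original FFN.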

## 1.2. Distillation

- The fine-tuned BERT model is used as the teacher, and MoEBERT as the student.
- For the Transformer layers, the **distillation loss** is the **mean squared error between the teacher’s layer output** $X_l^{tea}$ **and the student’s layer output** $X_l$. Concretely, for an input *x*, the Transformer layer distillation loss is:

$$\mathcal{L}_{trm}(x) = \sum_{l=1}^{L} \text{MSE}\left(X_l, X_l^{tea}\right)$$

- where *L* is the total number of layers.
- For the **prediction layer**, the **KL divergence** is used as the distillation loss:

$$\mathcal{L}_{pred}(x) = \text{KL}\left(p(x) \,\|\, p^{tea}(x)\right)$$

- where $p$ and $p^{tea}$ are the student’s and teacher’s predicted distributions.

- The layer-wise distillation loss is the sum of the above losses:

$$\mathcal{L}_{distill}(x) = \mathcal{L}_{trm}(x) + \mathcal{L}_{pred}(x)$$

- Given the training dataset *D* and samples {(*x*, *y*)}, the training objective is:

$$\mathcal{L} = \sum_{(x,y) \in D} \left[ \text{CE}(x, y) + \mathcal{L}_{distill}(x) \right]$$

- where CE is the cross-entropy loss.
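A minimal sketch of the layer-wise distillation loss, assuming PyTorch (the function name and variable names are illustrative, and the direction of the KL term follows a common distillation convention rather than being confirmed by the post):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_layers, teacher_layers,
                      student_logits, teacher_logits):
    """Sketch: MSE over Transformer layer outputs plus KL divergence
    between the prediction distributions."""
    # Sum of per-layer MSE losses between student and teacher outputs.
    l_trm = sum(F.mse_loss(s, t)
                for s, t in zip(student_layers, teacher_layers))
    # KL divergence on the prediction layer; kl_div expects the first
    # argument in log-space.
    l_pred = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return l_trm + l_pred
```

During training, this term is added to the cross-entropy loss on the ground-truth labels, as in the objective above.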

## 1.3. Model

# 2. Results

## 2.1. GLUE

MoEBERT **outperforms** all of the baseline methods in **6/7 tasks**.

## 2.2. Question Answering

MoEBERT **significantly outperforms all** of the baseline methods in terms of both evaluation metrics: **exact match (EM) and F1**.

## 2.3. Inference Speed

**DistilBERT** is a **shallower** model, i.e., it only has **6 layers instead of 12 layers**; whereas **MoEBERT** is a **narrower** model, i.e., the **hidden dimension of each expert is 768 instead of 3072**.

The speed of MoEBERT is **slightly slower than** DistilBERT, but **significantly faster than** BERT.