MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation
MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation,
MoEBERT, by Georgia Institute of Technology and Microsoft,
2022 NAACL (Sik-Ho Tsang @ Medium)
Language Model
- MoEBERT is proposed, which uses a Mixture-of-Experts (MoE) structure (by G. E. Hinton et al.) to increase model capacity and inference speed.
- MoEBERT is initialized by adapting the feed-forward neural networks in a pre-trained model into multiple experts.
- As such, the representation power of the pre-trained model is largely retained. During inference, only one of the experts is activated for each token, so inference speed is improved.
Outline
- MoEBERT
- Results
1. MoEBERT
1.1. Importance Score
- Some columns of W1 (and, correspondingly, some rows of W2) contribute more to model performance than others.
- The importance score, originally introduced in the model pruning literature, measures such parameter importance.
- For a dataset D with sample pairs {(x, y)}, the score of the j-th neuron is defined as shown below, where w1j is the j-th column of W1, w2j is the j-th row of W2, and L(x, y) is the training loss.
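A sketch of this score in LaTeX form, following the standard pruning-style definition with the notation above (the exact form should be checked against the paper):

```latex
I_j = \sum_{(x,\,y) \in \mathcal{D}}
      \left| \big(w_1^{j}\big)^{\top} \nabla_{w_1^{j}} \mathcal{L}(x, y)
           + \big(w_2^{j}\big)^{\top} \nabla_{w_2^{j}} \mathcal{L}(x, y) \right|
```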
The finding is that sharing the most important columns among the experts benefits model performance. Based on this, the top-s columns (by importance score) are shared across all experts, and the remaining columns are split so that the FFN is adapted into N experts, as sketched in the code below.
- In the paper's figure, the blue neurons are the important ones that are shared among all experts.
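To make the adaptation and the single-expert inference path concrete, here is a minimal PyTorch-style sketch. The function names, shapes, shared-neuron count, and hash-style routing are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def adapt_ffn_to_experts(W1, b1, W2, b2, importance, num_experts=4, num_shared=512):
    """Adapt one pre-trained FFN (W1: d_model x d_ffn, W2: d_ffn x d_model)
    into `num_experts` smaller experts.

    The `num_shared` columns of W1 (and the matching rows of W2) with the
    highest importance scores are copied into every expert; the remaining
    columns/rows are distributed evenly among the experts.
    """
    order = torch.argsort(importance, descending=True)
    shared, rest = order[:num_shared], order[num_shared:]
    per_expert = len(rest) // num_experts  # leftover neurons are dropped in this sketch

    experts = []
    for e in range(num_experts):
        own = rest[e * per_expert:(e + 1) * per_expert]
        idx = torch.cat([shared, own])  # neuron indices kept by this expert
        experts.append({
            "W1": W1[:, idx].clone(), "b1": b1[idx].clone(),  # d_model x d_expert
            "W2": W2[idx, :].clone(), "b2": b2.clone(),       # d_expert x d_model
        })
    return experts

def expert_ffn(x, expert):
    """FFN forward pass through a single expert (x: num_tokens x d_model)."""
    h = F.gelu(x @ expert["W1"] + expert["b1"])
    return h @ expert["W2"] + expert["b2"]

def moe_ffn(x, token_ids, experts):
    """Each token is processed by exactly one expert, so the per-token FFN cost
    is that of one small expert. Routing here is a simple hash of the token id
    (an illustrative stand-in for the paper's routing strategy)."""
    out = torch.empty_like(x)
    assignment = token_ids % len(experts)
    for e, expert in enumerate(experts):
        mask = assignment == e
        if mask.any():
            out[mask] = expert_ffn(x[mask], expert)
    return out
```

Because the most important (shared) neurons appear in every expert, each expert keeps much of the pre-trained FFN's representation power while being only a fraction of its size.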
1.2. Distillation
- A fine-tuned (dense) BERT model is used as the teacher, and MoEBERT, which shares the same backbone, is the student.
- For the Transformer layers, the distillation loss is the mean squared error (MSE) between the teacher's layer output X_l^tea and the student's layer output X_l. Concretely, for an input x, it is the MSE summed over all L Transformer layers.
- For the prediction layer, the KL divergence between the teacher's and the student's output distributions is used as the distillation loss.
- The layer-wise distillation loss is the sum of these two losses.
- Given the training dataset D with samples {(x, y)}, the training objective combines the task's cross-entropy (CE) loss with the layer-wise distillation loss, as sketched below.
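A sketch of these loss terms in LaTeX form (the notation and the weighting factor lambda_distill are assumptions here; the exact equations are given in the paper):

```latex
% Transformer-layer distillation (MSE over all L layers) and prediction-layer KL
\mathcal{L}_{\mathrm{trm}}(x)  = \sum_{l=1}^{L} \mathrm{MSE}\!\left(X_l, \, X_l^{\mathrm{tea}}\right), \qquad
\mathcal{L}_{\mathrm{pred}}(x) = \mathrm{KL}\!\left(p^{\mathrm{tea}}(x) \,\|\, p(x)\right)

% Layer-wise distillation loss and overall training objective
\mathcal{L}_{\mathrm{distill}} = \mathcal{L}_{\mathrm{trm}} + \mathcal{L}_{\mathrm{pred}}, \qquad
\mathcal{L} = \sum_{(x,\,y) \in \mathcal{D}} \Big[ \mathrm{CE}(x, y) + \lambda_{\mathrm{distill}} \, \mathcal{L}_{\mathrm{distill}}(x) \Big]
```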
1.3. Model
2. Results
2.1. GLUE
MoEBERT outperforms all of the baseline methods on 6 of the 7 GLUE tasks.
2.2. Question Answering
MoEBERT significantly outperforms all of the baseline methods in terms of both evaluation metrics: exact match (EM) and F1.
2.3. Inference Speed
- DistilBERT is a shallower model: it has only 6 layers instead of 12. MoEBERT, in contrast, is a narrower model: each expert's FFN hidden dimension is 768 instead of 3072, so only about a quarter of the FFN computation is used per token.
The speed of MoEBERT is slightly lower than that of DistilBERT, but significantly higher than that of BERT.