Review — MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
MiniLM, for Compressing Language Models
MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
MiniLM, by Microsoft Research
2020 NeurIPS, Over 1000 Citations (Sik-Ho Tsang @ Medium)
- MiniLM is proposed to compress large Transformer-based pre-trained models; the approach is termed deep self-attention distillation.
- The small model (student) is trained by deeply mimicking the self-attention module of the last Transformer layer of the teacher.
- The scaled dot-product between values in the self-attention module is introduced as the new deep self-attention knowledge, in addition to the attention distributions.
Outline
- Comparisons with Prior Arts
- MiniLM
- Results
1. Comparisons with Prior Arts
1.1. Preliminaries
- Knowledge distillation (KD) is to train the small student model S on a transfer feature set with soft labels and intermediate representations provided by the large teacher model T.
- Here D denotes the training data, f^S(·) and f^T(·) indicate the features of the student and teacher models respectively, and L(·) represents the loss function.
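With this notation, the knowledge distillation objective used in the paper can be written as:

$$\mathcal{L}_{\mathrm{KD}} = \sum_{e \in \mathcal{D}} L\big(f^{S}(e),\, f^{T}(e)\big)$$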
1.2. Prior Arts
- As shown above, the prior arts DistilBERT, TinyBERT and MobileBERT have some limitations for model distillation, e.g., they require layer-to-layer distillation, or they constrain the number of layers or the hidden size of the student model.
MiniLM does not impose such limitations or constraints for model distillation.
2. MiniLM
2.1. Self-Attention Distribution Transfer
MiniLM minimizes the KL-divergence between the self-attention distributions of the teacher and student. As given in the paper, the attention-transfer loss is:
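$$\mathcal{L}_{\mathrm{AT}} = \frac{1}{A_h\,|x|} \sum_{a=1}^{A_h} \sum_{t=1}^{|x|} D_{KL}\big(\mathbf{A}^{T}_{L,a,t}\,\big\|\,\mathbf{A}^{S}_{M,a,t}\big)$$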
- Where |x| and A_h represent the sequence length and the number of attention heads. L and M represent the number of layers for the teacher and student. A^T_L and A^S_M are the attention distributions of the last Transformer layer for the teacher and student, respectively.
Different from previous works, which transfer the teacher’s knowledge layer-to-layer, only the attention maps of the teacher’s last layer are used.
Distilling the knowledge of the last Transformer layer also allows more flexibility in the number of layers of the student models and avoids the effort of finding the best layer mapping.
2.2. Self-Attention Value-Relation Transfer
- The knowledge of queries and keys is transferred via attention distributions.
The value relation is computed via the multi-head scaled dot-product between values, and the KL-divergence between the value relations of the teacher and student is used as the training objective. As given in the paper:
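$$\mathrm{VR}^{T}_{L,a} = \mathrm{softmax}\Big(\frac{\mathbf{V}^{T}_{L,a}\,\mathbf{V}^{T\,\top}_{L,a}}{\sqrt{d_k}}\Big), \qquad \mathrm{VR}^{S}_{M,a} = \mathrm{softmax}\Big(\frac{\mathbf{V}^{S}_{M,a}\,\mathbf{V}^{S\,\top}_{M,a}}{\sqrt{d'_k}}\Big)$$

$$\mathcal{L}_{\mathrm{VR}} = \frac{1}{A_h\,|x|} \sum_{a=1}^{A_h} \sum_{t=1}^{|x|} D_{KL}\big(\mathrm{VR}^{T}_{L,a,t}\,\big\|\,\mathrm{VR}^{S}_{M,a,t}\big)$$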
- Where V^T_{L,a} and V^S_{M,a} are the values of an attention head in the self-attention module of the teacher’s and student’s last layer.
- VR^T_L and VR^S_M are the value relations of the last Transformer layer for the teacher and student, respectively.
The final training loss is computed by summing the attention-distribution transfer loss and the value-relation transfer loss.
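That is, the overall objective is $\mathcal{L} = \mathcal{L}_{\mathrm{AT}} + \mathcal{L}_{\mathrm{VR}}$. Below is a minimal PyTorch-style sketch of this combined loss, assuming the teacher and student share the number of attention heads and the sequence length; the function and tensor names are my own for illustration and are not from the released MiniLM code.

```python
# A minimal PyTorch-style sketch of deep self-attention distillation:
# attention-distribution transfer + value-relation transfer on the last layer.
# Function and tensor names are my own for illustration, not the MiniLM code.
import torch
import torch.nn.functional as F

def relation_kl(teacher_rel, student_rel):
    """Mean KL divergence D_KL(teacher || student) over relation matrices.

    Both tensors have shape (batch, A_h, |x|, |x|) and hold probability
    distributions along the last dimension."""
    kl = F.kl_div(student_rel.clamp_min(1e-12).log(), teacher_rel, reduction="none")
    return kl.sum(dim=-1).mean()   # sum over the distribution, average the rest

def value_relation(v):
    """Scaled dot-product among values: (B, A_h, |x|, d_k) -> (B, A_h, |x|, |x|)."""
    d_k = v.size(-1)
    return (v @ v.transpose(-1, -2) / d_k ** 0.5).softmax(dim=-1)

def minilm_loss(t_attn, s_attn, t_values, s_values):
    """t_attn / s_attn: last-layer attention maps, shape (B, A_h, |x|, |x|).
    t_values / s_values: last-layer value vectors, shape (B, A_h, |x|, d_k);
    the teacher and student may use different head dimensions d_k."""
    l_at = relation_kl(t_attn, s_attn)                                      # L_AT
    l_vr = relation_kl(value_relation(t_values), value_relation(s_values))  # L_VR
    return l_at + l_vr                                                      # L = L_AT + L_VR

# Tiny usage example with random tensors (batch 2, 12 heads, length 8).
B, H, T_len = 2, 12, 8
loss = minilm_loss(
    t_attn=torch.rand(B, H, T_len, T_len).softmax(-1),
    s_attn=torch.rand(B, H, T_len, T_len).softmax(-1),
    t_values=torch.randn(B, H, T_len, 64),
    s_values=torch.randn(B, H, T_len, 32),  # smaller student head dimension
)
print(loss.item())
```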
2.3. Teacher Assistant
- Teacher Assistant is utilized to further improve the model performance of smaller students.
- Assuming the teacher model consists of an L-layer Transformer with hidden size d_h, the teacher is first distilled into a teacher assistant with an L-layer Transformer and hidden size d'_h.
- Then the teacher assistant is distilled into the smaller student model, which has an M-layer Transformer with hidden size d'_h.
Combining deep self-attention distillation with a teacher assistant brings further improvements for smaller student models.
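A schematic of the two-stage setup is sketched below; `ModelSize` and `distill_stages` are illustrative names of my own, and the concrete sizes (12×768 teacher, 12×384 assistant, 6×384 student) follow the configurations discussed in the paper.

```python
# Illustrative sketch only: sizes follow the example configurations in the paper
# (BERT_BASE-like teacher; the assistant keeps the teacher's depth L but uses the
# student's hidden size d'_h; the student then shrinks the depth to M).
from dataclasses import dataclass

@dataclass
class ModelSize:
    layers: int   # number of Transformer layers
    hidden: int   # hidden size

teacher   = ModelSize(layers=12, hidden=768)   # L-layer, d_h
assistant = ModelSize(layers=12, hidden=384)   # L-layer, d'_h (teacher assistant)
student   = ModelSize(layers=6,  hidden=384)   # M-layer, d'_h (final student)

# Stage 1: teacher -> assistant; Stage 2: assistant -> student.
# Both stages use the deep self-attention distillation loss sketched above.
distill_stages = [(teacher, assistant), (assistant, student)]
```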
2.4. Setup
3. Results
3.1. SOTA Comparisons
MiniLM outperforms DistilBERT, TinyBERT and two BERT baselines across most tasks.
MiniLM with TA outperforms soft label distillation and TinyBERT on the three tasks.
3.2. Ablation Studies
- Table 4: The 6-layer 768-dimensional student model is 2.0× faster than BERT_BASE, while retaining more than 99% of its performance on a variety of tasks, such as SQuAD 2.0 and MNLI.
- Table 5: Distilling the fine-grained knowledge of value relation helps the student model deeply mimic the self-attention behavior of the teacher, which further improves model performance.
- Table 6: Using value relation achieves better performance. Specifically, it brings about a 1.0% F1 improvement on the SQuAD benchmark.
- Table 7: Using the last layer achieves better results.
3.3. Multilingual MiniLM Results
- Table 8: MiniLM achieves competitive performance on XNLI with much fewer Transformer parameters. 12×384 MiniLM compares favorably with mBERT and XLM trained on the MLM objective.
- Table 9: The 12×384 MiniLM performs better than mBERT and XLM. The 6-layer MiniLM also achieves competitive performance.
3.4. Analysis
- MiniLM distilled from teacher’s last Transformer layer also learns the phrase-level information well. Moreover, lower layers of MiniLM also encode phrasal information better than higher layers.