Review — MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

MiniLM, for Compressing Language Models

Sik-Ho Tsang
5 min read · Sep 8, 2024

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
MiniLM, by Microsoft Research
2020 NeurIPS, Over 1000 Citations (Sik-Ho Tsang @ Medium)


  • MiniLM is proposed to compress large Transformer-based pre-trained models with an approach termed deep self-attention distillation.
  • The small model (student) is trained by deeply mimicking the self-attention module of the last Transformer layer of the teacher (the large model).
  • The scaled dot-product between values in the self-attention module is introduced as new deep self-attention knowledge, in addition to the attention distributions.

Outline

  1. Comparisons with Prior Arts
  2. MiniLM
  3. Results

1. Comparisons with Prior Arts

1.1. Preliminaries

  • Knowledge distillation (KD) trains the small student model S on a transfer feature set with soft labels and intermediate representations provided by the large teacher model T:

L_{KD} = \sum_{x \in D} L\big(f^S(x), f^T(x)\big)

  • where D denotes the training data, f^S(·) and f^T(·) indicate the features of the student and teacher models respectively, and L(·) represents the loss function.
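
A rough sketch of this generic objective in PyTorch (not the authors' code; the model objects, batch format, and the choice of MSE as L(·) are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def kd_step(student, teacher, batch, optimizer):
    """One distillation step: push the student's features toward the teacher's."""
    with torch.no_grad():
        f_t = teacher(batch)          # teacher features f^T(x), kept frozen
    f_s = student(batch)              # student features f^S(x)
    loss = F.mse_loss(f_s, f_t)       # L(f^S(x), f^T(x)); MSE chosen only for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```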

1.2. Prior Arts

Comparisons with Prior Arts
  • Prior arts such as DistilBERT, TinyBERT and MobileBERT, as shown above, have limitations in model distillation: they require layer-to-layer distillation, or they constrain the number of layers or the hidden size of the student model.

MiniLM does not impose such limitations or constraints for model distillation.

2. MiniLM

Proposed MiniLM

2.1. Self-Attention Distribution Transfer

MiniLM minimizes the KL-divergence between the self-attention distributions of the teacher and student:

L_{AT} = \frac{1}{A_h |x|} \sum_{a=1}^{A_h} \sum_{t=1}^{|x|} D_{KL}\big(A^T_{L,a,t} \,\|\, A^S_{M,a,t}\big)

  • where |x| and A_h represent the sequence length and the number of attention heads, L and M represent the number of layers of the teacher and student, and A^T_L and A^S_M are the attention distributions of the last Transformer layer for the teacher and student, respectively.

Different from previous works that transfer the teacher's knowledge layer-to-layer, only the attention maps of the teacher's last layer are used.

Distilling knowledge from the last Transformer layer also allows more flexibility in the number of layers of the student models and avoids the effort of finding the best layer mapping.
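
A minimal PyTorch-style sketch of this last-layer attention transfer, assuming the per-head attention probabilities from the last layer of both models are already available (tensor names and shapes are illustrative, not the authors' code):

```python
import torch

def attention_transfer_loss(attn_teacher, attn_student, eps=1e-12):
    """KL(A^T_L || A^S_M), averaged over batch, heads and query positions.

    attn_teacher, attn_student: [batch, heads, seq_len, seq_len] attention
    probabilities from the LAST Transformer layer of teacher and student
    (the same number of attention heads is assumed, as in the paper).
    """
    kl = attn_teacher * (torch.log(attn_teacher + eps) - torch.log(attn_student + eps))
    return kl.sum(dim=-1).mean()   # sum over keys gives KL per query position, then average
```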

2.2. Self-Attention Value-Relation Transfer

  • The knowledge of queries and keys is transferred via attention distributions.

The value relation is computed via the multi-head scaled dot-product between values, and the KL-divergence between the value relations of the teacher and student is used as the training objective:

VR^T_{L,a} = \mathrm{softmax}\Big(\frac{V^T_{L,a} {V^T_{L,a}}^{\top}}{\sqrt{d_k}}\Big), \qquad VR^S_{M,a} = \mathrm{softmax}\Big(\frac{V^S_{M,a} {V^S_{M,a}}^{\top}}{\sqrt{d'_k}}\Big)

L_{VR} = \frac{1}{A_h |x|} \sum_{a=1}^{A_h} \sum_{t=1}^{|x|} D_{KL}\big(VR^T_{L,a,t} \,\|\, VR^S_{M,a,t}\big)

  • where V^T_{L,a} and V^S_{M,a} are the values of an attention head in the self-attention module of the teacher's and student's last Transformer layer, and d_k and d'_k are the per-head dimensions of the teacher and student.
  • VR^T_L and VR^S_M are the value relations of the last Transformer layer for teacher and student, respectively.

The final training loss is the sum of the attention-distribution transfer loss and the value-relation transfer loss: L = L_{AT} + L_{VR}.
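
Continuing the illustrative sketch from above (again not the authors' code), the value relation and the combined objective could be written as:

```python
import math
import torch
import torch.nn.functional as F

def kl_per_position(p, q, eps=1e-12):
    # KL(p || q) over the last dimension, averaged over batch, heads and positions
    return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1).mean()

def value_relation(values):
    """VR = softmax(V V^T / sqrt(d_head)) for each attention head.

    values: [batch, heads, seq_len, head_dim] from the last self-attention layer;
    teacher and student may use different head_dim (d_k vs. d'_k).
    """
    scores = torch.matmul(values, values.transpose(-1, -2)) / math.sqrt(values.size(-1))
    return F.softmax(scores, dim=-1)

def minilm_loss(attn_t, attn_s, values_t, values_s):
    """L = L_AT + L_VR: attention-distribution KL plus value-relation KL."""
    l_at = kl_per_position(attn_t, attn_s)
    l_vr = kl_per_position(value_relation(values_t), value_relation(values_s))
    return l_at + l_vr
```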

2.3. Teacher Assistant

A teacher assistant, which has the same number of layers as the teacher but the hidden size of the student, is first distilled from the teacher and then used as the teacher of the final, smaller student. Combining deep self-attention distillation with a teacher assistant brings further improvements for smaller student models.

2.4. Setup

  • BERTBASE is used as the teacher. English Wikipedia and BookCorpus are used as the pretraining data.
  • For the training of multilingual MiniLM, XLM-RBASE is used as the teacher.

3. Results

3.1. SOTA Comparisons

Performance on SQuAD 2.0 and GLUE

MiniLM outperforms DistilBERT, TinyBERT and two BERT baselines across most tasks.

Performance for Smaller student models

MiniLM with TA outperforms soft label distillation and TinyBERT on the three tasks.

3.2. Ablation Studies

Left: Parameters and Time, Right: Value-Relation Transfer
  • Table 4: The 6-layer, 768-dimensional student model is 2.0× faster than BERTBASE, while retaining more than 99% of its performance on a variety of tasks, such as SQuAD 2.0 and MNLI.
  • Table 5: Distilling the fine-grained knowledge of value relation helps the student model deeply mimic the self-attention behavior of the teacher, which further improves model performance.
  • Table 6: Using value relation achieves better performance. Specifically, it brings about a 1.0% F1 improvement on the SQuAD benchmark.
  • Table 7: Using the last layer achieves better results.

3.3. Multilingual MiniLM Results

Left: XNLI, Right: MLQA
  • Table 8: MiniLM achieves competitive performance on XNLI with far fewer Transformer parameters. The 12×384 MiniLM compares favorably with mBERT and XLM trained with the MLM objective.
  • Table 9: The 12×384 MiniLM outperforms mBERT and XLM. The 6-layer MiniLM also achieves competitive performance.

3.4. Analysis

Clustering Representation for Analysis
  • MiniLM, distilled from the teacher's last Transformer layer, also learns phrase-level information well. Moreover, lower layers of MiniLM encode phrasal information better than higher layers.
