Review — TinyBERT: Distilling BERT for Natural Language Understanding

TinyBERT, Outperforms BERT_TINY, Much Smaller Than BERT_BASE

Sik-Ho Tsang
7 min read · Jun 25, 2022


TinyBERT two-stage learning framework

TinyBERT: Distilling BERT for Natural Language Understanding
TinyBERT, by Huazhong University of Science and Technology, Huawei Noah’s Ark Lab, and Huawei Technologies Co., Ltd.
2020 EMNLP, Over 600 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model

  • A new two-stage learning framework is proposed for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages.

Outline

  1. Preliminaries
  2. TinyBERT Knowledge Distillation Losses
  3. TinyBERT Knowledge Distillation Stages
  4. Experimental Results

1. Preliminaries

1.1. Transformer Layer

  • A standard Transformer layer includes two main sub-layers: multi-head attention (MHA) and a fully connected feed-forward network (FFN).
  • For MHA, there are three components: queries, keys and values, denoted as matrices Q, K and V respectively. The attention function can be formulated as shown below.
  • Multi-head attention is defined by concatenating the attention heads from different representation subspaces, as shown below.
  • FFN contains two linear transformations and one ReLU activation, as shown below.
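For reference, these three standard components can be written as follows (notation as in the paper; d_k is the key dimension and W, W1, W2, b1, b2 are learned parameters):

```latex
\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]
\[
\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W
\]
\[
\text{FFN}(x) = \max(0,\ xW_1 + b_1)\,W_2 + b_2
\]
```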

1.2. Knowledge Distillation (KD)

  • Knowledge distillation (KD) aims to transfer the knowledge of a large teacher network T to a small student network S.
  • The student network is trained to mimic the behavior of the teacher network. fT and fS denote the behavior functions of the teacher and student networks, respectively. KD can be modeled as minimizing the objective function given below, where x is the text input and X denotes the training dataset.
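In the paper's notation, this objective is:

```latex
\[
L_{\text{KD}} = \sum_{x \in \mathcal{X}} L\big(f^{S}(x),\, f^{T}(x)\big)
\]
```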

2. TinyBERT Knowledge Distillation Losses

2.1. Problem Formulation

M out of N layers of the teacher model are chosen for the Transformer-layer distillation.

The m-th layer of the student model learns the information from the g(m)-th layer of the teacher model, where g(·) is the layer mapping function.

  • Formally, the student can acquire knowledge from the teacher by minimizing the objective given below, where Llayer refers to the loss function of a given model layer (e.g., Transformer layer or embedding layer), fm(x) denotes the behavior function induced from the m-th layer, and λm is a hyperparameter that represents the importance of the m-th layer's distillation.
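As written in the paper, the layer-to-layer distillation objective is:

```latex
\[
L_{\text{model}} = \sum_{x \in \mathcal{X}} \sum_{m=0}^{M+1} \lambda_m\, L_{\text{layer}}\big(f^{S}_{m}(x),\, f^{T}_{g(m)}(x)\big)
\]
```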

2.2. Transformer-Layer Distillation

Transformer-layer distillation consists of Attnloss (attention-based distillation) and Hidnloss (hidden-states-based distillation).

2.2.1. Attention-Based Distillation

  • Attention weights learned by BERT can capture rich linguistic knowledge, including syntax and coreference information, which is essential for natural language understanding. MSE is used for the attention-based distillation so that this linguistic knowledge can be transferred from the teacher (BERT) to the student (TinyBERT). The loss is given below, where h is the number of attention heads.
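Here A_i^S and A_i^T denote the attention matrices of the i-th head of the student and teacher, respectively:

```latex
\[
L_{\text{attn}} = \frac{1}{h} \sum_{i=1}^{h} \text{MSE}\big(A^{S}_{i},\, A^{T}_{i}\big)
\]
```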

2.2.2. Hidden-States-Based Distillation

  • The knowledge from the output of the Transformer layer is also distilled; the objective, given below, minimizes the distance between the hidden states of the student and the teacher using MSE.
  • Since d′ (the student's hidden size) is often smaller than d (the teacher's hidden size), the matrix Wh is a learnable linear transformation that maps the hidden states of the student network into the same space as the teacher network's states.
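With H^S and H^T denoting the hidden states of the student and teacher, the loss is:

```latex
\[
L_{\text{hidn}} = \text{MSE}\big(H^{S} W_h,\, H^{T}\big)
\]
```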

2.3. Embedding-Layer Distillation

  • The embedding-layer distillation loss, given below, is also an MSE, where the matrices ES and ET refer to the embeddings of the student and teacher networks, respectively, and the matrix We is a learnable linear transformation playing a similar role to Wh.
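In the paper's notation:

```latex
\[
L_{\text{embd}} = \text{MSE}\big(E^{S} W_e,\, E^{T}\big)
\]
```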

2.4. Prediction-Layer Distillation

  • Knowledge distillation is also used to fit the predictions of the teacher model, as in conventional KD. Specifically, a soft cross-entropy loss between the student network's logits and the teacher's logits is used, as given below, where zS and zT are the logit vectors predicted by the student and teacher respectively, CE means the cross-entropy loss, and t is the temperature value (t = 1 in this paper).
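The prediction-layer distillation loss is:

```latex
\[
L_{\text{pred}} = \text{CE}\big(z^{T}/t,\ z^{S}/t\big)
\]
```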

2.5. Overall

  • To unify the above objectives, the distillation loss for the corresponding layers of the teacher and the student network is given below.
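In the paper's formulation, the per-layer loss is:

```latex
\[
L_{\text{layer}} =
\begin{cases}
L_{\text{embd}}, & m = 0 \\
L_{\text{hidn}} + L_{\text{attn}}, & M \geq m > 0 \\
L_{\text{pred}}, & m = M + 1
\end{cases}
\]
```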

3. TinyBERT Knowledge Distillation Stages

  • The application of BERT usually consists of two learning stages: pre-training and fine-tuning. Accordingly, the TinyBERT framework includes both a general distillation stage and a task-specific distillation stage.

3.1. General Distillation

  • General distillation helps TinyBERT learn the rich knowledge embedded in the pre-trained BERT.
  • The original BERT without fine-tuning is used as the teacher and a large-scale general-domain text corpus is used as the training data. The proposed Transformer distillation, but without the prediction-layer distillation, is performed.

However, due to the significant reductions of the hidden/embedding size and the number of layers, the general TinyBERT performs worse than BERT.

3.2. Task-Specific Distillation

  • Task-specific distillation further teaches TinyBERT the knowledge from a fine-tuned BERT. The proposed Transformer distillation is re-performed on an augmented task-specific dataset.
  • Specifically, the fine-tuned BERT is used as the teacher and a data augmentation method is proposed to expand the task-specific training set.
  • By training with more task-related examples, the generalization ability of the student model can be further improved.

3.3. Data Augmentation

Data Augmentation Procedure for Task-Specific Distillation
  • Pre-trained language model and GloVe word embeddings are combined to do word-level replacement for data augmentation.
  • Specifically, the language model is used to predict word replacements for single-piece words, and the word embeddings are used to retrieve the most similar words as replacements for multiple-piece words.
  • p_t = 0.4, N_a = 20, and K = 15 are used in the above algorithm (a minimal sketch follows this list).
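Below is a minimal sketch of this word-level replacement idea, not the paper's exact Algorithm 1. The helpers bert_mlm_topk (masked-LM candidates from a pre-trained BERT) and glove_topk (nearest neighbours in GloVe space), as well as the tokenizer argument, are hypothetical placeholders supplied by the caller.

```python
import random

P_T, N_A, K = 0.4, 20, 15  # replacement prob., augmentations per sample, candidates per word

def augment(sentence, tokenizer, bert_mlm_topk, glove_topk):
    """Generate N_A augmented copies of a sentence by word-level replacement."""
    words = sentence.split()
    augmented = []
    for _ in range(N_A):
        new_words = list(words)
        for i, word in enumerate(words):
            # Single-piece word: ask the masked LM for context-aware candidates.
            # Multiple-piece word: fall back to nearest neighbours in GloVe space.
            if len(tokenizer.tokenize(word)) == 1:
                candidates = bert_mlm_topk(words, position=i, k=K)
            else:
                candidates = glove_topk(word, k=K)
            # Replace the word with a sampled candidate with probability P_T.
            if candidates and random.random() < P_T:
                new_words[i] = random.choice(candidates)
        augmented.append(" ".join(new_words))
    return augmented
```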

4. Experimental Results

4.1. TinyBERT Variants

  • TinyBERT4: A tiny student model (the number of layers M=4, the hidden size d’=312, the feedforward/filter size di=1200 and the head number h=12) that has a total of 14.5M parameters.
  • BERT_BASE (N=12, d=768, di=3072 and h=12) is used as the teacher model, which contains 109M parameters.
  • TinyBERT6 (M=6, d’=768, di=3072 and h=12) has the same architecture as BERT6-PKD (Sun et al., 2019) and DistilBERT6.

4.2. Results on GLUE

Results are evaluated on the test set of the official GLUE benchmark.
  • There is a large performance gap between BERT_TINY (or BERT_SMALL) and BERT_BASE due to the dramatic reduction in model size.
  • TinyBERT4 is consistently better than BERT_TINY on all the tasks and obtains a large improvement of 6.8% on average.
  • TinyBERT4 significantly outperforms the 4-layer state-of-the-art KD baselines (i.e., BERT4-PKD and DistilBERT4) by a margin of at least 4.4%, with only ~28% of the parameters and a 3.1× inference speedup.
  • Compared with the teacher BERT_BASE, TinyBERT4 is 7.5× smaller and 9.4× faster at inference, while maintaining competitive performance.
  • TinyBERT is also compared with the 24-layer MobileBERT_TINY, which is distilled from the 24-layer IB-BERT_LARGE. The results show that TinyBERT4 achieves the same average score as the 24-layer model with only 38.7% of the FLOPs.
  • When the capacity of the model is increased to TinyBERT6, its performance is further elevated: it outperforms the baselines of the same architecture by a margin of 2.6% on average and achieves results comparable to the teacher.

4.3. Effects of Learning Procedure

Ablation studies of different procedures

The results indicate that all three of the procedures are crucial for the proposed method.

4.4. Effects of Distillation Objective

Ablation studies of different distillation objectives in the TinyBERT learning

It is shown that all the proposed distillation objectives are useful. The performance without the Transformer-layer distillation (w/o Trm) drops significantly, from 75.6 to 56.3.

4.5. Effects of Mapping Function

Results (dev) of different mapping strategies for TinyBERT4
  • n = g(m): the effects of different mapping functions are investigated.
  • The original TinyBERT uses the uniform strategy; two typical baselines, the top strategy (g(m) = m + N − M; 0 < m ≤ M) and the bottom strategy (g(m) = m; 0 < m ≤ M), are compared, as illustrated below.
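As a concrete illustration (assuming TinyBERT4 with M = 4 student layers and the 12-layer BERT_BASE teacher, N = 12), the three strategies select the following teacher layers:

```latex
\begin{align*}
\text{uniform:} \quad & g(m) = m \cdot N/M && \Rightarrow \{3, 6, 9, 12\} \\
\text{top:}     \quad & g(m) = m + N - M   && \Rightarrow \{9, 10, 11, 12\} \\
\text{bottom:}  \quad & g(m) = m           && \Rightarrow \{1, 2, 3, 4\}
\end{align*}
```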

The uniform strategy covers the knowledge from the bottom to the top layers of BERT_BASE, and it achieves better performance.

  • (There are also other results in the appendix, please feel free to read the paper directly.)
