Brief Review — AlexaTM: Alexa Teacher Model
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems,
AlexaTM, by Amazon and Spotify,
2022 KDD, Over 40 Citations (Sik-Ho Tsang @ Medium)
- Large language models (LLMs) perform well but have high latency, which makes them a poor fit for the Amazon Alexa application.
- By pretraining and distilling, a 2.3B teacher model is distilled to a 170M intermediate student model.
- By further distilling and fine-tuning, the final 17M student model is obtained.
- After this paper, AlexaTM 20B was also developed. (Hope I can have time to read it in the future.)
1.1. Stage 1 Teacher
- A large teacher is first pretrained using public data as Stage 1.
- The teacher models are based on RoBERTa, but modified to use a pre-layernorm architecture.
- Training uses the masked language modeling (MLM) objective as in BERT: 15% of tokens are selected for prediction, of which 10% are kept unchanged, 10% are replaced with a random token, and the remaining 80% are replaced with the mask token.
- Mixed precision training is used.
Teacher models are trained with up to 9.3B non-embedding parameters, using DeepSpeed.
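The masking scheme described above can be sketched in a few lines of Python. This is only an illustration of BERT-style token corruption, not the authors' code: the function name `mask_tokens`, the `[MASK]` symbol, the toy vocabulary, and the per-token independent sampling are all assumptions made for the example.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, vocab, select_prob=0.15, keep_prob=0.10,
                random_prob=0.10, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None where the position is ignored by the MLM loss.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for token in tokens:
        # About 85% of positions are left alone and are not predicted.
        if rng.random() >= select_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < keep_prob:
            corrupted.append(token)              # 10%: kept unchanged
        elif roll < keep_prob + random_prob:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(MASK_TOKEN)         # remaining 80%: mask token
    return corrupted, labels
```

For example, `mask_tokens(["play", "some", "jazz"], ["play", "some", "jazz", "music"])` returns a corrupted copy of the utterance and the labels the model should recover. In practice this would operate on subword IDs rather than strings.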
1.2. Stage 2 Teacher
- Pretraining continues with Stage 2 in-house data to create a new teacher.
The goal was to improve the model’s specialization and ability to handle virtual assistant utterances, which are typically short and often ungrammatical.
1.3. Intermediate Student
- Then, an intermediate student is distilled, first from the Stage 1 teacher and, once it has converged, from the Stage 2 teacher.
The intermediate student/teacher is trained with the sum of the categorical cross-entropy (MLM loss) and the soft cross-entropy, weighted equally.
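As a rough sketch of the equally weighted loss above, the Python below computes it for a single masked position. The function names and the example logits are hypothetical, and no temperature scaling is applied since the review does not mention one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_cross_entropy(student_logits, target_index):
    """Categorical cross-entropy against the true masked token (MLM loss)."""
    return -math.log(softmax(student_logits)[target_index])


def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's full distribution."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def distillation_loss(student_logits, teacher_logits, target_index):
    """Sum of the two terms with equal (unit) weights."""
    return (hard_cross_entropy(student_logits, target_index)
            + soft_cross_entropy(student_logits, teacher_logits))
```

The soft term is smallest when the student's distribution matches the teacher's, which is what pulls the student toward the teacher even on tokens the hard label does not cover.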
1.4. Final Student
- The intermediate student/teacher is then further trained on in-house unlabeled data before being distilled into the final student.
The loss is the same as for the intermediate student, plus an additional hidden-layer output matching term, as in TinyBERT.
The final student is then fine-tuned on labeled data.
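The hidden-layer matching term mentioned above can be illustrated with the TinyBERT formulation, where the student's hidden states are linearly projected to the teacher's width and compared with mean squared error. This is a minimal sketch under that assumption: whether AlexaTM uses the same projection and layer mapping is not stated in this review, and the function names are hypothetical.

```python
def project(hidden, weight):
    """Multiply a [seq_len x d_student] matrix by a [d_student x d_teacher] matrix."""
    return [
        [sum(h * w for h, w in zip(row, column)) for column in zip(*weight)]
        for row in hidden
    ]


def hidden_matching_loss(student_hidden, teacher_hidden, projection):
    """Mean squared error between projected student and teacher hidden states."""
    projected = project(student_hidden, projection)
    squared_errors = [
        (s - t) ** 2
        for s_row, t_row in zip(projected, teacher_hidden)
        for s, t in zip(s_row, t_row)
    ]
    return sum(squared_errors) / len(squared_errors)
```

In training, `projection` would be a learnable matrix and this term would be added to the cross-entropy terms; here it is a plain nested list to keep the example dependency-free.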
2.1. Stage 1
- To monitor the progress of training, one standard approach is to measure perplexity on a held-out validation dataset. The authors also developed a mask-filling accuracy metric.
Both perplexity and mask-filling accuracy correlate strongly with XNLI performance across model update steps.
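For reference, the two monitoring metrics can be sketched as follows. Perplexity is the standard exponential of the mean negative log-likelihood; the mask-filling accuracy here is assumed to be top-1 exact match on masked positions, since the exact definition used by the authors is not given in this review. Function names and inputs are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability of the held-out tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mask_filling_accuracy(predictions, references):
    """Fraction of masked positions where the top prediction equals the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

A model that assigns probability 0.25 to every held-out token has a perplexity of 4, so lower perplexity and higher mask-filling accuracy both indicate a better fit.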
The 2.3B-parameter and 9.3B-parameter models are competitive with comparably-sized public models, even though the model training set was 70% spoken-form data, whereas the public models and XNLI use written-form data.
- English XNLI accuracy drops by 3 points after distillation from 2.3B non-embedding parameters to 170M parameters, and average zero-shot accuracy drops by 3.1 points.
As expected, perplexity decreases and noun mask-filling accuracy increases as model size grows.
2.2. Stage 2
- A negative value indicates a reduced error rate versus the baseline 2.3B-parameter Stage 1 model.
Stage 2 domain-adaptive pretraining shows improved results on intent classification and slot filling tasks when compared with a model trained only on public data.
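To make the sign convention concrete, here is a small sketch of a relative error-rate change. It assumes the reported values are percentage changes relative to the baseline's error rate, which this review does not spell out; the function name and the numbers in the usage note are hypothetical and are not results from the paper.

```python
def relative_error_change(model_error, baseline_error):
    """Percentage change in error rate versus the baseline (negative means fewer errors)."""
    return 100.0 * (model_error - baseline_error) / baseline_error
```

With made-up error rates, `relative_error_change(9.0, 10.0)` is negative because the model makes fewer errors than the baseline, while `relative_error_change(11.0, 10.0)` is positive.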
Both distilled models outperform both public models on average. Most encouragingly, the 17M-parameter model (improvement of 4.23% versus XLM-R) shows only minimal degradation versus the 170M-parameter model (improvement of 4.82% versus XLM-R).
Models produced using the pretraining and distillation pipeline reduce overall user dissatisfaction by 3.74% to 4.91% and tail utterance dissatisfaction by 7.50% to 10.3% in the A/B test framework.