Brief Review — AlexaTM: Alexa Teacher Model

17M AlexaTM, By 2-Stage Pretraining and Distilling

4 min read · Apr 22, 2023

**Amazon Alexa** (Image from Pexels, by Anete Lusina)

Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems,
AlexaTM, by Amazon and Spotify,
2022 KDD, Over 40 Citations (Sik-Ho Tsang @ Medium)
Language Model
1991 … 2021 [Performer] [gMLP] [Roformer] [PPBERT] [DeBERTa] [DeLighT] [Transformer-LS] 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] 2023 [GPT-4]

Large language models (LLMs) perform well, but their high latency makes them a poor fit for latency-sensitive applications such as Amazon Alexa.
Through pretraining and distillation, a 2.3B-parameter teacher model is distilled into a 170M-parameter intermediate student model.
With further distillation and fine-tuning, the final 17M-parameter student model is obtained.
AlexaTM 20B was developed after this paper. (I hope to find time to read it in the future.)

Outline

1. AlexaTM
2. Results

1. AlexaTM

1.1. Stage 1 Teacher

In Stage 1, a large teacher is first pretrained on public data.
The teacher models are based on RoBERTa, but modified to use a pre-layernorm architecture.
Training uses the masked language modeling (MLM) objective from BERT: 15% of tokens are selected for prediction, of which 80% are replaced with the mask token, 10% are replaced with a random token, and 10% are kept unchanged.
Mixed precision training is used.
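
To make the masking rule concrete, here is a minimal sketch of BERT-style token corruption. It is not the authors' code: the function name `mask_tokens`, the `[MASK]` string, and the toy vocabulary are assumptions for illustration, and a real pipeline would operate on token IDs in batches.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol (assumption, not from the paper)


def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Apply BERT-style MLM corruption to a list of tokens.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: no MLM loss at this position
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

The model is trained to predict `labels[i]` only at the selected positions; the unchanged and random-token cases keep it from relying on seeing the mask symbol at every prediction site.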

Teacher models are trained with up to 9.3B non-embedding parameters, using DeepSpeed.

1.2. Stage 2 Teacher

In Stage 2, pretraining continues on in-house data to create a new teacher.

The goal was to improve the model’s specialization and ability to handle virtual assistant utterances, which are typically short and often ungrammatical.

1.3. Intermediate Student

Then, an intermediate student is distilled, first from the Stage 1 teacher and, once training has converged, from the Stage 2 teacher.

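The intermediate student (which later acts as the teacher for the final student) is trained with the sum of categorical cross-entropy (the MLM loss) and soft cross-entropy against the teacher's output distribution, weighted equally.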
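
The equally weighted loss above can be sketched for a single masked position as follows. This is only an illustration under assumptions: the function names are hypothetical, no temperature scaling is applied (the review does not mention one), and real training would average this over all masked positions in a batch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def hard_cross_entropy(student_logits, target_index):
    """Categorical cross-entropy against the true masked token (MLM loss)."""
    return -math.log(softmax(student_logits)[target_index])


def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's full distribution."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def distillation_loss(student_logits, teacher_logits, target_index):
    """Sum of the hard and soft terms with equal weight."""
    return (hard_cross_entropy(student_logits, target_index)
            + soft_cross_entropy(student_logits, teacher_logits))
```

The hard term pulls the student toward the ground-truth token, while the soft term pulls it toward everything the teacher considers plausible at that position.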

1.4. Final Student

The intermediate student/teacher is then further trained on in-house unlabeled data before being distilled into the final student.

The loss is the same as for the intermediate student, plus an additional hidden-layer output matching term, as in TinyBERT.
The final student is then fine-tuned on labeled data.
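
As a rough picture of the hidden-layer output matching term used in the final distillation, TinyBERT computes a mean squared error between the teacher's hidden states and the student's hidden states after a learned linear projection into the teacher's width. The sketch below follows that idea; whether AlexaTM uses the same projection, layer mapping, or term weighting is not stated in this review, so treat all of it as an assumption. `weight` stands in for the learned projection matrix.

```python
def project(hidden, weight):
    """Multiply a [seq_len x d_student] matrix by a [d_student x d_teacher] matrix."""
    return [
        [sum(h * w for h, w in zip(row, col)) for col in zip(*weight)]
        for row in hidden
    ]


def hidden_matching_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between projected student and teacher hidden states."""
    projected = project(student_hidden, weight)
    squared_errors = [
        (p - t) ** 2
        for p_row, t_row in zip(projected, teacher_hidden)
        for p, t in zip(p_row, t_row)
    ]
    return sum(squared_errors) / len(squared_errors)
```

The projection is needed because a 17M-parameter student will typically have a narrower hidden size than its teacher, so the two sets of hidden states cannot be compared directly.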

2. Results

2.1. Stage 1

**Correlation to XNLI accuracy from (a) perplexity and (b) mask-filling accuracy across model updates using 2.3B-parameter model.**

To monitor training progress, one standard approach is to measure perplexity on a held-out validation set. The authors also developed a mask-filling accuracy metric.

Both perplexity and mask-filling accuracy correlate strongly with XNLI performance across model update steps.
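
For reference, the two monitoring metrics can be sketched as below. Perplexity is the standard exponential of the mean negative log-likelihood. The mask-filling accuracy here is assumed to be a plain top-1 match rate on masked positions; the paper's exact definition is not reproduced in this review, so that part is an assumption, as are the function names.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural-log probabilities)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mask_filling_accuracy(predictions, references):
    """Fraction of masked positions where the top prediction equals the original token."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```

Lower perplexity and higher mask-filling accuracy both indicate a better fit to held-out text, which is why either can serve as a cheap proxy for downstream accuracy during pretraining.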

**Results on XNLI for the Stage 1 pretrained 2.3B- and 9.3B-parameter models, as well as the 170M-parameter model distilled from the Stage 1 2.3B-parameter model.**

The 2.3B-parameter and 9.3B-parameter models are competitive with comparably-sized public models, even though the model training set was 70% spoken-form data, whereas the public models and XNLI use written-form data.

English XNLI accuracy drops by 3 points after distillation from 2.3B non-embedding parameters to 170M parameters, as well as by 3.1 points for average zero-shot accuracy.

**No-fine-tune perplexity and noun mask-filling accuracy on spoken-form data only, macro-averaged across all languages.**

Perplexity is expected to decrease, and noun mask-filling accuracy to increase, with increasing model size.

2.2. Stage 2

**(a) Full fine-tuning and (b) frozen-encoder results for the 2.3B-parameter Stage 2 model, the distilled 170M-parameter Stage 2 model, and the 17M-parameter Stage 2 model.**

A negative value indicates a reduced error rate versus the baseline 2.3B-parameter Stage 1 model.

Stage 2 domain-adaptive pretraining shows improved results on intent classification and slot filling tasks when compared with a model trained only on public data.
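
The sign convention in the Stage 2 table can be illustrated with a small helper. It assumes the reported values are relative changes in error rate against the baseline, expressed in percent; the review only states the sign convention, so the relative form and the function name are assumptions.

```python
def relative_error_change(model_error, baseline_error):
    """Percent change in error rate versus the baseline; negative means fewer errors."""
    return 100.0 * (model_error - baseline_error) / baseline_error
```

Under this reading, a model whose error rate is lower than the Stage 1 baseline's yields a negative number, and a model that makes more errors yields a positive one.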

2.3. Distillation

**Exact match results for AlexaTM distilled models and** **DistilBERT** **versus** **XLM-R.**

Both distilled models outperform both public models on average. Most encouragingly, the 17M-parameter model (improvement of 4.23% versus XLM-R) shows only minimal degradation versus the 170M-parameter model (improvement of 4.82% versus XLM-R).

**Results from a virtual assistant experimentation platform from 2 experiments (Exp)**

Models produced using the pretraining and distillation pipeline reduce overall user dissatisfaction by 3.74% to 4.91% and tail utterance dissatisfaction by 7.50% to 10.3% in the A/B test framework.

Written by Sik-Ho Tsang
