Brief Review — AlexaTM: Alexa Teacher Model

17M AlexaTM, By 2-Stage Pretraining and Distilling

Sik-Ho Tsang
4 min read · Apr 22


Amazon Alexa (Image from Pexels, by Anete Lusina)

Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems,
AlexaTM, by Amazon, and Spotify,
2022 KDD, Over 40 Citations (Sik-Ho Tsang @ Medium)

Language Model
1991 … 2021 [Performer] [gMLP] [Roformer] [PPBERT] [DeBERTa] [DeLighT] [Transformer-LS] 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] 2023 [GPT-4]

Conceptual Pipeline
  • Large language models (LLMs) perform well, but their high latency makes them impractical for the Amazon Alexa application.
  • Through pretraining and distillation, a 2.3B-parameter teacher model is distilled into a 170M-parameter intermediate student model.
  • Through further distillation and fine-tuning, the final 17M-parameter student model is obtained.
  • After this paper, AlexaTM 20B was also developed. (Hope I can have time to read it in the future.)


  1. AlexaTM
  2. Results

1. AlexaTM

The model training pipeline.

1.1. Stage 1 Teacher

  • A large teacher is first pretrained using public data as Stage 1.
  • The teacher models are based on RoBERTa, but modified to use a pre-layernorm architecture.
  • Training was conducted using the masked language modeling (MLM) objective as used in BERT, in which 15% of tokens are selected for prediction. Of the selected tokens, 80% are replaced with the mask token, 10% are kept unchanged, and 10% are replaced with a random token.
  • Mixed precision training is used.
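To make the MLM corruption step concrete, here is a minimal sketch in plain Python. It is not the authors' code. The 80% mask share follows the standard BERT recipe, and the names `mask_tokens`, `MASK_TOKEN`, and `select_prob` are made up for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol; the real tokenizer's mask symbol may differ


def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Not selected: the token is left alone and contributes no loss.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(tok)                # 10%: keep the token unchanged
    return corrupted, labels
```

The model is then trained to predict `labels` at the selected positions only, so the "unchanged" and "random" cases keep it from relying on the mask symbol always being present.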

Teacher models are trained with up to 9.3B non-embedding parameters, using DeepSpeed.
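The pre-layernorm modification mentioned above only changes where normalization sits relative to the residual connection. The sketch below contrasts the two orderings on a toy vector. It assumes a layer norm without learned gain or bias and a generic `sublayer` callable standing in for attention or the feed-forward network; all function names are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post-layernorm ordering (BERT/RoBERTa): LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-layernorm ordering: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


# Toy usage: a "sublayer" that simply doubles its input.
double = lambda v: [2.0 * u for u in v]
print(pre_ln_block([1.0, 2.0, 4.0], double))
print(post_ln_block([1.0, 2.0, 4.0], double))
```

In the pre-layernorm form the residual path carries `x` through untouched, which is commonly credited with making very deep or very large models easier to train.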

1.2. Stage 2 Teacher

  • Pretraining continues with Stage 2 in-house data to create a new teacher.

The goal was to improve the model’s specialization and ability to handle virtual assistant utterances, which are typically short and often ungrammatical.

1.3. Intermediate Student

  • Then, an intermediate student is distilled, first from the Stage 1 teacher and, once it has converged, from the Stage 2 teacher.

The intermediate student/teacher is trained with the sum of categorical cross-entropy (the MLM loss) and soft cross-entropy (against the teacher's output distribution), weighted equally.
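The equal-weight loss can be sketched for a single masked position as follows. This is a simplified illustration, not the authors' implementation. It assumes a softmax temperature of 1 and a tiny vocabulary, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_cross_entropy(student_logits, target_index):
    """Categorical cross-entropy against the true masked token (the MLM loss)."""
    return -math.log(softmax(student_logits)[target_index])


def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student distribution against the teacher's soft targets."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


def distillation_loss(student_logits, teacher_logits, target_index):
    """Equal-weight sum of the hard and soft terms for one masked position."""
    return (hard_cross_entropy(student_logits, target_index)
            + soft_cross_entropy(student_logits, teacher_logits))


# A student that agrees with the teacher and the label gets a lower loss.
print(distillation_loss([4.0, 0.0, 0.0], [4.0, 0.0, 0.0], 0))
print(distillation_loss([0.0, 4.0, 0.0], [4.0, 0.0, 0.0], 0))
```

The hard term ties the student to the ground-truth token, while the soft term passes on the teacher's full distribution over the vocabulary.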

1.4. Final Student

  • The intermediate student/teacher is then further trained on in-house unlabeled data before being distilled into the final student.

The loss is the same as for the intermediate student, plus an additional hidden-layer output matching term, as in TinyBERT.
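A rough sketch of the extra hidden-layer term is given below. It follows the TinyBERT idea of projecting the smaller student hidden vector into the teacher's dimension and taking a mean squared error. The projection matrix `weight`, the `hidden_weight` factor, and the function names are assumptions for illustration; the weighting actually used is not stated in this review.

```python
def project(hidden, weight):
    """Map a student hidden vector into the teacher's dimension (hidden @ weight).

    weight is a list of rows with shape (d_student, d_teacher).
    """
    return [sum(h * w for h, w in zip(hidden, col)) for col in zip(*weight)]


def hidden_mse(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student and teacher hidden vectors."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)


def final_student_loss(logit_loss, student_hidden, teacher_hidden, weight,
                       hidden_weight=1.0):
    """Logit-level distillation loss plus the hidden-layer matching term.

    hidden_weight is a hypothetical knob, not a value from the paper.
    """
    return logit_loss + hidden_weight * hidden_mse(student_hidden, teacher_hidden, weight)


# Toy usage: a 2-dimensional student matched against a 3-dimensional teacher.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
print(final_student_loss(0.5, [1.0, 2.0], [1.0, 2.0, 3.0], W))
```

In a real setup the projection would be learned jointly with the student, and the term would be averaged over tokens and over the matched layers.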

The final student is then fine-tuned on labeled data.

2. Results

2.1. Stage 1

Correlation to XNLI accuracy from (a) perplexity and (b) mask-filling accuracy across model updates using 2.3B-parameter model.
  • To monitor training progress, one standard approach is to measure perplexity on a held-out validation dataset. The authors also developed a mask-filling accuracy metric for this purpose.

Both perplexity and mask-filling accuracy correlate strongly with XNLI performance across model update steps.
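For intuition, the two monitoring metrics can be sketched as below. The paper's exact definitions are not reproduced in this review, so this assumes perplexity is computed from the probabilities assigned to the correct tokens at masked positions, and that mask-filling accuracy counts top-1 matches. The function names are illustrative.

```python
import math


def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


def mask_filling_accuracy(predictions, targets):
    """Fraction of masked positions where the top prediction equals the original token."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy usage on four masked positions.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
print(mask_filling_accuracy(["jazz", "on", "the", "lights"],
                            ["jazz", "on", "my", "lights"]))
```

Lower perplexity and higher mask-filling accuracy both indicate a better model, which is why the two move in opposite directions as training progresses.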

Results on XNLI for the Stage 1 pretrained 2.3B- and 9.3B-parameter models, as well as the 170M-parameter model distilled from the Stage 1 2.3B-parameter model.

The 2.3B-parameter and 9.3B-parameter models are competitive with comparably-sized public models, even though the model training set was 70% spoken-form data, whereas the public models and XNLI use written-form data.

  • English XNLI accuracy drops by 3 points after distillation from 2.3B non-embedding parameters to 170M parameters, and average zero-shot accuracy drops by 3.1 points.

No-fine-tune perplexity and noun mask-filling accuracy on spoken-form data only, macro-averaged across all languages.

As expected, perplexity decreases and noun mask-filling accuracy increases as model size grows.

2.2. Stage 2

(a) Full fine-tuning and (b) frozen-encoder results for the 2.3B-parameter Stage 2 model, the distilled 170M-parameter Stage 2 model, and the 17M-parameter Stage 2 model.
  • A negative value indicates a reduced error rate versus the baseline 2.3B-parameter Stage 1 model.
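The sign convention can be illustrated with a small helper. It assumes the reported numbers are relative changes in error rate against the baseline, which this review does not state explicitly. The function name and the example error rates are made up.

```python
def relative_error_change(model_error, baseline_error):
    """Percent change in error rate versus the baseline; negative means fewer errors."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (model_error - baseline_error) / baseline_error


# Hypothetical error rates: the model errs on 9% of utterances, the baseline on 10%.
print(relative_error_change(9.0, 10.0))
```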

Stage 2 domain-adaptive pretraining shows improved results on intent classification and slot filling tasks when compared with a model trained only on public data.

2.3. Distillation

Exact match results for AlexaTM distilled models and DistilBERT versus XLM-R.

Both distilled models outperform both public models on average. Most encouragingly, the 17M-parameter model (improvement of 4.23% versus XLM-R) shows only minimal degradation versus the 170M-parameter model (improvement of 4.82% versus XLM-R).

Results from a virtual assistant experimentation platform from 2 experiments (Exp)

Models produced using the pretraining and distillation pipeline reduce overall user dissatisfaction by 3.74% to 4.91% and tail utterance dissatisfaction by 7.50% to 10.3% in the A/B test framework.


