Review — AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2seq Model

AlexaTM 20B, Outperforms PaLM 540B, GPT-3 175B, & BLOOM 175B, With Much Lower Carbon Footprint

Sik-Ho Tsang
6 min read · Apr 29


Amazon Alexa (Image from Pexels Jonathan Borba)

AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2seq Model,
AlexaTM 20B, by Amazon Alexa AI,
2022 arXiv v2 (Sik-Ho Tsang @ Medium)

Language Model
1991 … 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] 2023 [GPT-4]
Machine Translation
2013 2021 [ResMLP] [GPKD] [Roformer] [DeLighT] 2022 [DeepNet] [PaLM]

  • AlexaTM 20B is proposed: a large-scale multilingual sequence-to-sequence (seq2seq) model pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks. Such models are shown to be more efficient few-shot learners than decoder-only models on various tasks.
  • AlexaTM 20B is the first multilingual seq2seq model of this size; previous models had up to 11 billion parameters (mT5). It is also the first multilingual seq2seq model capable of in-context learning, since previous models only used denoising as the pretext task.


  1. AlexaTM 20B Model, Dataset, Training
  2. Results

1. AlexaTM 20B Model, Dataset, Training

1.1. Model

AlexaTM 20B Model Architecture

A standard Transformer architecture is used, with one small modification: the Layer Norms (in both the encoder and the decoder) are moved to the beginning of each layer, right after the skip connection, instead of sitting at the end of each layer (i.e., Pre-LN rather than Post-LN). This improves training stability.
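The difference between the two Layer Norm placements can be sketched in a few lines. This is a minimal illustration, not the AlexaTM code: `layer_norm` has no learned scale or shift, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN: the normalization is applied after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    """Pre-LN: only the sublayer input is normalized; the skip path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(h):
    """Hypothetical sublayer used only for illustration: doubles its input."""
    return [2.0 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)
```

In the Pre-LN variant the input `x` reaches the output through an unnormalized skip path, which is commonly given as the reason this placement trains more stably in deep stacks.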

1.2. Training Dataset

Training Dataset

The pre-training data consists of Wikipedia and mC4 datasets.

  • Data in 12 languages is used, namely Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu. The sequences of tokens are packed to produce sequences of approximately 1024 subword units. Unrelated content is separated by the special symbol [DOC].
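A rough sketch of such packing, under assumptions: the function name `pack_documents`, the greedy strategy, and the choice to let long documents spill over into the next sequence are all hypothetical, since the article does not describe the exact procedure.

```python
def pack_documents(docs, max_len=1024, sep="[DOC]"):
    """Greedily concatenate tokenized documents into sequences of at most
    max_len tokens, appending sep after each document to mark the boundary."""
    sequences, current = [], []
    for doc in docs:
        for token in doc + [sep]:
            current.append(token)
            if len(current) == max_len:
                sequences.append(current)
                current = []
    if current:  # keep the final, possibly shorter, sequence
        sequences.append(current)
    return sequences

# Tiny example with max_len=8 instead of 1024 so the result is easy to read.
docs = [["a"] * 5, ["b"] * 4, ["c"] * 3]
packed = pack_documents(docs, max_len=8)
```

Here the first packed sequence holds all of document "a", a [DOC] separator, and the start of document "b"; the remainder flows into the second sequence.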

1.3. Training Strategy

  • Frequency-based upsampling is used to increase the representation of under-represented languages, following XLM-R (Conneau et al., 2020). In particular, sentences are sampled according to a multinomial distribution with probabilities (q_1, q_2, …, q_N), where:

      q_i = p_i^α / Σ_{j=1..N} p_j^α,  with  p_i = n_i / Σ_{k=1..N} n_k

  • in which N is the total number of languages and n_i is the total number of sentences in language i (α is set to 0.5).
  • Wikipedia data (which has higher quality) is upsampled by 10 so that it is better represented in the overall data.
  • The mix is scaled to favor spoken format over written format at 7:3.
  • Subword tokenizer: 150K unigram SentencePiece model is used.
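To make the effect of α concrete, here is a small sketch of the language sampling probabilities, assuming the XLM-R form q_i = p_i^α / Σ_j p_j^α with p_i = n_i / Σ_k n_k. The sentence counts are hypothetical and chosen only to keep the arithmetic simple.

```python
def sampling_probabilities(sentence_counts, alpha=0.5):
    """Exponentially smoothed language sampling probabilities (XLM-R style)."""
    total = sum(sentence_counts.values())
    p = {lang: n / total for lang, n in sentence_counts.items()}
    norm = sum(v ** alpha for v in p.values())
    return {lang: (v ** alpha) / norm for lang, v in p.items()}

# Hypothetical counts: a high-resource and a low-resource language.
counts = {"en": 900, "mr": 100}
q = sampling_probabilities(counts)
```

With these counts, the low-resource language makes up 10% of the raw sentences but is sampled 25% of the time, since sqrt(0.1) / (sqrt(0.9) + sqrt(0.1)) = 0.25.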

1.4. Training Objectives

Training Objectives

During pre-training, the model is trained on the denoising task 80% of the time and on Causal Language Modeling (CLM) 20% of the time.

  • The AlexaTM 20B model class is derived from the BART class implementation in Hugging Face.
  • To speed up model training, the encoder is initialized from an internal 10B pre-trained encoder. During training, the encoder is frozen for around 100k updates and then unfrozen.
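The 80/20 mixture of objectives can be pictured as a weighted coin flip. This sketch assumes the choice is made independently per training example; the article does not say whether the actual mixing happens per example or per batch, and `choose_objective` is a hypothetical helper.

```python
import random

def choose_objective(rng, denoising_prob=0.8):
    """Pick the pre-training objective for one example."""
    return "denoising" if rng.random() < denoising_prob else "clm"

rng = random.Random(0)  # fixed seed so the run is repeatable
choices = [choose_objective(rng) for _ in range(10_000)]
denoising_share = choices.count("denoising") / len(choices)
```

Over many draws, `denoising_share` settles close to 0.8.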

1.4.1. Denoising

  • A denoising objective is used, in which 15% of the tokens are dropped from the input (in spans whose lengths are drawn from a Poisson distribution with mean 3) and the model is expected to reconstruct the input. No mask tokens are used at the input during training, to keep pre-training, inference, and fine-tuning as consistent as possible.
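A minimal sketch of this kind of span dropping follows. It is an illustration under assumptions, not the authors' implementation: span start positions are chosen uniformly at random, zero-length draws are bumped to 1 (which slightly raises the mean span length), and the names `sample_poisson` and `drop_spans` are hypothetical.

```python
import math
import random

def sample_poisson(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def drop_spans(tokens, rng, drop_rate=0.15, mean_span=3):
    """Return (corrupted_input, target). Dropped spans leave no mask token
    behind; the target is the full original sequence."""
    budget = round(len(tokens) * drop_rate)
    dropped = set()
    while len(dropped) < budget:
        span = max(1, sample_poisson(rng, mean_span))
        span = min(span, budget - len(dropped))  # never exceed the budget
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + span, len(tokens))):
            dropped.add(i)
    corrupted = [t for i, t in enumerate(tokens) if i not in dropped]
    return corrupted, list(tokens)

tokens = [f"t{i}" for i in range(100)]
corrupted, target = drop_spans(tokens, random.Random(0))
```

For a 100-token input, 15 tokens are removed and the encoder sees the remaining 85, while the decoder target is the untouched original.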

1.4.2. Causal Language Modeling (CLM) (i.e. PLM in T5)

  • For in-context learning, CLM is used.
  • In this task, the model is required to continue the input instead of denoising the input.
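As a rough picture of the CLM task for a seq2seq model, a document can be split into a prefix for the encoder and a continuation for the decoder. The helper `make_clm_example` and its `prefix_fraction` parameter are hypothetical; the article does not state how the split point is chosen.

```python
def make_clm_example(tokens, prefix_fraction=0.5):
    """Split a document (at least 2 tokens) into (encoder_input, decoder_target),
    where the target is the continuation of the input."""
    cut = int(len(tokens) * prefix_fraction)
    cut = max(1, min(len(tokens) - 1, cut))  # keep both sides non-empty
    return tokens[:cut], tokens[cut:]

tokens = "the cat sat on the mat".split()
prefix, continuation = make_clm_example(tokens)
```

Unlike denoising, nothing from the input is reconstructed: the decoder only produces text that follows the prefix, which is the same shape as an in-context learning prompt followed by its answer.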

1.5. Training Infrastructure

  • AlexaTM 20B is trained for 120 days on 128 A100 GPUs, for a total of 500k updates with an accumulated batch size of 2 million tokens.
  • DeepSpeed’s ZeRO Stage 3 (Rasley et al., 2020) is used to partition model weights, optimizer states, and gradients across all GPU workers, to obtain training throughput of up to 154 TFLOPS/GPU on 16 AWS p4d.24xlarge compute instances.
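The figures above can be cross-checked with some quick arithmetic. The only number not taken from the text is that a p4d.24xlarge instance has 8 A100 GPUs. The FLOP total is an upper bound that assumes every GPU ran at the quoted peak of 154 TFLOPS for the full 120 days, which is an idealization.

```python
GPUS_PER_P4D_INSTANCE = 8      # a p4d.24xlarge instance carries 8 A100 GPUs
instances = 16
updates = 500_000
tokens_per_update = 2_000_000  # accumulated batch size in tokens
days = 120
peak_tflops_per_gpu = 154

total_gpus = instances * GPUS_PER_P4D_INSTANCE   # matches the 128 GPUs quoted
total_tokens = updates * tokens_per_update       # tokens seen during pre-training
gpu_days = total_gpus * days

# Idealized upper bound on total compute, in FLOPs.
seconds = days * 24 * 3600
max_total_flops = total_gpus * peak_tflops_per_gpu * 1e12 * seconds
```

So 16 instances account for the 128 GPUs, and 500k updates of 2 million tokens each amount to about 1 trillion tokens processed.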

2. Results

  • AlexaTM 20B is evaluated using both zero/few-shot in-context learning as well as by fine-tuning the model on selected generation tasks. In all few-shot learning settings, greedy search is used.
  • (Please feel free to read the paper for details of the process in each setting.)
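Greedy search simply picks the highest-scoring next token at every step. The sketch below uses a hypothetical lookup table in place of a real model, so the tokens and scores are purely illustrative.

```python
def greedy_decode(next_token_scores, start_token, end_token, max_steps=20):
    """Repeatedly append the single highest-scoring next token."""
    sequence = [start_token]
    for _ in range(max_steps):
        scores = next_token_scores(sequence)
        token = max(scores, key=scores.get)
        if token == end_token:
            break
        sequence.append(token)
    return sequence[1:]  # drop the start token

# Hypothetical toy "model": scores depend only on the previous token.
table = {
    "<s>": {"hello": 0.6, "hi": 0.4},
    "hello": {"world": 0.7, "</s>": 0.3},
    "world": {"</s>": 0.9, "hello": 0.1},
}
output = greedy_decode(lambda seq: table[seq[-1]], "<s>", "</s>")
```

Because there is no sampling, the same prompt always yields the same output, which makes few-shot results reproducible.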

2.1. Multilingual Natural Language Generation

Multilingual Natural Language Generation

AlexaTM 20B performs better than or on par with the largest dense decoder-only model to date (i.e., PaLM 540B) in summarization, in both the 1-shot and fine-tuning settings.

2.2. Machine Translation

Machine Translation

AlexaTM 20B provides the best scores in Flores 101 Machine Translation (MT) in almost all language pairs.

2.3. Multilingual NLP Tasks

Multilingual NLP Tasks

AlexaTM 20B performs better than or on par with XGLM 7.5B (Lin et al., 2021) across all tasks and languages.

2.4. English NLP Tasks

AlexaTM 20B is on par with or better than the GPT-3 175B model across various English tasks in the zero-shot setting (e.g., SuperGLUE). AlexaTM 20B also outperforms the recently released BLOOM 175B decoder-only model.

On SQuADv2, AlexaTM 20B performs better than GPT-3 175B but does not reach PaLM 540B. The authors think scaling up may help.

2.5. Reasoning Tasks

Reasoning Tasks

AlexaTM 20B does not gain much from such special prompts, unlike much larger models such as GPT-3 175B. The results indicate that scaling up the model parameters is crucial for performing well on reasoning tasks.

2.6. Fairness, Bias, Toxicity

Fairness, Bias, Toxicity
  • In Table 14, on Winogender, the human performance on this task is 95.9%.

AlexaTM 20B achieves a new state-of-the-art of 82.63% in the zero-shot setting in the denoising mode.

  • In Table 15, an example for the female gender where the correct answer is nursing would be considered a “stereotypical” example, since the majority gender for the nursing profession is female. The “neutral” subset comprises examples with gender-neutral pronouns (“they”, “their”, “them”).

In the denoising mode, we observe that the stereotypical accuracy is greater than gotcha accuracy for both male and female subsets.

  • Table 16 shows the top 10 most frequent unique descriptor words in response to prompt templates.

While there is no evidence of hate or bias against any religious group, it is observed that there is bias against the demographic group “Black”.

  • In Figure 3, on RealToxicityPrompts dataset, average Toxicity Probability of Continuation (TPC) as a function of binned Toxicity Probability of Prompt (TPP) is measured.

TPC increases with TPP, i.e. toxic prompts lead the model to generate more toxic continuations.
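The binned average described for Figure 3 could be computed along these lines. The function name, the choice of 10 equal-width bins, and the sample scores are all hypothetical; they are not values from the paper.

```python
def mean_tpc_by_tpp_bin(pairs, num_bins=10):
    """pairs: iterable of (tpp, tpc), both in [0, 1].
    Returns {bin_index: mean TPC of the prompts whose TPP falls in that bin}."""
    sums, counts = {}, {}
    for tpp, tpc in pairs:
        b = min(int(tpp * num_bins), num_bins - 1)  # put tpp == 1.0 in the last bin
        sums[b] = sums.get(b, 0.0) + tpc
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Hypothetical (prompt toxicity, continuation toxicity) scores.
scores = [(0.05, 0.02), (0.08, 0.04), (0.55, 0.30), (0.95, 0.60), (1.0, 0.80)]
binned = mean_tpc_by_tpp_bin(scores)
```

Plotting the bin index against the mean TPC gives the kind of curve described above, which rises when more toxic prompts produce more toxic continuations.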

2.7. Carbon Footprint

Carbon Footprint
  • The carbon footprint of different models in tonnes of carbon dioxide equivalent (tCO2e) is shown.

As can be seen, despite matching or outperforming GPT-3 175B across different tasks, AlexaTM 20B pre-training has 1/5th of the carbon footprint of GPT-3. This points to another important factor in the efficiency of AlexaTM 20B pre-training.


