Brief Review — mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

mT5, Multilingual Version of T5

Sik-Ho Tsang
3 min read · Apr 29, 2023

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,
mT5, by Google Research
2021 NAACL, Over 800 Citations (Sik-Ho Tsang @ Medium)


  • mT5, a multilingual variant of T5, is pre-trained on a new Common Crawl-based dataset (mC4) covering 101 languages.


  1. T5
  2. mT5
  3. Results

1. T5

  • T5 is pre-trained on English-only text (the C4 corpus), casting every NLP task into a unified text-to-text format.
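The text-to-text framing means every task is expressed as a string with a task prefix prepended to the input. A minimal sketch — the prefix strings follow T5's published conventions, but the helper function itself is illustrative, not from the paper:

```python
def to_text_to_text(task: str, text: str) -> str:
    """Format an example in T5's text-to-text style: task prefix + input.

    The prefixes below follow T5's conventions (e.g. "summarize: ");
    this helper is a hypothetical illustration, not the paper's code.
    """
    prefixes = {
        "translate_en_de": "translate English to German: ",
        "summarize": "summarize: ",
        "cola": "cola sentence: ",
    }
    return prefixes[task] + text

# Every task — translation, summarization, classification — becomes
# plain string-to-string prediction for the same model.
print(to_text_to_text("summarize", "mT5 extends T5 to 101 languages."))
```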

2. mT5

2.1. Model Architecture

mT5 is based on the “T5.1.1” recipe, which improves upon T5 by using GeGLU nonlinearities and by scaling both d_model and d_ff (instead of just d_ff) in the larger models.

As in T5, pre-training uses the “span-corruption” objective on unlabeled data only, with no dropout.
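In span corruption, random contiguous spans of the input are replaced by sentinel tokens, and the model learns to predict each sentinel followed by the tokens it replaced. A simplified sketch (the span-selection procedure here is cruder than the paper's, but the input/target format matches T5's):

```python
import random

def span_corruption(tokens, noise_density=0.15, mean_span_length=3.0, seed=0):
    """Mask random contiguous spans; return (inputs, targets).

    Simplified sketch of T5's span-corruption objective: masked spans
    are replaced in the inputs by sentinels <extra_id_0>, <extra_id_1>, ...,
    and the targets list each sentinel followed by the tokens it replaced.
    """
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, round(n * noise_density))

    # Pick masked positions in short runs (simplified vs. the paper,
    # which samples span lengths from a distribution).
    masked = set()
    while len(masked) < num_to_mask:
        start = rng.randrange(n)
        for i in range(start, min(n, start + int(mean_span_length))):
            masked.add(i)
            if len(masked) >= num_to_mask:
                break

    inputs, targets = [], []
    sentinel_id = 0
    i = 0
    while i < n:
        if i in masked:
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            while i < n and i in masked:   # consume the whole masked run
                targets.append(tokens[i])
                i += 1
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

Concatenating the inputs (with sentinels expanded from the targets) recovers the original sequence, which is what makes the objective self-supervised.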

Model Variants
  • Five model variants are trained: mT5-Small, Base, Large, XL, and XXL (roughly 300M to 13B parameters).

2.2. Datasets

mC4 Datasets

All 71 monthly web scrapes released so far by Common Crawl are used — dramatically more source data than was used for C4 — with filtering (e.g., line-length and language-identification filters) and deduplication applied.

Language Distribution

With such a large dataset, coverage of tail (low-resource) languages improves.

  • To balance sampling across languages, a hyperparameter α is introduced to boost the tail: a language L is sampled with probability p(L) ∝ |L|^α, where |L| is the number of examples in that language.
Language Sampling
Different α

α=0.3 is used in the final model, which gives a reasonable compromise between performance on high- and low-resource languages.
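The exponential-smoothing rule above can be sketched directly. With α=1 sampling follows the raw data distribution; as α → 0 it approaches uniform, boosting low-resource languages (the example language counts below are made up for illustration):

```python
def sampling_probs(counts, alpha=0.3):
    """Per-language sampling probabilities p(L) proportional to |L|^alpha.

    alpha=1 reproduces the raw data distribution; smaller alpha flattens
    it toward uniform, boosting tail (low-resource) languages.
    """
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical page counts: a high-resource and a low-resource language.
counts = {"en": 3_000_000, "sw": 10_000}
raw = sampling_probs(counts, alpha=1.0)      # proportional to data size
smoothed = sampling_probs(counts, alpha=0.3) # mT5's final setting
```

At α=1 Swahili would be sampled only ~0.3% of the time here; at α=0.3 its share rises substantially, at a modest cost to the high-resource language.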

  • Compared to T5's 32,000, the vocabulary size is increased to 250,000 wordpieces, using SentencePiece.

3. Results

XTREME Benchmark
Results on XTREME sentence-pair classification, structured prediction and question answering tasks.

The largest model mT5-XXL exceeds state-of-the-art on all classification and QA tasks and is near SOTA on NER (69.2 vs. 70.1).

Comparison of T5 vs. mT5 on SQuAD question answering (F1/EM).

The gap between T5 and mT5 narrows as the model gets larger.

Average F1 on the TyDi QA GoldP task across languages.

Performance improves with increasing model capacity.

  • The importance of in-language training data (whether gold In-Language Multitask or synthetic Translate-Train) decreases with model scale, as seen by Zero-Shot closing the quality gap.
Further Works on T5/mT5
  • Newer T5/mT5 variants have since been developed, such as nmT5, Switch Transformer, and ByT5.


