Brief Review — mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

mT5, Multilingual Version of T5

Apr 29, 2023

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,
mT5, by Google Research
2021 NAACL, Over 800 Citations (Sik-Ho Tsang @ Medium)
Language Model
1991 … 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] 2023 [GPT-4]

mT5, a multilingual variant of T5, is proposed. It is pre-trained on a new Common Crawl-based dataset covering 101 languages.

Outline

T5
mT5
Results

1. T5

T5 is pre-trained on an English-only corpus.

2. mT5

2.1. Model Architecture

mT5 is based on the “T5.1.1” recipe, which improves upon T5 by using GeGLU nonlinearities and by scaling both d_model and d_ff, instead of just d_ff, in the larger models.
Pre-training uses the same “span-corruption” objective as T5, on unlabeled data only and with no dropout.
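
To make the span-corruption objective concrete, here is a minimal sketch of how one input/target pair could be built. It is only an illustration: the function name `span_corrupt` is hypothetical, the spans are fixed by hand rather than sampled at random as in real pre-training, and the `<extra_id_N>` sentinel naming is an assumption borrowed from common T5 tokenizer conventions.

```python
def span_corrupt(tokens, spans):
    """Replace each span with a sentinel token; return (inputs, targets).

    `spans` is a sorted list of non-overlapping, half-open (start, end)
    index pairs. In real pre-training these would be sampled randomly;
    here they are passed in so the example is deterministic.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # mark the span in the target
        targets.extend(tokens[start:end])    # the model must predict this
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder sees the text with the spans masked out, and the decoder is trained to emit only the dropped spans, each introduced by its sentinel.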

Five model variants are trained: mT5-Small, Base, Large, XL and XXL.

2.2. Datasets

All 71 monthly web scrapes released so far by Common Crawl are used. This is dramatically more source data than was used for C4. The scrapes are filtered and cleaned before use.

With such a large dataset, tail (low-resource) languages can be covered better.

To balance sampling across languages, examples are drawn with probability p(L) ∝ |L|^α, where |L| is the number of examples in language L. Choosing α < 1 boosts the tail.

α=0.3 is used in the final model, which gives a reasonable compromise between performance on high- and low-resource languages.
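
The effect of α can be checked with a short sketch, assuming the sampling rule from the mT5 paper, p(L) ∝ |L|^α, where |L| is the number of examples in language L. The function name `sampling_probs` and the example counts are hypothetical and are not the real dataset statistics.

```python
def sampling_probs(counts, alpha=0.3):
    """Return p(L) proportional to |L| ** alpha, normalised to sum to 1."""
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Made-up counts: one high-resource and one low-resource language.
counts = {"en": 1_000_000, "sw": 1_000}

for alpha in (1.0, 0.3, 0.0):
    probs = sampling_probs(counts, alpha)
    print(alpha, {lang: round(p, 3) for lang, p in probs.items()})
```

With α=1 the languages are sampled in proportion to their size, so the low-resource language is almost never seen. With α=0 sampling is uniform across languages, which risks over-fitting on the small ones. Values in between, such as 0.3, trade off the two.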

Compared to T5, vocabulary size is increased to 250,000 wordpieces, using SentencePiece.

3. Results

**Results on XTREME sentence-pair classification, structured prediction and question answering tasks.**

The largest model mT5-XXL exceeds state-of-the-art on all classification and QA tasks and is near SOTA on NER (69.2 vs. 70.1).

**Comparison of T5 vs. mT5 on SQuAD question answering (F1/EM).**

The gap between T5 and mT5 closes as the models get larger.

**Average F1 on the TyDi QA GoldP task across languages.**

Performance improves with increasing model capacity.

The importance of in-language training data (whether gold In-Language Multitask or synthetic Translate-Train) decreases with model scale, as seen by Zero-Shot closing the quality gap.

Newer T5/mT5 variants have also been developed, such as nmT5, Switch Transformer and ByT5.

Written by Sik-Ho Tsang