Brief Review — mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
mT5, Multilingual Version of T5
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,
mT5, by Google Research
2021 NAACL, Over 800 Citations (Sik-Ho Tsang @ Medium)
Language Model
- mT5, a multilingual variant of T5, is proposed; it is pre-trained on a new Common Crawl-based dataset (mC4) covering 101 languages.
Outline
1. T5
2. mT5
3. Results
2. mT5
2.1. Model Architecture
mT5 is based on the “T5.1.1” recipe, which improves upon T5 by using GeGLU nonlinearities and by scaling both d_model and d_ff (instead of just d_ff) in the larger models.
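To make the GeGLU feed-forward block concrete, here is a minimal NumPy sketch with toy dimensions and random weights; the function names (gelu, geglu_ffn) are chosen here for illustration and are not from the mT5 codebase.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, W_in, V_in, W_out):
    """GeGLU feed-forward block in the T5.1.1 style:
    GELU(x @ W_in) gated by (x @ V_in), then projected back to d_model."""
    return (gelu(x @ W_in) * (x @ V_in)) @ W_out

# Toy shapes only (not the real mT5 dimensions).
d_model, d_ff, seq_len = 8, 32, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))
y = geglu_ffn(x,
              rng.standard_normal((d_model, d_ff)),
              rng.standard_normal((d_model, d_ff)),
              rng.standard_normal((d_ff, d_model)))
print(y.shape)  # (4, 8)
```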
As in T5, pre-training uses the “span-corruption” objective, on unlabeled data only and with no dropout.
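Below is a toy sketch of what span corruption produces: spans of the input are replaced by sentinel tokens, and the target lists each sentinel followed by the tokens it replaced. The span-sampling logic is heavily simplified and span_corrupt is an illustrative helper, not the exact T5/mT5 preprocessing.

```python
import random

def span_corrupt(tokens, noise_density=0.15, mean_span_len=3, seed=0):
    """Toy span corruption: mask random spans with <extra_id_N> sentinels;
    the target reconstructs the dropped tokens after their sentinels."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * noise_density))
    masked = set()
    while len(masked) < n_mask:
        start = rng.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + mean_span_len)):
            masked.add(i)
    inputs, targets, sid, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            sentinel = f"<extra_id_{sid}>"
            sid += 1
            inputs.append(sentinel)
            targets.append(sentinel)
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

inp, tgt = span_corrupt("Thank you for inviting me to your party last week".split())
print(inp)  # corrupted input with sentinel tokens
print(tgt)  # sentinels followed by the tokens they replaced
```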
- 5 model variants are trained: mT5-Small, mT5-Base, mT5-Large, mT5-XL, and mT5-XXL.
2.2. Datasets
All 71 monthly web scrapes released so far by Common Crawl are used (with filtering and deduplication applied) to build the multilingual dataset mC4. This is dramatically more source data than was used for C4.
With such a large dataset, there is much better coverage of tail (low-resource) languages.
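As an illustration of this kind of page-level filtering, here is a minimal sketch of a line-length filter; the thresholds are illustrative assumptions, and the actual mC4 pipeline also applies cld3-based language identification and deduplication, which are omitted here.

```python
def keep_page(page_text, min_long_lines=3, min_chars=200):
    """Keep a page only if it contains at least `min_long_lines` lines
    with `min_chars` or more characters (assumed thresholds; language ID
    and deduplication steps from the real pipeline are not shown)."""
    long_lines = [ln for ln in page_text.splitlines() if len(ln.strip()) >= min_chars]
    return len(long_lines) >= min_long_lines

print(keep_page("just a few short lines\n" * 5))                       # False
print(keep_page(("a reasonably long line of text " * 10 + "\n") * 3))  # True
```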
- To balance sampling across languages, examples are drawn with probability p(L) ∝ |L|^α, where |L| is the number of examples in language L; an exponent α < 1 boosts the tail (low-resource) languages.
α = 0.3 is used in the final model, which gives a reasonable compromise between performance on high- and low-resource languages.
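A minimal sketch of this sampling rule with made-up per-language example counts (the real counts come from mC4) shows how α < 1 shifts probability mass toward low-resource languages:

```python
import numpy as np

def sampling_probs(num_examples, alpha=0.3):
    """Language-sampling probabilities p(L) proportional to |L|**alpha;
    alpha < 1 boosts low-resource languages relative to their raw share."""
    counts = np.asarray(num_examples, dtype=np.float64)
    p = counts ** alpha
    return p / p.sum()

# Made-up counts: one high-resource and one low-resource language.
counts = [1_000_000_000, 1_000_000]
print(sampling_probs(counts, alpha=1.0))  # ~[0.999, 0.001]  raw proportions
print(sampling_probs(counts, alpha=0.3))  # ~[0.888, 0.112]  tail boosted
```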
- Compared to T5, the vocabulary size is increased to 250,000 wordpieces, using SentencePiece.
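For concreteness, here is a hedged sketch of training such a vocabulary with the SentencePiece library; the input path and some trainer options are assumptions for illustration rather than the exact settings used for mT5 (the paper trains the vocabulary on data sampled with the same α = 0.3 rates).

```python
import sentencepiece as spm

# Sketch only: "mc4_sample.txt" is an assumed path to a language-balanced
# sample of mC4 text; the options below are illustrative, not the official config.
spm.SentencePieceTrainer.train(
    input="mc4_sample.txt",
    model_prefix="mt5_sp",        # writes mt5_sp.model / mt5_sp.vocab
    vocab_size=250_000,           # 250,000 wordpieces as described in the paper
    model_type="unigram",
    character_coverage=0.99999,   # high coverage for languages with large character sets
)

sp = spm.SentencePieceProcessor(model_file="mt5_sp.model")
print(sp.encode("mT5 is a massively multilingual text-to-text model.", out_type=str))
```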
3. Results
On the XTREME benchmark tasks, the largest model, mT5-XXL, exceeds state-of-the-art on all classification and QA tasks and is near SOTA on NER (69.2 vs. 70.1).
The gap between T5 and mT5 closes as the model gets larger: performance improves with increasing model capacity.
- The importance of in-language training data (whether gold In-Language Multitask data or synthetic Translate-Train data) decreases with model scale, as the Zero-Shot setting closes the quality gap.
- Other, newer T5/mT5 variants have also been developed, such as nmT5, Switch Transformer, and ByT5.