Brief Review — mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,
mT5, by Google Research
2021 NAACL, Over 800 Citations (Sik-Ho Tsang @ Medium)
- mT5, a multilingual variant of T5, is proposed that was pre-trained on a new Common Crawl-based dataset covering 101 languages.
2.1. Model Architecture
- 5 model variants are trained, mirroring the T5 sizes: Small, Base, Large, XL, and XXL (up to 13B parameters).
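For readers who want to try the released checkpoints, below is a minimal loading sketch, assuming the public Hugging Face Transformers checkpoints (google/mt5-small through google/mt5-xxl); this is not code from the paper.

```python
# Minimal sketch of loading one of the five mT5 variants via Hugging Face
# Transformers. Checkpoint names are the public google/mt5-* releases.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

variants = ["google/mt5-small", "google/mt5-base", "google/mt5-large",
            "google/mt5-xl", "google/mt5-xxl"]

checkpoint = variants[0]  # mT5-Small (~300M parameters)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = MT5ForConditionalGeneration.from_pretrained(checkpoint)

# Unlike T5, mT5 pre-training involves no supervised tasks, so the
# checkpoint must be fine-tuned before downstream use.
print(f"{checkpoint}: {model.num_parameters():,} parameters")
```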
All 71 monthly web scrapes released so far by Common Crawl are used, dramatically more source data than was used for C4, with filtering (e.g., a line-length heuristic) and deduplication applied to form the mC4 corpus.
With such a large dataset, there is better coverage of tail (low-resource) languages.
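As a rough illustration of the mC4 filtering mentioned above, the sketch below implements the paper's line-length heuristic (a page is kept only if it contains at least three lines of 200 or more characters); the function and constant names are illustrative, not from the paper's code.

```python
# Sketch of the mC4 line-length filter: keep a page only if it has at
# least three lines containing 200+ characters. Names are illustrative.
MIN_LONG_LINES = 3
MIN_LINE_CHARS = 200

def passes_line_length_filter(page_text: str) -> bool:
    long_lines = [line for line in page_text.splitlines()
                  if len(line.strip()) >= MIN_LINE_CHARS]
    return len(long_lines) >= MIN_LONG_LINES

# Example: a boilerplate-only page with only short lines is dropped.
print(passes_line_length_filter("Home\nAbout\nContact us"))  # False
```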
- To balance sampling across languages, a temperature exponent α is introduced to boost tail languages: examples from a language L are sampled with probability p(L) ∝ |L|^α, where |L| is the number of examples in that language.
α=0.3 is used in the final model, which gives a reasonable compromise between performance on high- and low-resource languages.
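To make the effect of α concrete, here is a small sketch (with made-up per-language example counts) that computes the sampling probabilities p(L) ∝ |L|^α and compares α = 1 (sampling proportional to data size) with the paper's α = 0.3:

```python
# Sampling probability per language: p(L) ∝ |L|**alpha, where |L| is the
# number of examples in language L. Lower alpha boosts tail languages.
# The example counts below are made up for illustration only.
def sampling_probs(sizes, alpha):
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

sizes = {"en": 3_000_000_000, "ru": 700_000_000, "sw": 3_000_000}

for alpha in (1.0, 0.3):
    probs = sampling_probs(sizes, alpha)
    print(alpha, {lang: round(p, 4) for lang, p in probs.items()})

# With alpha = 1, Swahili gets under 0.1% of samples; with alpha = 0.3 its
# share rises to several percent, at the expense of English.
```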
The largest model mT5-XXL exceeds state-of-the-art on all classification and QA tasks and is near SOTA on NER (69.2 vs. 70.1).
The gap between T5 and mT5 on English tasks narrows as the model gets larger.
Performance improves with increasing model capacity.
- The importance of in-language training data (whether gold In-Language Multitask or synthetic Translate-Train) decreases with model scale, as the Zero-Shot setting closes the quality gap at larger sizes.