Brief Review: nmT5 — Is parallel data still relevant for pre-training massively multilingual language models?

nmT5, Further Parallel Data Pretraining for mT5

Sik-Ho Tsang
3 min read · Oct 22, 2023

nmT5, by Google Research
2021 ACL (Sik-Ho Tsang @ Medium)


  • mT5 is a massively multilingual version of T5. This paper investigates the impact of incorporating parallel data into mT5 pre-training.
  • Multi-tasking language modeling with objectives such as machine translation during pretraining is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks.


  1. nmT5
  2. Results

1. nmT5

1.1. Learning Objectives

Example source and targets for different text-to-text style pre-training objectives incorporating parallel data.
  • mT5-Large is used as the starting point.
  • TLM: A text-to-text version of translation language modeling, the objective proposed in XLM.
  • NMT: Standard machine translation.
  • Denoised-NMT: Similar to NMT, but spans in the source sentence are additionally masked. The model must now learn to implicitly perform language modeling of the source language while translating into the target language.
  • Denoised-NMT+LM: Similar to Denoised-NMT, but instead of implicit language modeling, the model must explicitly predict the source text in addition to the translation.
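To make the four objectives concrete, here is a minimal sketch of how one parallel sentence pair could be turned into a source/target example for each of them. It is my own illustration, not the paper's preprocessing: the `<extra_id_N>` sentinels follow the usual T5 convention, the helper names `mask_spans` and `build_examples` are hypothetical, the masked spans are passed in by hand instead of being sampled, task prefixes are omitted, and the Denoised-NMT+LM target is assumed to be the full source text followed by the translation.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # [start, end) token indices, assumed non-overlapping


def mask_spans(tokens: Sequence[str], spans: Sequence[Span]) -> Tuple[List[str], List[str]]:
    """Replace each span with a sentinel (T5-style span corruption).

    Returns (corrupted input tokens, tokens the model must reconstruct).
    """
    corrupted: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # hide the span in the input
        targets.append(sentinel)                # and ask for it in the target
        targets.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])
    return corrupted, targets


def build_examples(
    src: Sequence[str],
    tgt: Sequence[str],
    src_spans: Sequence[Span],
    pair_spans: Sequence[Span],
) -> Dict[str, Tuple[str, str]]:
    """Build one (source, target) string pair per objective (illustrative only)."""
    join = " ".join
    # TLM: mask spans over the concatenated sentence pair, predict the masked spans.
    pair_in, pair_out = mask_spans(list(src) + list(tgt), pair_spans)
    # Denoised variants: mask spans in the source sentence only.
    src_in, _ = mask_spans(src, src_spans)
    return {
        "TLM": (join(pair_in), join(pair_out)),
        "NMT": (join(src), join(tgt)),
        "Denoised-NMT": (join(src_in), join(tgt)),
        # Assumption: explicit LM means emitting the clean source, then the translation.
        "Denoised-NMT+LM": (join(src_in), join(list(src) + list(tgt))),
    }


examples = build_examples(
    src="I like green apples".split(),
    tgt="Me gustan las manzanas verdes".split(),
    src_spans=[(1, 2)],
    pair_spans=[(1, 2), (5, 6)],
)
for name, (source, target) in examples.items():
    print(f"{name}: {source!r} -> {target!r}")
```

The only difference between the last three objectives is what happens to the source side: NMT leaves it intact, Denoised-NMT corrupts it, and Denoised-NMT+LM corrupts it and also asks for it back in the output.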

1.2. Pretraining Datasets

  • For pre-training, the monolingual data is from mC4 (Xue et al., 2020) and parallel data is from OPUS-100 (Zhang et al., 2020).
  • OPUS-100 is a dataset of 55M translations covering 100 languages.
  • Pre-training is done with a batch size of 1M tokens and fine-tuning with 131,072 tokens.

1.3. Downstream Datasets

  • The evaluation for TyDi QA, MTOP and NER is done in the zero-shot setting, where the model is trained on the English data and evaluated on all languages.
  • For WikiLingua, the model is trained in a multilingual setting, using available training data for all languages.
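The two evaluation setups differ only in which training examples the model gets to see. A small sketch, assuming a hypothetical record layout of `(language code, text)` tuples and made-up helper names (none of this comes from the paper's code):

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Example = Tuple[str, str]  # (language code, text); hypothetical layout


def group_by_language(examples: Iterable[Example]) -> Dict[str, List[Example]]:
    """Bucket examples by language so each language can be scored separately."""
    buckets: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        buckets[ex[0]].append(ex)
    return dict(buckets)


def zero_shot_split(train: Iterable[Example], test: Iterable[Example], source_lang: str = "en"):
    """Zero-shot: fine-tune on the source language only, evaluate on all languages."""
    train_set = [ex for ex in train if ex[0] == source_lang]
    return train_set, group_by_language(test)


def multilingual_split(train: Iterable[Example], test: Iterable[Example]):
    """Multilingual: fine-tune on every language that has training data."""
    return list(train), group_by_language(test)


train = [("en", "a"), ("en", "b"), ("sw", "c"), ("fi", "d")]
test = [("en", "e"), ("sw", "f"), ("fi", "g")]

zs_train, zs_eval = zero_shot_split(train, test)
ml_train, ml_eval = multilingual_split(train, test)
print(len(zs_train), sorted(zs_eval))  # English-only training, all languages evaluated
print(len(ml_train), sorted(ml_eval))
```

In the zero-shot case, any score on a non-English language has to come from cross-lingual transfer, which is why that setting is a natural probe for the effect of parallel data.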

2. Results

  • Starting from publicly available mT5-Large checkpoints, the model is further pre-trained for 100K steps with a mix of monolingual and parallel objectives.
  • The parallel data is mixed into monolingual data at a 10% ratio, which amounts to roughly 4 passes over the OPUS-100 corpus.
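As a rough sanity check, the "roughly 4 passes" figure can be related to the other numbers quoted in this post. The sketch below assumes that the 10% ratio is measured in tokens, that "1M tokens" means exactly 10^6 tokens per batch, and that all 55M OPUS-100 translations are used; these are my assumptions, not statements from the paper.

```python
# Figures quoted in the post.
steps = 100_000                 # further pre-training steps
batch_tokens = 1_000_000        # "1M tokens" per batch (assumed to be exactly 10^6)
parallel_percent = 10           # share of parallel data in the mix (assumed token-level)
passes = 4                      # "roughly 4 passes" over OPUS-100
num_translations = 55_000_000   # OPUS-100 size

total_tokens = steps * batch_tokens                       # all tokens seen
parallel_tokens = total_tokens * parallel_percent // 100  # tokens from parallel data
tokens_per_pass = parallel_tokens / passes                # implied corpus size in tokens
tokens_per_translation = tokens_per_pass / num_translations

print(f"total tokens seen:       {total_tokens:.2e}")
print(f"parallel tokens seen:    {parallel_tokens:.2e}")
print(f"implied tokens per pass: {tokens_per_pass:.2e}")
print(f"implied tokens per translation pair: {tokens_per_translation:.1f}")
```

Under these assumptions the model sees about 10B parallel tokens, which implies a few dozen tokens per translation pair. That is a plausible length for a short source sentence plus its translation, so the quoted numbers are at least mutually consistent.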

Overall, adding parallel data through neural machine translation objectives improves scores for all 4 tasks, with the NMT objective performing the best.

Larger Model Size — XL
  • Even at the XL size (3.7B params, 3× larger than mT5-Large), nmT5 shows gains on all tasks (Table 3).

However, the magnitude of the gains is largely diminished, hinting that the need for parallel data reduces as model capacity increases.

Unseen Languages

nmT5 also outperforms mT5 on the subset of languages that were unseen in the parallel pre-training data, indicating that the representations of the nmT5 model are better suited for cross-lingual transfer.


