Review — MT-NLG: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

MT-NLG 530B, Microsoft & NVIDIA Cooperation, Much Larger & Better Than GPT-3

Sik-Ho Tsang
6 min read · Mar 25


Trend of sizes of state-of-the-art NLP models with time.

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model,
MT-NLG, by Microsoft and NVIDIA,
2022 arXiv v3, Over 170 Citations (Sik-Ho Tsang @ Medium)


  • Training such large models is challenging for two reasons. First, it is no longer possible to fit the parameters of these models in the memory of even the largest GPU. Second, the large number of compute operations required can result in unrealistically long training times.
  • In this paper, a joint effort between Microsoft and NVIDIA is presented on the training of the largest (at the publishing date) monolithic Transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters.
  • A 3D parallelism methodology, built on Microsoft’s DeepSpeed and NVIDIA’s Megatron-LM, is used to train this model.


  1. Memory Issues
  2. MT-NLG 530B
  3. Results

1. Memory Issues

  • Assuming training with the Adam optimizer, training consumes 20 bytes of memory per parameter.

Training a 530 billion parameter model thus requires over 10 terabytes of aggregate memory for the model weights, gradients, and optimizer states.
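The arithmetic behind the "over 10 terabytes" figure can be checked with a short sketch. The per-parameter breakdown below is an assumption (a common mixed-precision Adam layout that sums to the 20 bytes quoted above); the review itself only states the total, and the names are illustrative.

```python
# Assumed per-parameter memory layout for mixed-precision Adam training (bytes).
# Only the 20-byte total comes from the text; the split is an assumption.
BYTES_PER_PARAM = {
    "fp16 weight": 2,
    "fp16 gradient": 2,
    "fp32 weight copy": 4,
    "fp32 gradient copy": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}


def model_state_bytes(num_params: int) -> int:
    """Aggregate bytes for weights, gradients, and optimizer states."""
    return num_params * sum(BYTES_PER_PARAM.values())


total = model_state_bytes(530 * 10**9)
print(f"{total / 1e12:.1f} TB")  # 10.6 TB
```

With 80 GB per A100, this alone would need well over a hundred GPUs before any activations are stored.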

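  • Activations can also consume significant memory and scale with the training batch size, sequence length, and model dimensions. For the MT-NLG configuration, the estimate is:

batch size × number of layers × hidden dimension × sequence length × 2 bytes,

which is approximately 16.9 terabytes.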
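The 16.9-terabyte figure can be reproduced from the model configuration given in Section 2.4 (batch size 1920, 105 layers, hidden dimension 20480, sequence length 2048). The sketch below assumes one 2-byte (16-bit) value is kept per token, per hidden unit, per layer; the formula is this sketch's assumption, chosen because it matches the quoted number, and the function name is illustrative.

```python
def activation_bytes(batch_size: int, num_layers: int, hidden_dim: int,
                     seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to keep one 16-bit activation tensor per layer."""
    return batch_size * num_layers * hidden_dim * seq_len * bytes_per_value


total = activation_bytes(batch_size=1920, num_layers=105,
                         hidden_dim=20480, seq_len=2048)
print(f"{total / 1e12:.1f} TB")  # 16.9 TB
```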

Both totals far exceed the memory of any single GPU, so parallelism across many GPUs is needed to train MT-NLG 530B.

2. MT-NLG 530B

The system software stack combines pipeline parallelism and data parallelism from DeepSpeed with tensor slicing from Megatron.

2.1. Parallelism Types

  • Data parallelism is a ubiquitous technique in deep learning in which each input batch of training data is divided among the data-parallel workers.
  • Tensor model parallelism (or, tensor parallelism) is a broad class of model parallelism techniques that partitions the individual layers of the model across workers.
  • Pipeline model parallelism (or, pipeline parallelism) divides the layers of the model into stages that can be processed in parallel.
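To make tensor parallelism concrete, here is a minimal pure-Python sketch in which the columns of one weight matrix are partitioned across workers, each worker computes its slice of the output, and the slices are concatenated. It is a toy illustration of the idea, not Megatron-LM's implementation; all function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_workers):
    """Partition the columns of w evenly across workers."""
    cols = len(w[0])
    assert cols % num_workers == 0, "columns must divide evenly"
    step = cols // num_workers
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(num_workers)]


def tensor_parallel_matmul(x, w, num_workers):
    """Each worker holds one column shard of w and computes a slice of the output."""
    shards = split_columns(w, num_workers)
    partials = [matmul(x, shard) for shard in shards]  # one per worker
    # Concatenate the per-worker output slices row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, num_workers=2) == matmul(x, w))  # True
```

Each worker stores only its shard of the weights, which is what reduces per-GPU memory; the price is the communication needed to reassemble the output.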

2.2. 3D Parallelism with DeepSpeed and Megatron

Memory Efficiency: Transformer blocks are divided into pipeline stages, and the blocks of each stage are further divided via tensor parallelism.

  • This 2D combination simultaneously reduces the memory consumed by the weights, gradients, optimizer states, and activations.

Compute Efficiency: To further accelerate training, data parallelism is used to scale to an arbitrarily large number of GPUs.

  • For example, each 530 billion parameter model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism across nodes. Data parallelism is then used to scale out further to thousands of GPUs.
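The decomposition is easy to verify: 8-way tensor parallelism times 35-way pipeline parallelism gives 280 GPUs per replica, and the remaining GPUs form data-parallel replicas. The sketch below uses the node counts (280, 350, 420), the 8 GPUs per node, and the batch size of 1920 reported elsewhere in this review; the function name is illustrative.

```python
GPUS_PER_NODE = 8
TENSOR_PARALLEL = 8      # within a node
PIPELINE_PARALLEL = 35   # across nodes
GLOBAL_BATCH_SIZE = 1920


def data_parallel_degree(num_nodes: int) -> int:
    """Number of full model replicas that fit on the given number of nodes."""
    gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL  # 280
    total_gpus = num_nodes * GPUS_PER_NODE
    assert total_gpus % gpus_per_replica == 0, "GPUs must form whole replicas"
    return total_gpus // gpus_per_replica


for nodes in (280, 350, 420):
    dp = data_parallel_degree(nodes)
    print(nodes, "nodes ->", dp, "replicas,",
          GLOBAL_BATCH_SIZE // dp, "sequences per replica")
```

The three node counts give 8, 10, and 12 replicas, each of which divides the global batch of 1920 evenly.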

2.3. Computer Architecture Details

  • Model training is done with mixed precision using 16-bit bfloat on NVIDIA’s Selene supercomputer with 560 DGX A100 nodes.
  • Each cluster node has 8 NVIDIA 80-GB A100 GPUs, connected to each other by NVLink and NVSwitch.
  • Each node has eight NVIDIA Mellanox 200Gbps HDR Infiniband HCAs for application communication, with an additional two HCAs per node for dedicated storage.
  • For the 530 billion parameter model with batch size 1920, iteration times of 60.1, 50.2, and 44.4 seconds are observed on 280, 350, and 420 DGX A100 servers on Selene, respectively. These correspond to 126, 121, and 113 teraFLOP/s per GPU.
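The reported iteration times can be turned into token throughput and a rough scaling efficiency. This is only a back-of-the-envelope sketch: it takes the sequence length of 2048 from the model configuration in this review, and defines efficiency as speedup divided by the increase in node count, which is this sketch's own choice of metric.

```python
BATCH_SIZE = 1920
SEQ_LEN = 2048

# Reported iteration time in seconds, keyed by number of DGX A100 nodes.
ITERATION_SECONDS = {280: 60.1, 350: 50.2, 420: 44.4}


def tokens_per_second(num_nodes: int) -> float:
    """Tokens processed per second for one reported configuration."""
    return BATCH_SIZE * SEQ_LEN / ITERATION_SECONDS[num_nodes]


def scaling_efficiency(base_nodes: int, num_nodes: int) -> float:
    """Observed speedup relative to the ideal (linear) speedup."""
    speedup = ITERATION_SECONDS[base_nodes] / ITERATION_SECONDS[num_nodes]
    return speedup / (num_nodes / base_nodes)


for nodes in ITERATION_SECONDS:
    print(nodes, round(tokens_per_second(nodes)),
          round(scaling_efficiency(280, nodes), 2))
```

Going from 280 to 420 nodes keeps roughly 90% efficiency, in line with the per-GPU throughput falling from 126 to 113 teraFLOP/s.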

2.4. Transformer Model Architecture

  • The model uses the Transformer decoder architecture, a left-to-right, autoregressive, generative Transformer-based language model, scaled up to 530 billion parameters.
  • The number of layers, hidden dimensions, and attention heads are 105, 20480, and 128, respectively.
  • The sequence length is 2048 and the global batch size is 1920. 8-way tensor and 35-way pipeline parallelism are used.
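As a sanity check on the configuration, the common 12 × layers × hidden² approximation for a GPT-style decoder lands close to the headline parameter count. The sketch assumes a standard block (4h² attention weights plus 8h² for a feed-forward layer of width 4h) and ignores embeddings, biases, and layer norms; none of these details are stated in the review.

```python
NUM_LAYERS = 105
HIDDEN_DIM = 20480
NUM_HEADS = 128


def approx_block_params(num_layers: int, hidden_dim: int) -> int:
    """Approximate Transformer block parameters: 12 * layers * hidden^2.

    Assumes 4*h^2 attention weights and 8*h^2 feed-forward weights per layer.
    """
    return 12 * num_layers * hidden_dim ** 2


params = approx_block_params(NUM_LAYERS, HIDDEN_DIM)
print(f"{params / 1e9:.1f}B parameters")  # 528.5B parameters
print(HIDDEN_DIM // NUM_HEADS, "dimensions per attention head")  # 160
```

The remaining gap to 530 billion is plausibly covered by the terms this estimate leaves out.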

2.5. Training Datasets

Datasets used to train the MT-NLG model.
  • The top 11 rows are from the Pile dataset, followed by two Common Crawl snapshots, RealNews, and CC-Stories datasets.
  • The training dataset consists of 339 billion tokens, and MT-NLG is trained on 270 billion tokens by blending the 15 training datasets.

3. Results

  • (There are so many tasks tested, please read the paper directly for details.)

3.1. Validation Loss

Validation loss of MT-NLG.

The validation cross-entropy loss is 3.15 after the model is trained on the first 1 billion tokens.

3.2. Completion Prediction

LAMBADA zero-shot, one-shot and few-shot accuracy.
  • When evaluating this task zero-shot, each passage is fed to the model as input, and the check is whether the model can produce the correct last word via greedy generation (picking tokens with maximum probability).
  • However, for one-/few-shot evaluations, the problem is switched over to a cloze-style prompt format to better suggest to the model that the task is about predicting the last word of a sentence, as opposed to an arbitrary plausible continuation.
  • In such a case, “_. →” is inserted before the last word, e.g. “… Paul and Debbie looked at each other, then at _. →Bob”, and the check is whether the model predicts the correct word after the “→”.
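The cloze-style format described above can be sketched as a small prompt builder. The function names, the whitespace-based split of the last word, and the newline separator between shots are all assumptions made for illustration; only the “_. →” marker and the example sentence come from the text.

```python
CLOZE_MARK = " _. \u2192"  # the " _. →" marker inserted before the last word


def split_last_word(passage: str):
    """Split a passage into (context, last word) on the final space."""
    context, _, last_word = passage.rstrip().rpartition(" ")
    return context, last_word


def cloze_example(passage: str, reveal_answer: bool) -> str:
    """Format one passage; in-context shots keep the answer, the query omits it."""
    context, last_word = split_last_word(passage)
    prompt = context + CLOZE_MARK
    return prompt + last_word if reveal_answer else prompt


def few_shot_prompt(shots, query: str) -> str:
    """Join solved examples and the unsolved query (separator is an assumption)."""
    lines = [cloze_example(p, reveal_answer=True) for p in shots]
    lines.append(cloze_example(query, reveal_answer=False))
    return "\n".join(lines)


print(cloze_example("Paul and Debbie looked at each other, then at Bob", True))
```

The model's next word after the final “→” would then be compared with the held-out last word of the query passage.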

MT-NLG outperforms previous models, such as GPT-3, across different settings and establishes new SOTA for all 3 settings.

3.3. Reading Comprehension

Reading comprehension results on RACE-h and BoolQ.

BoolQ scores significantly improve from zero-shot to few-shot, while RACE-h does not benefit from having many examples.

3.4. Commonsense Reasoning

Commonsense reasoning results on Winogrande, HellaSWAG and PiQA.

Moving from zero-shot to one-shot generally yields only minor gains, or even performance dips, whereas moving from zero-shot to few-shot yields significant gains.

3.5. Natural Language Inference

Natural Language Inference accuracy on the HANS dataset, as a function of the number of shots and the amount of training
Natural language inference results on ANLI (R2) and HANS datasets.

At zero-shot, models are struggling at chance level for HANS, yet MT-NLG is very effective in leveraging in-context examples as the number of shots increases.

3.6. Word Sense Disambiguation

Word-in-Context dataset results.

There are significant improvements moving from zero-shot to few-shot, surpassing chance level performance.

  • (There are other issues mentioned in the paper as well, such as Social Bias, Gender and Occupation Analysis, Adjective Co-Occurrence Analysis (Gender, Ethnicity and Religion), Sentiment Analysis, Discussions, Limitations, and so on.)


