Brief Review — Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Scaling Up GPT-2 and BERT by Model Parallelism

Sik-Ho Tsang
4 min read · Nov 12, 2022
8.3 billion-parameter GPT-2-like Language Model by Megatron-LM

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,
by NVIDIA,
2020 arXiv v4, Over 500 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, BERT, GPT-2

  • Very large models can be quite difficult to train due to memory constraints.
  • With the proposed model parallelism, an 8.3 billion-parameter GPT-2 and a 3.9 billion-parameter BERT are trained.


  1. Model Parallelism Transformer
  2. Results

1. Model Parallelism Transformer

Each Transformer layer consists of a self-attention block and an MLP. A model-parallel implementation is derived for both parts of the Transformer.

1.1. Model Parallelism in Transformer MLP

  • The first part of the block is a general matrix-matrix multiplication (GEMM) followed by a GELU nonlinearity: Y = GELU(XA).
  • One option to parallelize the GEMM is to split the weight matrix A along its rows and the input X along its columns: X = [X1, X2], A = [A1; A2], giving Y = GELU(X1A1 + X2A2).
  • Because GELU is nonlinear, GELU(X1A1 + X2A2) ≠ GELU(X1A1) + GELU(X2A2), so this approach requires a synchronization point before the GELU.
  • Another option is to split A along its columns, A = [A1, A2]. This partitioning allows the GELU nonlinearity to be applied independently to the output of each partitioned GEMM: [Y1, Y2] = [GELU(XA1), GELU(XA2)].
  • This is advantageous as it removes a synchronization point.
Model Parallelism in Transformer MLP

Hence, the first GEMM is partitioned in this column parallel fashion and the second GEMM is split along its rows so it takes the output of the GELU layer directly without requiring any communication, as above.
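The column-then-row partitioning above can be checked numerically. The following is a minimal NumPy sketch (not the actual Megatron-LM code, which runs on GPUs with NCCL): the two weight shards per matrix stand in for two GPUs, and the final sum plays the role of the single forward all-reduce.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; elementwise, so it commutes with column splits
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))    # input activations
A = rng.normal(size=(8, 16))   # first MLP weight
B = rng.normal(size=(16, 8))   # second MLP weight

# Serial reference: Z = GELU(X A) B
Z_ref = gelu(X @ A) @ B

# Column-parallel first GEMM: A = [A1, A2]; GELU applies independently per shard.
A1, A2 = np.split(A, 2, axis=1)
Y1, Y2 = gelu(X @ A1), gelu(X @ A2)

# Row-parallel second GEMM: B = [B1; B2]; each "GPU" computes a partial product.
B1, B2 = np.split(B, 2, axis=0)
Z = Y1 @ B1 + Y2 @ B2          # the "+" is the single forward all-reduce

assert np.allclose(Z, Z_ref)
```

Because the GELU is elementwise, each shard's nonlinearity is computed locally, and the only communication in the whole MLP forward pass is the summation after the second GEMM.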

1.2. Model Parallelism in Transformer Self-Attention Block

Model Parallelism in Transformer Self-Attention Block
  • The GEMMs associated with the key (K), query (Q), and value (V) are partitioned in a column-parallel fashion such that the matrix multiply corresponding to each attention head is done locally on one GPU.
  • The subsequent GEMM from the output linear layer (after self attention) is parallelized along its rows and takes the output of the parallel attention layer directly, without requiring communication between the GPUs.
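The head-level partitioning can be sketched the same way. In this toy NumPy version (an illustration, not the paper's implementation), each "GPU" holds the QKV weight columns of its own attention head, runs attention locally, and applies its row shard of the output projection; summing the partials reproduces the usual concat-then-project result.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_attention(X, Wq, Wk, Wv):
    # single-head scaled dot-product attention on one shard of the hidden dim
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
seq, d, heads = 5, 8, 2
X = rng.normal(size=(seq, d))
Wq = rng.normal(size=(d, d)); Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d)); Wo = rng.normal(size=(d, d))

# Column-parallel QKV: each "GPU" owns the weight columns of its own head,
# so each head's attention runs entirely locally, with no communication.
outs = []
for h in range(heads):
    cols = slice(h * d // heads, (h + 1) * d // heads)
    outs.append(head_attention(X, Wq[:, cols], Wk[:, cols], Wv[:, cols]))

# Row-parallel output projection: partial results are summed (one all-reduce).
Wo_rows = np.split(Wo, heads, axis=0)
Y = sum(out @ Wo_r for out, Wo_r in zip(outs, Wo_rows))

# Serial reference: concatenate the head outputs, then project with full Wo.
ref = np.concatenate(outs, axis=-1) @ Wo
assert np.allclose(Y, ref)
```

As in the MLP, the softmax and per-head matmuls never cross GPU boundaries; only the summed output projection needs an all-reduce.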

1.3. Putting it All Together

Communication operations in a Transformer layer.
  • This approach, for both the MLP and the self-attention block, fuses groups of two GEMMs, eliminates the synchronization point in between, and results in better scaling.
  • It enables all GEMMs in a simple Transformer layer to be performed using only two all-reduces in the forward path and two in the backward path.

2. Results

2.1. Language Modeling Results Using GPT-2

Scaling Up GPT-2 Using Proposed Model Parallelism
  • Training data: 174 GB WebText/CC Stories/Wikipedia/RealNews.
  • 3 model sizes: 355 million, 2.5 billion, and 8.3 billion parameters.
Validation set perplexity

Larger language models converge noticeably faster and converge to lower validation perplexities than their smaller counterparts.

Zero shot evaluation

Increasing model size also leads to lower perplexity on WikiText103 and higher cloze accuracy on LAMBADA.

2.2. Bi-directional Transformer Results Using BERT

Training loss for BERT model using (a) the original architecture and (b) the rearranged architecture
  • Similar to the observation in ALBERT, naively scaling BERT degrades training; rearranging the placement of BERT’s layer normalization relative to the skip connection enables stable training at larger scales.
Scaling Up BERT Using Proposed Model Parallelism
  • 3 model sizes are established: 334M, 1.3B, and 3.9B parameters.
Development set results for MNLI, QQP, SQuAD 1.1 and SQuAD 2.0 and test set results for RACE

By scaling up, Megatron-3.9B obtains the best results.

  • NVIDIA presented Megatron-LM at GTC 2020 (link provided below). Later, GPT-3 would be much larger still, with a model size of 175B parameters.


[2020 arXiv v4] [Megatron-LM]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

[GTC 2020] [Megatron-LM]

4.1. Language Model / Sequence Model

(Some are not related to NLP, but I just group them here)

1991 … 2020: [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT] [BART] [Longformer] [ELECTRA] [Megatron-LM]

My Other Previous Paper Readings


