Brief Review — Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Scaling Up and by Model Parallelism

Sik-Ho Tsang
4 min read · Nov 12, 2022


8.3 billion-parameter GPT-2-like Language Model by Megatron-LM

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (Megatron-LM), by NVIDIA,
2020 arXiv v4, Over 500 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, GPT-2, BERT

  • Very large models can be quite difficult to train due to memory constraints.
  • With the proposed model parallelism, an 8.3 billion-parameter GPT-2-like model and a 3.9 billion-parameter BERT-like model are trained.

Outline

  1. Model Parallelism
  2. Results

1. Model Parallelism

Each Transformer layer has a self-attention block and an MLP block. Model-parallel implementations are considered for both.

1.1. Model Parallelism in MLP

  • The first part of the MLP block is a General Matrix-Matrix Multiplication (GEMM) followed by a nonlinearity: Y = GeLU(XA).
  • One option to parallelize the GEMM is to split the weight matrix A along its rows and the input X along its columns: X = [X1, X2], A = [A1; A2], so that Y = GeLU(X1A1 + X2A2).
  • Since GeLU is nonlinear, GeLU(X1A1 + X2A2) ≠ GeLU(X1A1) + GeLU(X2A2), so this approach requires a synchronization point before the GeLU function.
  • Another option is to split A along its columns, A = [A1, A2]. This partitioning allows the nonlinearity to be independently applied to the output of each partitioned GEMM: [Y1, Y2] = [GeLU(XA1), GeLU(XA2)].
  • This is advantageous as it removes a synchronization point.
Model Parallelism in MLP

Hence, the first GEMM is partitioned in this column-parallel fashion, and the second GEMM is split along its rows so that it takes the output of the GeLU layer directly, without requiring any communication, as above.
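
To make this concrete, below is a minimal NumPy sketch (not the actual Megatron-LM code) that simulates the two-GPU partitioning on a single machine. The shapes and the tanh GeLU approximation are illustrative, and the final addition stands in for the all-reduce across GPUs.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU, as used in GPT-2-style MLPs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # [tokens, hidden]
A = rng.standard_normal((8, 32))   # first GEMM: hidden -> 4*hidden
B = rng.standard_normal((32, 8))   # second GEMM: 4*hidden -> hidden

# Reference: un-partitioned MLP block
Z_ref = gelu(X @ A) @ B

# Tensor parallelism across two "GPUs":
# A is split along its columns, B along its rows, so GeLU is applied locally per shard.
A1, A2 = np.split(A, 2, axis=1)
B1, B2 = np.split(B, 2, axis=0)
Y1, Y2 = gelu(X @ A1), gelu(X @ A2)   # no communication needed here
Z = Y1 @ B1 + Y2 @ B2                 # this sum plays the role of the single all-reduce

assert np.allclose(Z, Z_ref)
```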

1.2. Model Parallelism in Self-Attention Block

Model Parallelism in Self-Attention Block
  • The GEMMs associated with the key (K), query (Q), and value (V) are partitioned in a column-parallel fashion, such that the matrix multiplication corresponding to each attention head is done locally on one GPU.
  • The subsequent GEMM from the output linear layer (after self-attention) is parallelized along its rows and takes the output of the parallel attention layer directly, without requiring communication between the GPUs (see the sketch below).
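
The same pattern can be sketched for the self-attention block: each "GPU" holds the Q/K/V columns of its own head and the matching rows of the output projection, so only the final sum needs communication. Again, this is an illustrative NumPy simulation under assumed shapes, not the Megatron-LM implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, H, n_heads = 4, 8, 2                 # tokens, hidden size, heads (one head per "GPU")
d = H // n_heads
X = rng.standard_normal((T, H))
Wq, Wk, Wv, Wo = (rng.standard_normal((H, H)) for _ in range(4))

def local_attention(rank):
    # Column-parallel Q/K/V: each rank holds only the columns of its own head.
    cols = slice(rank * d, (rank + 1) * d)
    Q, K, V = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
    head = softmax(Q @ K.T / np.sqrt(d)) @ V
    # Row-parallel output projection: each rank holds the matching rows of Wo.
    return head @ Wo[cols, :]

# Reference: all heads computed on one device, then projected
Q, K, V = X @ Wq, X @ Wk, X @ Wv
heads = [softmax(Q[:, r*d:(r+1)*d] @ K[:, r*d:(r+1)*d].T / np.sqrt(d)) @ V[:, r*d:(r+1)*d]
         for r in range(n_heads)]
ref = np.concatenate(heads, axis=1) @ Wo

# Model parallel: per-rank partial outputs are summed by one all-reduce
out = local_attention(0) + local_attention(1)
assert np.allclose(out, ref)
```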

1.3. Putting it All Together

Communication operations in a Transformer layer.
  • This approach, for both the MLP and the self-attention layer, fuses groups of two GEMMs, eliminates a synchronization point in between, and results in better scaling.
  • This enables performing all GEMMs in a simple Transformer layer using only two all-reduces in the forward path and two in the backward path (sketched below).
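
In PyTorch terms, these communications can be written as a pair of conjugate autograd operators (called f and g in the paper): one is an identity in the forward pass and an all-reduce in the backward pass, and the other is the reverse. The sketch below is a simplified illustration rather than the Megatron-LM source; it assumes torch.distributed has already been initialized and that each rank holds its own weight shards.

```python
import torch
import torch.distributed as dist


class CopyToModelParallel(torch.autograd.Function):
    """The 'f' operator: identity in the forward pass, all-reduce in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        grad = grad.clone()
        dist.all_reduce(grad)   # sum gradients from all tensor-parallel ranks
        return grad


class ReduceFromModelParallel(torch.autograd.Function):
    """The 'g' operator: all-reduce in the forward pass, identity in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        x = x.clone()
        dist.all_reduce(x)      # sum partial outputs from all tensor-parallel ranks
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad


def parallel_mlp(x, A_shard, B_shard):
    # Column-parallel first GEMM, row-parallel second GEMM: one forward all-reduce (g)
    # and one backward all-reduce (f) per block, i.e. two of each per Transformer layer
    # once the self-attention block is counted as well.
    x = CopyToModelParallel.apply(x)                   # f
    y = torch.nn.functional.gelu(x @ A_shard)          # purely local computation
    return ReduceFromModelParallel.apply(y @ B_shard)  # g
```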

2. Results

2.1. Language Modeling Results Using GPT-2

Scaling Up Using Proposed Model Parallelism
  • Training data: 174 GB WebText/CC Stories/Wikipedia/RealNews.
  • 3 model sizes: 355 million, 2.5 billion, and 8.3 billion parameters.
Validation set perplexity

Larger language models converge noticeably faster and converge to lower validation perplexities than their smaller counterparts.

Zero-shot evaluation

Increasing model size also leads to lower perplexity on WikiText103 and higher cloze accuracy on LAMBADA.

2.2. Bi-directional Results Using BERT

Training loss for BERT model using (a) the original architecture and (b) the rearranged architecture
  • In Megatron-LM, BERT's layer normalization is rearranged with respect to the skip connection, which helps enable stable training as the model size grows (a toy sketch follows).
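
My reading of this rearrangement is the familiar post-LN vs. pre-LN ordering of the residual connection. The toy PyTorch sketch below contrasts the two orderings; the module names and shapes are illustrative, not the Megatron-LM BERT code.

```python
import torch
import torch.nn as nn

hidden = 16
norm = nn.LayerNorm(hidden)
sublayer = nn.Linear(hidden, hidden)   # stands in for self-attention or the MLP
x = torch.randn(2, hidden)

# (a) Original BERT ordering: LayerNorm is applied after the residual addition.
out_original = norm(x + sublayer(x))

# (b) Rearranged ordering: LayerNorm moves inside the block, so the skip connection
#     bypasses it; this is the variant reported to train stably at larger scales.
out_rearranged = x + sublayer(norm(x))
```
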
Scaling Up Using Proposed Model Parallelism
  • 3 model sizes are established: 334M, 1.3B, and 3.9B parameters.
Development set results for MNLI, QQP, SQuAD 1.1 and SQuAD 2.0 and test set results for RACE

By scaling up, Megatron-3.9B obtains the best results.

  • NVIDIA presented Megatron-LM at GTC 2020 (link provided below). Later, GPT-3 is much larger, with a model size of 175B parameters.

References

[2020 arXiv v4] [Megatron-LM] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

[GTC 2020] [Megatron-LM]

