Brief Review — GLM-130B: An Open Bilingual Pre-trained Model
- GLM is used as the backbone: a Transformer-based language model that leverages autoregressive blank infilling as its training objective.
- In brief, for a text sequence, random multi-token spans and some individual tokens are masked, and the model is asked to recover them autoregressively.
- (Please feel free to read the GLM paper for more details.)
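For reference, the blank-infilling objective as given in the GLM paper (notation mine): with $\mathbf{x}_{\text{corrupt}}$ the corrupted text, $\{\mathbf{s}_i\}_{i=1}^{m}$ the masked spans, and $\mathbf{z}$ a random permutation of the span order,

```latex
\max_{\theta}\;\mathbb{E}_{\mathbf{z}\sim Z_m}
\left[\sum_{i=1}^{m}\log p_{\theta}\!\left(\mathbf{s}_{z_i}\,\middle|\,\mathbf{x}_{\text{corrupt}},\,\mathbf{s}_{\mathbf{z}_{<i}}\right)\right]
```

That is, each masked span is generated left-to-right, conditioned on the corrupted text and the previously recovered spans.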
- Post-LN initialized with the newly proposed DeepNorm, as in DeepNet, is used for training stability.
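A minimal numpy sketch of the DeepNorm residual connection. The paper sets alpha = sqrt(2N) for an N-layer stack; the `layer_norm` here omits the learnable gain/bias for brevity, so this is an illustration rather than the exact implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (hidden) dimension; gain/bias omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_residual(x, sublayer_out, num_layers):
    # DeepNorm (Post-LN variant): LayerNorm(alpha * x + Sublayer(x)),
    # with alpha = (2N)^(1/2) for an N-layer stack, as in GLM-130B.
    alpha = (2 * num_layers) ** 0.5
    return layer_norm(alpha * x + sublayer_out)
```

Scaling the residual branch by alpha bounds the update magnitude per layer, which is what stabilizes deep Post-LN training.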
- Rotary Positional Encoding (RoPE) is used for positional embedding.
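A numpy sketch of RoPE applied to a `(seq_len, dim)` array of queries or keys. The even/odd channel pairing and base 10000 follow the common RoPE convention; this is an illustration, not the paper's implementation:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each channel pair (2i, 2i+1) of x by angle pos * theta_i,
    where theta_i = base**(-2i/dim). x has shape (seq_len, dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-np.arange(half) * 2.0 / dim)          # (half,)
    angles = np.arange(seq_len)[:, None] * theta[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is a pure 2-D rotation, RoPE preserves vector norms, and the dot product between a rotated query and key depends only on their relative position.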
- To improve the FFNs in the Transformer, GLU with GeLU activation is used.
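A sketch of the gated FFN, assuming the common GeGLU formulation `(GeLU(x W_gate) * (x W_up)) W_down`; the weight names are mine:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, W_gate, W_up, W_down):
    """FFN with GLU gating: the GeLU branch gates the linear branch
    elementwise before the down-projection."""
    return (gelu(x @ W_gate) * (x @ W_up)) @ W_down
```

Compared with a plain two-layer FFN, the multiplicative gate adds a third weight matrix, so the hidden width is usually reduced to keep the parameter count comparable.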
- The pre-training data includes the 1.2T English Pile corpus (Gao et al., 2020), the 1.0T Chinese WudaoCorpora (Yuan et al., 2021), and 250G of additional Chinese corpora.
- Multi-Task Instruction Pre-Training (MIP, 5% tokens) is used.
- GLM-130B is trained on a cluster of 96 DGX-A100 (8×40G) servers with a 60-day access. Data parallelism and tensor model parallelism are used.
- FP16 is used for the forward and backward passes and FP32 for the optimizer states and master weights, to reduce GPU memory usage and improve training efficiency.
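A toy numpy illustration of why FP32 master weights are kept: near 1.0, FP16 has a spacing of about 1e-3, so updates much smaller than that are rounded away when accumulated directly in FP16, but survive in an FP32 master copy:

```python
import numpy as np

def accumulate_updates(steps, update=1e-4):
    """Apply the same tiny update in FP16 and in an FP32 master weight.
    In FP16, 1.0 + 1e-4 rounds back to 1.0, so the weight never moves."""
    w16 = np.float16(1.0)
    master = np.float32(1.0)
    for _ in range(steps):
        w16 = np.float16(w16 + np.float16(update))       # lost to rounding
        master = np.float32(master + np.float32(update))  # preserved
    return float(w16), float(master)
```

This is the standard motivation for mixed-precision training, not code from the paper.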
- There are spikes in the pre-training loss, which can cause training failures. Gradient shrink on the embedding layer helps overcome loss spikes.
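The gradient-shrink trick can be written as `emb = emb * alpha + emb.detach() * (1 - alpha)` with alpha = 0.1 as in the paper: the forward value is unchanged, while the gradient flowing into the embedding is scaled by alpha. A numpy sketch (numpy has no autograd, so `detach` is a no-op here and only the forward identity can be checked):

```python
import numpy as np

def shrink_embedding_grad(emb, alpha=0.1):
    """Forward value is emb itself; in an autograd framework the second
    term would be detached, so backprop sees only the alpha-scaled path."""
    detached = emb  # stand-in for emb.detach() in an autograd framework
    return emb * alpha + detached * (1 - alpha)
```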
- FasterTransformer is leveraged to implement GLM-130B in C++.
- INT4 quantization of the model weights (i.e., mostly the linear layers) is used, while keeping FP16 precision for the activations.
- This halves the required GPU memory, to 70GB, thus allowing GLM-130B inference on 4× RTX 3090 Ti (24G) or 8× RTX 2080 Ti (11G) with nearly no performance degradation.
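A sketch of symmetric (absmax) per-row INT4 weight quantization; the paper's exact scheme may differ in granularity, this just shows the idea of storing 4-bit integers plus a per-row scale while activations stay in FP16/FP32:

```python
import numpy as np

def quantize_int4_symmetric(W):
    """Per-row absmax quantization to the INT4 range [-8, 7].
    Assumes no all-zero rows (scale would be zero)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)  # 4-bit values in int8 storage
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate FP32 weights for the matmul.
    return q.astype(np.float32) * scale
```

The rounding error per element is at most half a quantization step (scale / 2), which is why weight-only INT4 can be nearly lossless for large, well-conditioned linear layers.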
As a bilingual (English and Chinese) LLM, it offers significantly better results than ERNIE TITAN 3.0 260B, the largest Chinese LLM, on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE ones (+12.75%).
Importantly, as summarized in Figure 1(b), GLM-130B as an open model is associated with significantly less bias and generation toxicity than its 100B-scale counterparts.
GLM-130B exhibits performance that surpasses GPT-3 on a wide range of benchmarks (112 tasks in total) and also outperforms PaLM 540B in many cases, while such outperformance over GPT-3 has not been observed for OPT-175B or BLOOM-176B.
- (Please feel free to read the paper directly for more details.)