Brief Review — GLM-130B: An Open Bilingual Pre-trained Model
GLM-130B, Supports Both Chinese & English, Outperforms GPT-3, OPT, BLOOM, PaLM
GLM-130B: An Open Bilingual Pre-trained Model,
GLM-130B, by Tsinghua University and Zhipu.AI,
2023 ICLR (Sik-Ho Tsang @ Medium)
Language Model
1991 … 2022 [GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] [AlexaTM 20B] [OPT] [Switch Transformers] [LaMDA] [LoRA] [Galactica] 2023 [GPT-4] [LLaMA] [LIMA] [Koala] [BloomBergGPT]
==== My Other Paper Readings Are Also Over Here ====
- GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters, is proposed, along with the design choices and training strategies used for both efficiency and stability.
- ChatGLM is developed based on GLM-130B: https://chatglm.cn/
Outline
- GLM-130B
- Results
1. GLM-130B
1.1. Model
- GLM is used as the backbone: a Transformer-based language model that leverages autoregressive blank infilling as its training objective.
- In brief, for a text sequence, random spans of tokens as well as random single tokens are masked, and the model is asked to recover them autoregressively. The corresponding objective function is:
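- For reference, the autoregressive blank-infilling objective from the GLM paper has roughly the following form, where the sampled spans s_{z_1}, …, s_{z_m} are recovered from the corrupted input x_corrupt in a random permutation order z (a paraphrase of the GLM formulation, not a verbatim copy of the paper's equation):

```latex
\max_{\theta}\; \mathbb{E}_{\mathbf{z}\sim Z_m}
\left[\sum_{i=1}^{m}\log p_{\theta}\!\left(\mathbf{s}_{z_i}\mid \mathbf{x}_{\text{corrupt}},\, \mathbf{s}_{\mathbf{z}_{<i}}\right)\right]
```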
- (Please feel free to read GLM for more details.)
- Post-LN initialized with the newly proposed DeepNorm, as in DeepNet, is used for training stability (see the sketch after this list).
- Rotary Positional Encoding (RoPE) is used as the positional encoding.
- To improve the FFNs in the Transformer, GLU with GeLU activation (GeGLU) is used.
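- As a rough illustration of these choices, below is a minimal PyTorch sketch of a DeepNorm-style post-LN residual connection wrapped around a GeGLU feed-forward block (α = √(2N) for N layers follows DeepNet; the class names, toy sizes, and the omission of attention/RoPE are my own simplifications, not the official GLM-130B implementation):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFFN(nn.Module):
    """Feed-forward block using GLU with GeLU activation (GeGLU)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff)  # gating branch
        self.w_up = nn.Linear(d_model, d_ff)    # value branch
        self.w_down = nn.Linear(d_ff, d_model)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GeGLU: GeLU(x W_gate) element-wise multiplied with x W_up
        return self.w_down(F.gelu(self.w_gate(x)) * self.w_up(x))

class DeepNormBlock(nn.Module):
    """Post-LN residual stabilized in the DeepNorm style:
    LayerNorm(alpha * x + sublayer(x)), with alpha = sqrt(2 * N) for N layers."""
    def __init__(self, d_model: int, d_ff: int, num_layers: int):
        super().__init__()
        self.alpha = math.sqrt(2 * num_layers)
        self.ffn = GeGLUFFN(d_model, d_ff)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.ffn(x))

# Toy usage with hypothetical sizes, far smaller than the real model.
block = DeepNormBlock(d_model=512, d_ff=2048, num_layers=70)
y = block(torch.randn(2, 16, 512))  # (batch, sequence, d_model)
print(y.shape)                      # torch.Size([2, 16, 512])
```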
1.2. Corpus
- The pre-training data includes the 1.2T Pile (Gao et al., 2020) English corpus, the 1.0T Chinese WudaoCorpora (Yuan et al., 2021), and 250G of Chinese corpora crawled from the web.
- Multi-Task Instruction Pre-Training (MIP), accounting for 5% of the training tokens, is also used.
1.3. Platform
- GLM-130B is trained on a cluster of 96 DGX-A100 (8×40G) GPU servers with 60 days of access. Data parallelism and tensor model parallelism are used.
- Mixed precision is adopted: FP16 for the forward and backward passes and FP32 for optimizer states and master weights, to reduce GPU memory usage and improve training efficiency.
- Spikes in the pre-training loss can cause training failures. Gradient shrink on the embedding layer helps overcome such loss spikes:
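- In code, the embedding gradient shrink amounts to scaling only the gradient that flows back into the word-embedding output, while leaving the forward value unchanged. A minimal PyTorch sketch (the function name is mine; α = 0.1 is the shrinking factor reported in the paper):

```python
import torch

def shrink_embedding_gradient(word_embedding: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Shrink the gradient on the embedding output by a factor alpha.

    The forward value is unchanged (alpha * e + (1 - alpha) * e == e), but only
    the first term stays attached to the autograd graph, so the gradient
    reaching the embedding table is scaled by alpha.
    """
    return word_embedding * alpha + word_embedding.detach() * (1.0 - alpha)
```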
1.4. Inference
- FasterTransformer is leveraged to implement GLM-130B in C++.
- INT4 quantization of the model weights (i.e., mostly the linear layers) is used, while keeping FP16 precision for activations.
- This halves the required GPU memory to 70GB, thus allowing GLM-130B inference on 4× RTX 3090 Ti (24G) or 8× RTX 2080 Ti (11G), with almost no performance degradation.
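- Below is a minimal NumPy sketch of symmetric (absmax) weight-only quantization to the INT4 range, with the matmul run in FP16 after on-the-fly dequantization; the per-column scaling and the helper names are my own illustration, not the FasterTransformer kernels actually used:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric absmax quantization of a weight matrix to the INT4 range [-8, 7]."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 7.0        # one FP16 scale per output column
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # INT4 values stored in an int8 array
    return q, scale.astype(np.float16)

def int4_linear(x_fp16: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize the weights on the fly and run the matmul in FP16 (weight-only quantization)."""
    w_fp16 = q.astype(np.float16) * scale
    return x_fp16 @ w_fp16

# Toy usage with a hypothetical 8-in / 4-out linear layer.
w = np.random.randn(8, 4).astype(np.float16)
q, s = quantize_int4(w)
x = np.random.randn(2, 8).astype(np.float16)
print(int4_linear(x, q, s).shape)  # (2, 4)
```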
2. Results
For zero-shot performance, GLM-130B is better than GPT-3 175B (+5.0%), OPT-175B (+6.5%), and BLOOM-176B (+13.0%) on LAMBADA, and achieves 3× better performance than GPT-3 on Big-bench-lite.
For the 5-shot MMLU (Hendrycks et al., 2021) tasks, it is better than GPT-3 175B (+0.9%) and BLOOM-176B (+12.7%).
As a bilingual LLM also in Chinese, it offers significantly better results than ERNIE TITAN 3.0 260B — the largest Chinese LLM — on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE ones (+12.75%).
Importantly, as summarized in Figure 1(b), GLM-130B as an open model is associated with significantly less bias and generation toxicity than its 100B-scale counterparts.
GLM-130B exhibits performance that surpasses GPT-3 on a wide range of benchmarks (112 tasks in total) and also outperforms PaLM 540B in many cases, while such outperformance over GPT-3 has not been observed in OPT-175B or BLOOM-176B.
- (Please feel free to read the paper directly for more details.)