Brief Review — GLM-130B: An Open Bilingual Pre-trained Model

GLM-130B, Supports Both Chinese & English, Outperforms GPT-3, OPT, BLOOM, PaLM

Sik-Ho Tsang
4 min read · Jun 17, 2023
Based on GLM-130B, ChatGLM is developed, which supports both Chinese and English (from https://www.zhipuai.cn/)

GLM-130B: An Open Bilingual Pre-trained Model,
GLM-130B, by Tsinghua University and Zhipu.AI,
2023 ICLR (Sik-Ho Tsang @ Medium)

Language Model
1991 … 2022
[GPT-NeoX-20B] [GPT-3.5, InstructGPT] [GLM] [MT-NLG 530B] [Chinchilla] [PaLM] [AlexaTM] [BLOOM] [AlexaTM 20B] [OPT] [Switch Transformers] [LaMDA] [LoRA] [Galactica] 2023 [GPT-4] [LLaMA] [LIMA] [Koala] [BloombergGPT]
==== My Other Paper Readings Are Also Over Here ====

  • GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters, is proposed, along with the design choices and training strategies used for both efficiency and stability.
  • ChatGLM is developed based on GLM-130B: https://chatglm.cn/

Outline

  1. GLM-130B
  2. Results

1. GLM-130B

1.1. Model

  • GLM is used as the backbone, which is a Transformer-based language model that leverages autoregressive blank infilling as its training objective.
  • In brief, for a text sequence, random spans of tokens and random single tokens are masked, and the model is asked to recover them autoregressively. The corresponding objective function is reconstructed after this list.
  • (Please feel free to read GLM for more details.)
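As a reference, the blank-infilling objective can be sketched as follows (a reconstruction using the GLM paper's notation, so please verify against the original): for an input x with sampled spans {s_1, …, s_m} and the corrupted text x_corrupt, the model maximizes

```latex
\max_{\theta}\; \mathbb{E}_{z \sim Z_m}\!\left[\sum_{i=1}^{m} \log p_{\theta}\!\left(s_{z_i} \mid x_{\text{corrupt}},\, s_{z_{<i}}\right)\right],
\qquad
p_{\theta}\!\left(s_i \mid x_{\text{corrupt}},\, s_{z_{<i}}\right) = \prod_{j=1}^{l_i} p\!\left(s_{i,j} \mid x_{\text{corrupt}},\, s_{z_{<i}},\, s_{i,<j}\right),
```

where Z_m is the set of permutations of the m spans and each span s_i = [s_{i,1}, …, s_{i,l_i}] is generated autoregressively from left to right.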
DeepNorm, originating from DeepNet, is used for training stability.
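In rough terms, DeepNorm replaces the usual post-LN residual connection LayerNorm(x + Network(x)) with LayerNorm(α·x + Network(x)), where α is a depth-dependent constant prescribed by DeepNet (together with a correspondingly scaled initialization). A minimal PyTorch sketch of this residual form, with the module name and interface chosen here for illustration:

```python
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Post-LN residual connection in the DeepNorm style: LayerNorm(alpha * x + sublayer(x)).

    `alpha` is a depth-dependent constant chosen following DeepNet; the exact value
    (and the matching initialization scaling) used by GLM-130B should be taken from the paper.
    """

    def __init__(self, hidden_size: int, alpha: float):
        super().__init__()
        self.alpha = alpha
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Scale the identity branch by alpha before the post-layer normalization,
        # which is what stabilizes very deep Transformers in DeepNet.
        return self.norm(self.alpha * x + sublayer_out)
```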

1.2. Corpus

  • The pre-training data includes the 1.2T Pile English corpus (Gao et al., 2020), 1.0T of Chinese WudaoCorpora (Yuan et al., 2021), and 250G of additional Chinese corpora crawled from the web.
  • Multi-Task Instruction Pre-Training (MIP) is used for 5% of the training tokens.

1.3. Platform

  • GLM-130B is trained on a cluster of 96 DGX-A100 (8×40G) GPU servers with 60-day access. Data parallelism and tensor model parallelism are used.
  • Mixed precision is used: FP16 for the forward and backward passes and FP32 for optimizer states and master weights, to reduce GPU memory usage and improve training efficiency.
Loss Spikes and Training Failure Without Gradient Shrinking
  • There are spikes in the pre-training loss, which can cause training failure. Shrinking the gradient on the embedding layer helps overcome these loss spikes, as sketched below:
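A minimal PyTorch sketch of this embedding-gradient shrink (the function name is illustrative; the shrink factor α = 0.1 follows the setting reported in the paper):

```python
import torch

def shrink_embedding_gradient(word_embedding: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Let only a fraction `alpha` of the gradient flow back into the embedding layer.

    The forward value is unchanged (alpha * x + (1 - alpha) * x == x), but the detached
    term carries no gradient, so the embedding weights receive a gradient scaled by alpha.
    This damps the loss spikes observed in GLM-130B pre-training.
    """
    return word_embedding * alpha + word_embedding.detach() * (1.0 - alpha)
```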

1.4. Inference

Quantization
  • FasterTransformer is leveraged to implement GLM-130B in C++.
  • INT4 quantization of the model weights (i.e., mostly the linear layers) is used, while keeping FP16 precision for the activations.
  • This halves the required GPU memory to 70GB, allowing GLM-130B inference on 4× RTX 3090 Ti (24G) or 8× RTX 2080 Ti (11G) with nearly no performance degradation; a rough sketch of the idea follows below.
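As an illustration of weight-only quantization, here is a sketch of symmetric absmax, per-output-channel quantization; this is only a conceptual example, not the actual FasterTransformer implementation used for GLM-130B:

```python
import torch

def quantize_weights_int4(w: torch.Tensor):
    """Symmetric (absmax) quantization of a weight matrix to the 4-bit range [-7, 7].

    One FP16 scale is kept per output channel; only the weights are quantized, while
    activations stay in FP16. The 4-bit values are stored in an int8 container here
    for simplicity (a real kernel would pack two values per byte).
    """
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale.to(torch.float16)

def dequantize_weights(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP16 approximation of the weights before the FP16 matmul."""
    return (q.to(torch.float32) * scale.to(torch.float32)).to(torch.float16)
```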

2. Results

Performance Summary

For zero-shot performance, GLM-130B is better than GPT-3 175B (+5.0%), OPT-175B (+6.5%), and BLOOM-176B (+13.0%) on LAMBADA, and achieves 3× better performance than GPT-3 on Big-bench-lite.

For the 5-shot MMLU (Hendrycks et al., 2021) tasks, it is better than GPT-3 175B (+0.9%) and BLOOM-176B (+12.7%).

As a bilingual LLM also in Chinese, it offers significantly better results than ERNIE TITAN 3.0 260B — the largest Chinese LLM — on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE ones (+12.75%).

Importantly, as summarized in Figure 1(b), GLM-130B as an open model is associated with significantly less bias and generation toxicity than its 100B-scale counterparts.

SOTA Comparisons

GLM-130B exhibits performance that surpasses GPT-3 on a wide range of benchmarks (112 tasks in total) and also outperforms PaLM 540B in many cases, whereas such outperformance over GPT-3 is not observed for OPT-175B and BLOOM-176B.

  • (Please feel free to read the paper directly for more details.)
