Brief Review — ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

ERNIE 3.0: Scaling Up Model to 10B & Dataset to 4TB

Sik-Ho Tsang
4 min read · Oct 17, 2023
ERNIE 3.0 Framework

ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation, by Baidu Inc.
2021 arXiv v1, Over 200 Citations (Sik-Ho Tsang @ Medium)

Large Language Model (LLM)
2020 … 2023 [GPT-4] [LLaMA] [Koala] [BloombergGPT] [GLM-130B] [UL2] [PaLM 2] [Llama 2] [Med-PaLM]
==== My Other Paper Readings Are Also Over Here ====

  • After publishing ERNIE 1.0 and ERNIE 2.0, the authors proposed a unified framework named ERNIE 3.0 for pre-training large-scale knowledge-enhanced models, which fuses an auto-regressive network with an auto-encoding network.
  • The model is trained with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
  • Later, the 260B-parameter ERNIE 3.0 Titan was proposed.


  1. ERNIE 3.0
  2. Results

1. ERNIE 3.0

1.1. Universal Representation Module

ERNIE 3.0 uses a multi-layer Transformer-XL as the backbone, where Transformer-XL can support longer sequences than the standard Transformer.
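The key idea that lets Transformer-XL handle longer sequences is segment-level recurrence: attention keys and values span cached hidden states from the previous segment plus the current one. A toy single-head sketch (NumPy, illustrative only, not the real ERNIE implementation):

```python
import numpy as np

def attend_with_memory(h, memory):
    """Single-head attention over [cached memory; current segment].

    In Transformer-XL, `memory` holds hidden states cached from the
    previous segment (treated as constants, no gradient), so the
    effective context grows beyond the current segment length.
    """
    kv = np.concatenate([memory, h], axis=0)        # (mem_len + seg_len, d)
    scores = h @ kv.T / np.sqrt(h.shape[-1])        # queries: current segment only
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ kv                             # (seg_len, d)

seg_len, mem_len, d = 4, 8, 16
rng = np.random.default_rng(0)
memory = rng.normal(size=(mem_len, d))  # cached states from previous segment
h = rng.normal(size=(seg_len, d))       # current segment's hidden states
out = attend_with_memory(h, memory)
```

The real model also uses relative positional encodings so cached states remain valid across segments; that detail is omitted here for brevity.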

1.2. Task-Specific Representation Module

The task-specific representation module is also a multi-layer Transformer-XL rather than a simple classification head.

1.3. Pre-training Tasks

  • [Knowledge Masked Language Modeling] Introduced in ERNIE 1.0, it adds phrase masking and named-entity masking, which require predicting whole masked phrases and named entities.
  • [Document Language Modeling] is used so that ERNIE 3.0 can model a larger effective context length.
  • [Sentence Reordering] Introduced in ERNIE 2.0. Paragraph segments are permuted, and the pre-trained model is asked to restore their original order, modeled as a k-class classification problem.
  • [Sentence Distance] A 3-class classification problem in which the three categories indicate that two sentences are adjacent, non-adjacent but in the same document, or from two different documents.
Universal Knowledge-Text Prediction (UKTP)
  • [Universal Knowledge-Text Prediction (UKTP)]: Given a triple from a knowledge graph and the corresponding sentence from an encyclopedia, either the relation in the triple or words in the sentence are randomly masked. To predict the relation in the triple, the model must detect mentions of the head and tail entities and determine the semantic relationship between them in the corresponding sentence.
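To make the sentence reordering task concrete, here is a minimal sketch of how training examples can be generated: segments are shuffled and the label identifies which permutation was applied, so for a fixed number of segments n the task is n!-way classification. The helper below is hypothetical, not ERNIE's actual data pipeline:

```python
import itertools
import random

def make_reordering_example(segments):
    """Shuffle paragraph segments; label = index of the applied permutation.

    Enumerating all permutations of n segments turns reordering into an
    n!-way classification problem (illustrative sketch only).
    """
    perms = list(itertools.permutations(range(len(segments))))
    perm = random.choice(perms)
    shuffled = [segments[i] for i in perm]
    label = perms.index(perm)
    return shuffled, label

segments = ["The sky darkened.", "Rain began to fall.", "Everyone ran inside."]
shuffled, label = make_reordering_example(segments)

# The original order is recoverable from the label:
perm = list(itertools.permutations(range(len(segments))))[label]
restored = [None] * len(segments)
for pos, orig_idx in enumerate(perm):
    restored[orig_idx] = shuffled[pos]
assert restored == segments
```

ERNIE 2.0 additionally varies the number of segments per example, which enlarges the label space beyond a single n!; the fixed-n case above keeps the idea simple.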

1.4. Pre-training Algorithm

  • The model is trained progressively, simultaneously increasing several training factors: the input sequence length, the batch size, the learning rate, and the dropout rate.
  • For the universal representation module, a structure with 48 layers, 4096 hidden units and 64 heads, is used.
  • For the task-specific representation modules, a structure with 12 layers, 768 hidden units and 12 heads, is used.
  • The total number of parameters across the universal representation module and the task-specific representation modules is 10 billion.
  • GELU is used as the activation function.
  • The maximum sequence length of the context and the memory length for language generation are set to 512 and 128, respectively.
  • The total batch size of all pre-training tasks is set to 6144.
  • The model is trained for a total of 375 billion tokens on 384 NVIDIA V100 GPU cards and is implemented on the PaddlePaddle framework.
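The progressive training idea above can be sketched as a simple warmup schedule that linearly ramps each factor from a small starting value to its target. The function and the concrete start/end values are illustrative assumptions; the paper does not publish the exact schedule:

```python
def progressive_schedule(step, warmup_steps, start, end):
    """Linearly ramp a training factor (sequence length, batch size,
    learning rate, dropout rate) from `start` to `end` over
    `warmup_steps`, then hold it at `end`.

    Hypothetical sketch of progressive training; not ERNIE 3.0's
    actual schedule.
    """
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)

# e.g. grow the input sequence length from 128 to the final 512
# over an assumed 10k warmup steps
early = progressive_schedule(0, 10_000, 128, 512)
mid = progressive_schedule(5_000, 10_000, 128, 512)
late = progressive_schedule(10_000, 10_000, 128, 512)
```

Ramping these factors together lets early steps run cheaply on short sequences and small batches, while later steps use the full settings.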

1.5. Pre-training Data

Pre-training Data

A 4TB dataset spanning 11 different categories is used. To the authors’ best knowledge, it was the largest Chinese pre-training corpus at the time.

2. Results

2.1. Fine-Tuning

Results on Natural Language Understanding Tasks.

ERNIE 3.0 obtains the best results on nearly all datasets.

Results on Natural Language Generation Tasks.
LUGE Benchmark.

ERNIE 3.0 obtains the best results.

2.2. Zero-Shot

Zero-Shot Learning Tasks

ERNIE 3.0 obtains the best results.

2.3. Samples

Samples of Zero-Shot Generations
  • Some examples are shown above.

2.4. English

  • ERNIE 3.0 surpasses T5 and DeBERTa with a score of 90.6, taking first place on the SuperGLUE benchmark at the time.
  • (Please read the paper for more results.)


