Brief Review — ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
ERNIE 3.0: Scaling Up Model to 10B & Dataset to 4TB
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
ERNIE 3.0, by Baidu Inc.
2021 arXiv v1, Over 200 Citations (Sik-Ho Tsang @ Medium)
Large Language Model (LLM)
- After publishing ERNIE 1.0 and ERNIE 2.0, the authors propose ERNIE 3.0, a unified framework for pre-training large-scale knowledge-enhanced models that fuses an auto-regressive network and an auto-encoding network.
- The model is trained with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
- Later, the 260B-parameter ERNIE 3.0 Titan was proposed.
Outline
- ERNIE 3.0
- Results
1. ERNIE 3.0
1.1. Universal Representation Module
ERNIE 3.0 uses a multi-layer Transformer-XL as the backbone, since Transformer-XL's segment-level recurrence memory supports longer sequences than a standard Transformer, as sketched below.
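To make the recurrence idea concrete, here is a minimal single-head sketch, not the paper's implementation, of how a Transformer-XL-style attention step extends the context with a cached memory of previous segments; relative positional encoding, multi-head splitting and the output projection are omitted, and all tensor names are illustrative.

```python
import torch

def attend_with_memory(h, mem, w_q, w_k, w_v):
    """Single-head attention with Transformer-XL-style segment recurrence:
    keys/values also see the cached states of previous segments, so the
    effective context is longer than the current segment."""
    h_ext = torch.cat([mem, h], dim=0)            # memory + current segment
    q = h @ w_q                                   # queries: current segment only
    k, v = h_ext @ w_k, h_ext @ w_v               # keys/values: extended context
    attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v

d = 16
h = torch.randn(8, d)                             # toy current segment (seg_len=8)
mem = torch.randn(4, d)                           # toy cached memory (mem_len=4)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = attend_with_memory(h, mem, w_q, w_k, w_v)
print(out.shape)                                  # torch.Size([8, 16])
```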
1.2. Task-Specific Representation Module
Each task-specific representation module is also a multi-layer Transformer-XL rather than a simple task head: an NLU-specific module and an NLG-specific module both take the output of the shared universal representation module as input.
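A rough sketch of this layout follows, assuming tiny toy dimensions so it runs quickly; the paper's actual sizes (48 layers / 4096 hidden / 64 heads for the universal module, 12 layers / 768 hidden / 12 heads for each task-specific module) are noted in the comments. The bridge projection is my own simplification, and the NLG tower is shown without the causal masking an auto-regressive branch would need.

```python
import torch
import torch.nn as nn

def tower(layers, hidden, heads):
    layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                       activation="gelu", batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class ErnieLikeModel(nn.Module):
    """Sketch of the ERNIE 3.0 layout: one shared universal module feeding
    an NLU-specific tower and an NLG-specific tower.
    Paper sizes: universal module 48 layers / 4096 hidden / 64 heads,
    each task-specific module 12 layers / 768 hidden / 12 heads."""
    def __init__(self, d_shared=32, d_task=16):
        super().__init__()
        self.universal = tower(layers=2, hidden=d_shared, heads=4)
        self.bridge = nn.Linear(d_shared, d_task)    # match widths (simplification)
        self.nlu_tower = tower(layers=2, hidden=d_task, heads=2)
        self.nlg_tower = tower(layers=2, hidden=d_task, heads=2)

    def forward(self, x, task="nlu"):
        shared = self.bridge(self.universal(x))      # shared contextual features
        head = self.nlu_tower if task == "nlu" else self.nlg_tower
        return head(shared)

model = ErnieLikeModel()
out = model(torch.randn(2, 10, 32), task="nlg")      # (batch, seq_len, d_shared)
print(out.shape)                                     # torch.Size([2, 10, 16])
```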
1.3. Pre-training Tasks
- [Knowledge Masked Language Modeling], introduced in ERNIE 1.0, uses phrase masking and named entity masking, so the model must predict whole masked phrases and named entities rather than individual tokens.
- [Document Language Modeling] is used so that ERNIE 3.0 can model a longer effective context.
- [Sentence Reordering] is introduced in ERNIE 2.0: the segments of a paragraph are randomly permuted, and the pre-trained model is asked to recover their original order, modeled as a k-class classification problem.
- [Sentence Distance] is a 3-class classification problem in which the three categories indicate that two sentences are adjacent, non-adjacent but in the same document, or from two different documents.
- [Universal Knowledge-Text Prediction (UKTP)]: Given a pair consisting of a triple from a knowledge graph and the corresponding sentence from an encyclopedia, either the relation in the triple or words in the sentence are randomly masked. To predict the relation in the triple, the model needs to detect mentions of the head and tail entities and determine the semantic relationship that holds between them in the corresponding sentence (a toy construction of such a training pair is sketched after this list).
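Below is a toy sketch of how such a knowledge-text training pair could be constructed; the function name, special tokens, masking ratio and the example triple are illustrative rather than taken from the paper's implementation.

```python
import random

MASK, SEP = "[MASK]", "[SEP]"

def build_uktp_example(triple, sentence_tokens, mask_relation_prob=0.5):
    """Toy UKTP pair: a knowledge-graph triple concatenated with its
    encyclopedia sentence, where either the relation or some sentence
    words are masked and become the prediction targets."""
    head, relation, tail = triple
    sent_tokens = list(sentence_tokens)
    if random.random() < mask_relation_prob:
        # Mask the relation: the model must infer it from how the head and
        # tail entities are mentioned in the sentence.
        triple_tokens, targets = [head, MASK, tail], [relation]
    else:
        # Mask random words in the sentence: the triple supplies external
        # knowledge that helps recover them.
        triple_tokens = [head, relation, tail]
        idx = random.sample(range(len(sent_tokens)), k=max(1, len(sent_tokens) // 7))
        targets = [sent_tokens[i] for i in idx]
        for i in idx:
            sent_tokens[i] = MASK
    return triple_tokens + [SEP] + sent_tokens, targets

tokens, targets = build_uktp_example(
    ("Andersen", "Write", "Nightingale"),
    "Nightingale is written by Danish author Andersen".split(),
)
print(tokens, targets)
```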
1.4. Pre-training Algorithm
- The model is trained with progressive learning: the training factors, including the input sequence length, the batch size, the learning rate and the dropout rate, are increased gradually and simultaneously during training (a toy schedule is sketched after this list).
- For the universal representation module, a structure with 48 layers, 4096 hidden units and 64 heads is used.
- For the task-specific representation modules, a structure with 12 layers, 768 hidden units and 12 heads is used.
- The total number of parameters of the universal representation module and the task-specific representation modules is 10 billion.
- GELU is used as the activation function.
- The maximum sequence length of the context and the memory length for language generation are set to 512 and 128, respectively.
- The total batch size of all pre-training tasks is set to 6144.
- The model is trained on a total of 375 billion tokens with 384 NVIDIA V100 GPU cards and is implemented on the PaddlePaddle framework.
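As a rough illustration of the progressive schedule, assuming a hypothetical warm-up length and starting values, the helper below linearly ramps each training factor toward its target; the final sequence length (512) and batch size (6144) match the configuration listed above, while the learning-rate and dropout targets are illustrative.

```python
def ramp(step, warmup_steps, start, end):
    """Linearly increase a training factor from `start` to `end` over the
    warm-up stage, then hold it at `end`."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps

def progressive_schedule(step, warmup_steps=10_000):
    # Hypothetical starting values; the 512 / 6144 targets follow the paper.
    return {
        "seq_len":    int(ramp(step, warmup_steps, 128, 512)),
        "batch_size": int(ramp(step, warmup_steps, 1024, 6144)),
        "lr":         ramp(step, warmup_steps, 1e-5, 1e-4),
        "dropout":    ramp(step, warmup_steps, 0.0, 0.1),
    }

for step in (0, 5_000, 10_000, 20_000):
    print(step, progressive_schedule(step))
```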
1.5. Pre-training Data
A 4TB dataset spanning 11 different categories is used. To the authors' best knowledge, it was the largest Chinese pre-training corpus at that time.
2. Results
2.1. Fine-Tuning
ERNIE 3.0 obtains the best results on nearly all datasets.
ERNIE 3.0 obtains the best results.
2.2. Zero-Shot
ERNIE 3.0 obtains the best results.
2.3. Samples
- Some generation samples of ERNIE 3.0 are shown above.