Brief Review — ERNIE: Enhanced Language Representation with Informative Entities
ERNIE (Tsinghua), Using Informative Entities to Improve BERT
ERNIE: Enhanced Language Representation with Informative Entities,
ERNIE, by Tsinghua University, and Huawei Noah’s Ark Lab,
2019 ACL, Over 1000 Citations (Sik-Ho Tsang @ Medium)
- Informative entities in knowledge graphs (KGs) can enhance language representation with external knowledge.
- In this paper, both large-scale textual corpora and KGs are utilized to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously.
Outline
- ERNIE
- Results
1. ERNIE
1.1. Overall Architecture
- T-Encoder() is a multi-layer bidirectional Transformer encoder, which is identical to its implementation in BERT:
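In the paper's notation, this step is roughly:

$$\{w_1, \dots, w_n\} = \text{T-Encoder}(\{w_1, \dots, w_n\})$$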
- After computing {w1, …, wn}, ERNIE adopts a knowledgeable encoder K-Encoder to inject the knowledge information into language representation:
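With {e1, …, em} denoting the entity embeddings aligned to the sentence, this is roughly:

$$\{w_1^o, \dots, w_n^o\}, \{e_1^o, \dots, e_m^o\} = \text{K-Encoder}(\{w_1, \dots, w_n\}, \{e_1, \dots, e_m\})$$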
This K-Encoder fuses the heterogeneous information and computes the final output embeddings.
1.2. Knowledgeable Encoder
- K-Encoder consists of stacked aggregators.
- In the i-th aggregator, two multi-head self-attentions (MH-ATTs) are applied separately: one to the input token embeddings and one to the entity embeddings:
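In the paper's notation, these two attentions are roughly:

$$\{\tilde{w}_1^{(i)}, \dots, \tilde{w}_n^{(i)}\} = \text{MH-ATT}(\{w_1^{(i-1)}, \dots, w_n^{(i-1)}\})$$
$$\{\tilde{e}_1^{(i)}, \dots, \tilde{e}_m^{(i)}\} = \text{MH-ATT}(\{e_1^{(i-1)}, \dots, e_m^{(i-1)}\})$$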
- For a token wj and its aligned entity ek=f(wj), the information fusion process is as follows:
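Reconstructed from the paper, the fusion is roughly:

$$h_j = \sigma(\tilde{W}_t^{(i)} \tilde{w}_j^{(i)} + \tilde{W}_e^{(i)} \tilde{e}_k^{(i)} + \tilde{b}^{(i)})$$
$$w_j^{(i)} = \sigma(W_t^{(i)} h_j + b_t^{(i)})$$
$$e_k^{(i)} = \sigma(W_e^{(i)} h_j + b_e^{(i)})$$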
- where σ is GELU.
- For the tokens without corresponding entities, the information fusion layer computes the output embeddings without integration, as follows:
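In the same notation, this reduces to roughly:

$$h_j = \sigma(\tilde{W}_t^{(i)} \tilde{w}_j^{(i)} + \tilde{b}^{(i)})$$
$$w_j^{(i)} = \sigma(W_t^{(i)} h_j + b_t^{(i)})$$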
- For simplicity, the i-th aggregator operation is denoted as follows:
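$$\{w_1^{(i)}, \dots, w_n^{(i)}\}, \{e_1^{(i)}, \dots, e_m^{(i)}\} = \text{Aggregator}(\{w_1^{(i-1)}, \dots, w_n^{(i-1)}\}, \{e_1^{(i-1)}, \dots, e_m^{(i-1)}\})$$

To make the dataflow concrete, below is a minimal PyTorch sketch of one aggregator. It is not the official implementation; the dimensions (token_dim, entity_dim, number of heads) and the alignment format are assumptions for illustration only.

```python
# Minimal sketch of one ERNIE aggregator layer (illustrative, not the released code).
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    def __init__(self, token_dim=768, entity_dim=100, num_heads=12):
        super().__init__()
        # Two separate multi-head self-attentions: one over tokens, one over entities.
        self.token_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.entity_attn = nn.MultiheadAttention(entity_dim, 4, batch_first=True)
        # Information fusion: W~_t and W~_e project both streams into a shared space h_j.
        self.fuse_token = nn.Linear(token_dim, token_dim)
        self.fuse_entity = nn.Linear(entity_dim, token_dim, bias=False)
        # W_t and W_e map h_j back to new token / entity embeddings.
        self.out_token = nn.Linear(token_dim, token_dim)
        self.out_entity = nn.Linear(token_dim, entity_dim)
        self.act = nn.GELU()  # sigma in the paper

    def forward(self, tokens, entities, alignment):
        # tokens:    (batch, n, token_dim)    entities: (batch, m, entity_dim)
        # alignment: (batch, n) long tensor; alignment[b, j] is the index of the entity
        #            aligned with token j, or -1 if the token has no aligned entity.
        w, _ = self.token_attn(tokens, tokens, tokens)          # ~w_j
        e, _ = self.entity_attn(entities, entities, entities)   # ~e_k

        # Gather the aligned entity for each token; unaligned tokens get a zero vector.
        has_entity = (alignment >= 0).unsqueeze(-1).float()
        idx = alignment.clamp(min=0).unsqueeze(-1).expand(-1, -1, e.size(-1))
        aligned_e = torch.gather(e, 1, idx) * has_entity

        # h_j = sigma(W~_t ~w_j + W~_e ~e_k + b~); the entity term vanishes without alignment.
        h = self.act(self.fuse_token(w) + self.fuse_entity(aligned_e))

        new_tokens = self.act(self.out_token(h))                  # w_j^(i)
        new_entity_per_token = self.act(self.out_entity(h))       # e_k^(i), scattered back below

        # Scatter fused entity representations back; untouched entities keep ~e_k.
        new_entities = e.clone()
        batch, n, _ = tokens.shape
        for b in range(batch):
            for j in range(n):
                k = alignment[b, j].item()
                if k >= 0:
                    new_entities[b, k] = new_entity_per_token[b, j]
        return new_tokens, new_entities

# Example: 2 sentences of 5 tokens, 3 candidate entities, token 1 aligned to entity 0.
agg = Aggregator()
tok, ent = torch.randn(2, 5, 768), torch.randn(2, 3, 100)
align = torch.full((2, 5), -1, dtype=torch.long)
align[:, 1] = 0
new_tok, new_ent = agg(tok, ent, align)
```

Stacking several such aggregators gives the K-Encoder; because the entity stream can use a much smaller embedding size than the token stream, the knowledgeable module adds only a small number of parameters on top of BERT (see Section 1.5).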
1.3. Pre-training for Injecting Knowledge
- A new pre-training task for ERNIE is proposed, namely denoising entity auto-encoder (dEA), which randomly masks some token-entity alignments and then requires the system to predict all corresponding entities based on the aligned tokens (the prediction distribution is given below, after this list).
- In 5% of the time, for a given token-entity alignment, the entity is replaced with another random entity.
- In 15% of the time, token-entity alignments are masked.
- Similar to BERT, ERNIE also adopts the masked language model (MLM) and the next sentence prediction (NSP).
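The prediction distribution referred to above is, roughly, a softmax over the entities in the sequence:

$$p(e_j \mid w_i) = \frac{\exp(\text{linear}(w_i^o) \cdot e_j)}{\sum_{k=1}^{m} \exp(\text{linear}(w_i^o) \cdot e_k)}$$

where linear(·) is a linear layer, and the cross-entropy over this distribution serves as the dEA loss.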
1.4. Fine-tuning for Specific Tasks
- Depending on the task, the input token sequence is modified: the mention mark token [ENT] is added around entity mentions (e.g., for entity typing), or the tokens [HD] and [TL] mark the head entity and the tail entity respectively (for relation classification).
The modified input sequence with the mention mark token [ENT] can guide ERNIE to combine both context information and entity mention information attentively.
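A toy sketch of this input modification (a hypothetical helper, not the official preprocessing; the real pipeline operates on word pieces and also feeds the aligned entity embeddings):

```python
# Toy illustration of inserting mark tokens around entity mention spans.
def mark_entities(tokens, spans, marks):
    """tokens: list of tokens; spans: list of (start, end) index pairs (end exclusive);
    marks: one mark token per span, e.g. ["[ENT]"] or ["[HD]", "[TL]"]."""
    out = list(tokens)
    # Insert from right to left so earlier indices stay valid.
    for (start, end), mark in sorted(zip(spans, marks), reverse=True):
        out.insert(end, mark)
        out.insert(start, mark)
    return out

# Entity typing: the same [ENT] token surrounds the single mention.
print(mark_entities(["bob", "dylan", "wrote", "a", "song"], [(0, 2)], ["[ENT]"]))
# Relation classification: [HD] marks the head entity, [TL] the tail entity.
print(mark_entities(["bob", "dylan", "wrote", "a", "song"], [(0, 2), (4, 5)], ["[HD]", "[TL]"]))
```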
1.5. Pretraining Datasets
- Due to the large cost of training ERNIE from scratch, ERNIE adopts the parameters of BERT released by Google for initialization.
- English Wikipedia is used as the pre-training corpus and the text is aligned to Wikidata, which contains 5,040,986 entities and 24,267,796 fact triples.
- The total number of parameters of BERT-BASE is about 110M and ERNIE has about 114M, which means the knowledgeable module of ERNIE adds only a small number of parameters.
2. Results
2.1. Entity Typing
- Given an entity mention and its context, entity typing requires systems to label the entity mention with its respective semantic types.
On FIGER, compared with BERT, ERNIE significantly improves the strict accuracy, indicating the external knowledge regularizes ERNIE to avoid fitting the noisy labels and accordingly benefits entity typing.
On Open Entity, compared with BERT, ERNIE improves the precision by 2% and the recall by 2%, which means the informative entities help ERNIE predict the labels more precisely.
2.2. Relation Classification
- Relation classification aims to determine the correct relation between two entities in a given sentence, which is an important knowledge-driven NLP task.
On FewRel, ERNIE achieves an absolute F1 increase of 3.4% over BERT, which means fusing external knowledge is very effective.
On TACRED, ERNIE achieves the best recall and F1 scores, and increases the F1 of BERT by nearly 2.0%, which proves the effectiveness of the knowledgeable module for relation classification.
2.3. GLUE
- w/o entities and w/o dEA refer to fine-tuning ERNIE without the entity sequence input and without the pre-training task dEA, respectively.
Both the entity input and dEA are important contributors to the gains.