Brief Review — ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding

Continual Multi-Task Learning Without Forgetting

Sik-Ho Tsang
4 min read · Oct 15, 2023
The framework of ERNIE 2.0: pre-training tasks are constructed incrementally, and the model is pre-trained on them through continual multi-task learning.

ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding, by Baidu Inc.
2020 AAAI, Over 670 Citations (Sik-Ho Tsang @ Medium)


  • A continual pre-training framework named ERNIE 2.0 is proposed, which incrementally builds pre-training tasks and then learns pre-trained models on these constructed tasks via continual multi-task learning.
  • Later, ERNIE 3.0 was also proposed.


  1. ERNIE 2.0
  2. Results

1. ERNIE 2.0

1.1. Model

  • The model is a standard Transformer encoder.
  • A task embedding is introduced to represent the characteristics of different tasks. Each task has a unique id ranging from 0 to N.
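
To make the role of the task embedding concrete, here is a minimal pure-Python sketch. It assumes the task embedding is simply summed with the token, segment and position embeddings, following the usual BERT-style convention; the table sizes, the random initial values and the helper names (`make_table`, `embed`) are hypothetical and chosen only for illustration.

```python
import random

def make_table(rows, dim, seed):
    """Build a small random embedding table (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def embed(token_ids, segment_ids, task_id, tables):
    """Sum token, segment, position and task embeddings at every position.

    Assumption: the four embeddings are added element-wise.
    """
    out = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vecs = (tables["token"][tok], tables["segment"][seg],
                tables["position"][pos], tables["task"][task_id])
        out.append([sum(parts) for parts in zip(*vecs)])
    return out

DIM = 4
tables = {
    "token": make_table(100, DIM, 0),
    "segment": make_table(2, DIM, 1),
    "position": make_table(16, DIM, 2),
    "task": make_table(8, DIM, 3),  # one row per pre-training task id
}

# The same input gets a different representation depending on the task id.
x_task0 = embed([5, 7, 9], [0, 0, 1], task_id=0, tables=tables)
x_task1 = embed([5, 7, 9], [0, 0, 1], task_id=1, tables=tables)
```

In this sketch, the only difference between `x_task0` and `x_task1` is the task row that is added at every position, which is how the encoder could tell the tasks apart.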

1.2. Task Types

There are 3 kinds of pre-training tasks, as shown in the framework figure above: the word-aware tasks enable the model to capture lexical information, the structure-aware tasks enable the model to capture the syntactic information of the corpus, and the semantic-aware tasks aim to learn semantic information.

1.3. Word-aware Task

  • [Knowledge Masking Task] The pre-training task from ERNIE 1.0 is reused. It masks whole phrases and named entities and predicts the masked phrases and entities in full.
  • [Capitalization Prediction Task] Capitalized words usually carry specific semantic information compared with other words in a sentence, so a task is added to predict whether each word is capitalized.
  • [Token-Document Relation Prediction Task] This task predicts whether a token in a segment also appears in other segments of the original document.
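
As a rough illustration of how labels for two of these word-aware tasks could be derived from raw text, consider the sketch below. The function names and the case-insensitive matching are assumptions made for this example, not details taken from the paper.

```python
def capitalization_labels(tokens):
    """Label 1 if the token starts with an uppercase letter, else 0."""
    return [1 if tok[:1].isupper() else 0 for tok in tokens]

def token_document_labels(segment, other_segments):
    """Label 1 if the token also occurs in any other segment of the document.

    Assumption: matching ignores case.
    """
    seen = {tok.lower() for seg in other_segments for tok in seg}
    return [1 if tok.lower() in seen else 0 for tok in segment]

segment = ["Baidu", "released", "a", "new", "model"]
others = [
    ["the", "model", "uses", "a", "transformer"],
    ["baidu", "is", "a", "company"],
]

cap = capitalization_labels(segment)          # [1, 0, 0, 0, 0]
rel = token_document_labels(segment, others)  # [1, 0, 1, 0, 1]
```

Both tasks are token-level: each position in the segment gets its own binary label.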

1.4. Structure-aware Task

  • [Sentence Reordering Task] A given paragraph is randomly split into 1 to m segments, and the segments are then shuffled into a random permutation. The model is asked to restore the original order of these permuted segments, which is modeled as a k-class classification problem, where k is the total number of possible permutations.
  • [Sentence Distance Task] This task is modeled as a 3-class classification problem. “0” represents that the two sentences are adjacent in the same document, “1” represents that the two sentences are in the same document but not adjacent, and “2” represents that the two sentences are from two different documents.
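
The two structure-aware label schemes can be sketched as follows. The class count k = 1! + 2! + ... + m! comes from counting every permutation of 1 to m segments. The specific mapping from a permutation to a class index, and all function names, are assumptions made for this example.

```python
from itertools import permutations
from math import factorial

def num_reordering_classes(m):
    """k = 1! + 2! + ... + m!: every ordering of 1 to m segments."""
    return sum(factorial(n) for n in range(1, m + 1))

def reordering_label(order):
    """Map a permutation of n segments to a class index in [0, k).

    Assumption: classes are numbered by segment count first, then by the
    lexicographic rank of the permutation.
    """
    n = len(order)
    offset = sum(factorial(i) for i in range(1, n))
    rank = list(permutations(range(n))).index(tuple(order))
    return offset + rank

def sentence_distance_label(doc_a, idx_a, doc_b, idx_b):
    """0: adjacent in the same document, 1: same document but not adjacent,
    2: different documents."""
    if doc_a != doc_b:
        return 2
    return 0 if abs(idx_a - idx_b) == 1 else 1

k = num_reordering_classes(3)               # 1 + 2 + 6 = 9
label = reordering_label((2, 1, 0))         # 8, the last class for 3 segments
near = sentence_distance_label("d1", 4, "d1", 5)   # 0
far = sentence_distance_label("d1", 4, "d1", 9)    # 1
other = sentence_distance_label("d1", 4, "d2", 5)  # 2
```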

1.5. Semantic-aware Task

  • [Discourse Relation Task] A task is introduced to predict the semantic or rhetorical relation between two sentences. The dataset from Sileo et al. (2019) is used for this task.
  • [IR Relevance Task] This is a 3-class classification task that predicts the relationship between a query and a title, using search log data from a commercial search engine. “0” stands for strong relevance, meaning that users clicked the title after entering the query. “1” stands for weak relevance, meaning that the title appeared in the search results for the query but was not clicked. “2” means that the query and the title are completely irrelevant and randomly paired.
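
The IR relevance labeling rule above can be expressed as a small function. This is only a sketch: the boolean fields `shown` and `clicked` are hypothetical stand-ins for whatever the real search logs record.

```python
def ir_relevance_label(shown, clicked):
    """Map a (query, title) log record to the 3-class relevance label.

    0: the title was clicked for this query (strong relevance)
    1: the title was shown in the results but not clicked (weak relevance)
    2: the title never appeared for this query (random, irrelevant pair)
    """
    if clicked:
        return 0
    if shown:
        return 1
    return 2

# Toy records as (shown, clicked) pairs, invented for illustration.
records = [(True, True), (True, False), (False, False)]
labels = [ir_relevance_label(s, c) for s, c in records]  # [0, 1, 2]
```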

1.6. Continual Multi-Task Learning

Multi-Task Learning
  • Traditional continual learning methods train the model on only one task at each stage, with the drawback that the model may forget previously learned knowledge.

In ERNIE 2.0, during pre-training, one sentence-level loss function can be combined with multiple token-level loss functions to continually update the model.
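
The contrast between the two training regimes can be sketched as a stage schedule. This is a simplified illustration under stated assumptions: each stage introduces exactly one new task, and the combined loss is a plain unweighted sum. The function names and the example loss values are hypothetical.

```python
def continual_schedule(tasks):
    """Traditional continual learning: each stage trains on one task only."""
    return [[task] for task in tasks]

def continual_multitask_schedule(tasks):
    """Continual multi-task learning: each stage adds a new task and keeps
    training on all previously introduced tasks alongside it."""
    return [tasks[:i] for i in range(1, len(tasks) + 1)]

def combined_loss(sentence_loss, token_losses):
    """Combine one sentence-level loss with several token-level losses.

    Assumption: an unweighted sum.
    """
    return sentence_loss + sum(token_losses)

tasks = ["knowledge_masking", "capitalization", "sentence_reordering"]
plain = continual_schedule(tasks)
multi = continual_multitask_schedule(tasks)
total = combined_loss(0.7, [1.2, 0.3])
```

In `plain`, the last stage sees only `sentence_reordering`, so earlier tasks can be forgotten. In `multi`, the last stage still trains on all three tasks.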

2. Results

2.1. Datasets

Pretraining Datasets
  • Different pre-training datasets are used for English and Chinese.

2.2. English


ERNIE 2.0_BASE outperforms BERT_BASE on all of the 10 tasks and obtains a score of 80.6.

ERNIE 2.0_LARGE outperforms BERT_LARGE on all of the 10 tasks. It obtains a score of 83.6 on the GLUE test set, a 3.1% improvement over the previous SOTA pre-training model BERT_LARGE.

2.3. Chinese

9 Chinese NLP Tasks
  • ERNIE 1.0_BASE outperforms BERT_BASE on the XNLI, MSRA-NER, ChnSentiCorp, LCQMC and NLPCC-DBQA tasks.

The proposed ERNIE 2.0 makes further progress and significantly outperforms BERT_BASE on all nine tasks.

ERNIE 2.0_LARGE achieves the best performance.

2.4. Different Pretraining Methods


Continual multi-task learning obtains better performance on downstream tasks than the other two methods (plain multi-task learning and traditional continual learning), without sacrificing any efficiency.


