Brief Review — PPBERT: A Robustly Optimized BERT Pre-training Approach with Post-training

PPBERT: Pretraining+Post-Training+Fine-Tuning for BERT

Sik-Ho Tsang
3 min read · Jan 14, 2023

A Robustly Optimized BERT Pre-training Approach with Post-training (PPBERT), by Dongbei University of Finance and Economics, University of Southern California, Union Mobile Financial Technology, and IBM Research
2021 CCL, Over 50 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, LM, BERT

  • Compared with the original BERT, which follows the standard two-stage paradigm, PPBERT does not fine-tune the pre-trained model directly, but rather post-trains it on a domain- or task-related dataset first.
  • This helps to better incorporate task-aware and domain-aware knowledge into the pre-trained model, and also reduces bias coming from the training dataset.

Outline

  1. PPBERT
  2. Results

1. PPBERT

An illustration of the architecture of PPBERT, which is a ‘pre-training’-‘post-training’-then-‘fine-tuning’ three-stage BERT.

1.1. Pretraining (1st Stage)

  • The pre-training procedure follows that of the original BERT model.

1.2. Proposed Post-Training (2nd Stage)

  • PPBERT does not fine-tune the pre-trained model directly; instead, it first post-trains the model on a task- or domain-related training dataset.
  • That is, a second training stage, the ‘post-training’ stage, is added on an intermediate task before target-task fine-tuning (see the sketch after this list).
  • During post-training, each task is allocated K training iterations.
  • (Please feel free to read the paper directly for more details.)
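To make this stage concrete, here is a minimal sketch, assuming post-training is implemented as plain supervised training of the pre-trained checkpoint on one intermediate task for K iterations with HuggingFace Transformers. The choice of MNLI as the intermediate task, the value of K, and all hyper-parameters are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative sketch of the post-training stage (Stage 2), assuming it is
# plain supervised training of the pre-trained BERT checkpoint on an
# intermediate task for K iterations. MNLI, K, and all hyper-parameters
# are placeholders, not the paper's exact configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

K = 10_000  # training iterations allocated to this post-training task

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # start from the pre-trained checkpoint

# Intermediate task: MNLI (sentence-pair classification), used here only
# as an example of a task-related dataset.
mnli = load_dataset("glue", "mnli", split="train")
mnli = mnli.map(
    lambda b: tok(b["premise"], b["hypothesis"],
                  truncation=True, padding="max_length", max_length=128),
    batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ppbert_post_trained", max_steps=K,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=mnli,
).train()

# The post-trained weights become the starting point for Stage 3.
model.save_pretrained("ppbert_post_trained")
tok.save_pretrained("ppbert_post_trained")
```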

1.3. Fine-Tuning (3rd Stage)

  • A supervised dataset from the specific target task is used to further fine-tune the post-trained model, as sketched below.
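Continuing the same assumed setup, a minimal sketch of the fine-tuning stage: the post-trained checkpoint from the Stage-2 sketch is loaded and trained on the target task's supervised data (SST-2 here, again only an illustrative choice). A fresh classification head is created because the intermediate and target tasks have different label sets.

```python
# Illustrative sketch of the fine-tuning stage (Stage 3): load the
# post-trained checkpoint from Stage 2 and fine-tune it on the target
# task. SST-2 and all hyper-parameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("ppbert_post_trained")
model = AutoModelForSequenceClassification.from_pretrained(
    "ppbert_post_trained", num_labels=2,
    ignore_mismatched_sizes=True)  # new head: target task has a different label set

sst2 = load_dataset("glue", "sst2", split="train")
sst2 = sst2.map(
    lambda b: tok(b["sentence"],
                  truncation=True, padding="max_length", max_length=128),
    batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ppbert_fine_tuned", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=sst2,
).train()
```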

2. Results

2.1. GLUE

The overall performance of PPBERT and the comparison against BERT models on the GLUE benchmark.

PPBERT-Base achieves an average score of 81.53 and outperforms the standard BERT-Base on all 8 tasks.

PPBERT-Large outperforms BERT-Large on all 8 tasks and achieves an average score of 85.03.

  • Similar results are observed on the dev set, where PPBERT-Large achieves an average score of 87.02, a 2.97-point improvement over BERT-Large.
  • PPBERT-Large matches or even outperforms the human-level baseline.

2.2. SuperGLUE

Results on the SuperGLUE benchmark.

PPBERT significantly outperforms BERT on all 8 tasks.

  • Nevertheless, there is still a huge gap between human performance (89.79) and that of PPBERT (74.55).

2.3. SQuAD

Comparison with state-of-the-art results on the dev set of SQuAD.
  • ALBERT is also post-trained, giving PPALBERT. It is further post-trained with one additional QA dataset (SearchQA), becoming PPALBERT-Large-QA.

Compared with the BERT baseline, adding the post-training stage improves EM by 1.1 points (84.1 → 85.2) and F1 by 1.2 points (90.9 → 92.1); the EM and F1 metrics are sketched at the end of this subsection.

Similarly, PPALBERT-Large outperforms the ALBERT-Large baseline, by 0.3 EM and 0.2 F1.

  • In particular, PPALBERT-Large-QA, with the further QA post-training, improves over PPALBERT-Large by another 0.1 EM and 0.1 F1.
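For reference, a minimal sketch of the SQuAD-style EM and token-level F1 metrics quoted above. This follows the standard SQuAD definition (simplified, without the official answer normalization of articles and punctuation) and is not code from the PPBERT paper.

```python
# Simplified SQuAD-style metrics: exact match (EM) and token-level F1
# between a predicted answer string and a gold answer string.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    # EM is 1.0 only if the (lower-cased, stripped) strings are identical.
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred: str, gold: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over
    # whitespace tokens shared between prediction and gold answer.
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```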

2.4. 6 QA and NLI Tasks

Performance on six QA and NLI tasks.

PPALBERT-Large outperforms the ALBERT-Large baseline by 0.3 EM and 0.2 F1, and PPALBERT-Large-QA, with the further QA post-training, improves over PPALBERT-Large by another 0.1 EM and 0.1 F1.

  • Similar results are observed on SQuAD v2.0 development set.

Reference

[2021 CCL] [PPBERT] A Robustly Optimized BERT Pre-training Approach with Post-training

2.1. Language Model / Sequence Model

(Some are not related to NLP, but I just group them here)

1991 … 2020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT] [BART] [Longformer] [ELECTRA] [Megatron-LM] [SpanBERT] [UniLMv2] 2021 [PPBERT]

My Other Previous Paper Readings
