Review — GPT: Improving Language Understanding by Generative Pre-Training

Pretraining Using T-DMCA, a Kind of Transformer, for Other Downstream Tasks

6 min read · Dec 23, 2021

**OpenAI GPT** (Trm: Transformer, Figure from BERT)

Improving Language Understanding by Generative Pre-Training,
GPT, by OpenAI
2018 OpenAI Tech Report, Over 2700 citations (Sik-Ho Tsang @ Medium)
Language Model

Large gains on natural language understanding tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
Task-aware input transformations are used during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

This is a form of self-supervised learning: the pretext task is language model pre-training, and the downstream tasks include those in the GLUE benchmark.

Outline

GPT Framework
Task-Specific Input Transformations
Experimental Results

1. GPT Framework

The training procedure consists of two stages. The first stage is learning a high-capacity language model (Unsupervised Pre-Training) on a large corpus of text. This is followed by a (Supervised) fine-tuning stage, where we adapt the model to a discriminative task with labeled data.

1.1. Unsupervised Pre-Training

Given an unsupervised corpus of tokens U = {u1, …, un}, we use a standard language modeling objective to maximize the following likelihood:
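
In the paper's notation, with k the size of the context window and Θ the parameters of the neural network that models the conditional probability P, the objective reads:

```latex
L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
```

The parameters are trained with stochastic gradient descent.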

The language model is a multi-layer Transformer decoder, i.e. the decoder-only Transformer variant introduced in the T-DMCA paper.
This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:
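
As written in the paper, the forward computation is:

```latex
\begin{aligned}
h_0 &= U W_e + W_p \\
h_l &= \texttt{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) &= \texttt{softmax}(h_n W_e^{T})
\end{aligned}
```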

where U=(u_-k, …, u_-1) is the context vector of tokens, n is the number of layers, We is the token embedding matrix, and Wp is the position embedding matrix.
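
To make the computation above concrete, here is a minimal pure-Python sketch of a decoder-only forward pass. It is an illustration under simplifying assumptions, not the paper's implementation: it uses a single attention head, ReLU instead of the paper's activation, no layer normalization, and random weights. All function names (`forward`, `causal_attention`, `random_matrix`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(h, wq, wk, wv):
    """Single-head masked self-attention: position t attends to positions 0..t only."""
    q, k, v = matmul(h, wq), matmul(h, wk), matmul(h, wv)
    scale = math.sqrt(len(q[0]))
    out = []
    for t in range(len(h)):
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / scale for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out

def forward(tokens, we, wp, blocks):
    """Return one next-token distribution per input position."""
    # h_0 = U W_e + W_p: token embedding plus position embedding
    h = [[e + p for e, p in zip(we[tok], wp[pos])] for pos, tok in enumerate(tokens)]
    for wq, wk, wv, w1, w2 in blocks:
        # masked self-attention with a residual connection
        att = causal_attention(h, wq, wk, wv)
        h = [[x + y for x, y in zip(r, a)] for r, a in zip(h, att)]
        # position-wise feedforward layer (ReLU here, an assumption) with a residual
        hidden = [[max(x, 0.0) for x in row] for row in matmul(h, w1)]
        ffn = matmul(hidden, w2)
        h = [[x + y for x, y in zip(r, f)] for r, f in zip(h, ffn)]
    # P(u) = softmax(h_n W_e^T): the output layer reuses the token embedding matrix
    we_t = [list(col) for col in zip(*we)]
    return [softmax(row) for row in matmul(h, we_t)]

def lm_log_likelihood(tokens, probs):
    """Sum of log P(u_i | previous tokens); probs[t] predicts tokens[t + 1]."""
    return sum(math.log(probs[t][tokens[t + 1]]) for t in range(len(tokens) - 1))

def random_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialized matrix, used only to run the sketch."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
```

Because of the causal mask, changing the last input token leaves the distributions at all earlier positions unchanged, which is what allows every position to serve as a training example for next-token prediction.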

1.2. Supervised Fine-Tuning

Assume there is a labeled dataset C, where each instance consists of a sequence of input tokens, x1, …, xm, along with a label y. The inputs are passed through the pre-trained model to obtain the final Transformer block’s activation h_l^m, which is then fed into an added linear output layer with parameters Wy to predict y:
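
In the paper's notation, where input tokens carry superscripts (x^1, …, x^m) and h_l^m is the final block's activation at the last token, the prediction is:

```latex
P(y \mid x^1, \ldots, x^m) = \texttt{softmax}(h_l^m W_y)
```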

The following objective is then maximized:
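
As given in the paper, the sum runs over all labeled pairs (x, y) in C:

```latex
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
```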

It is found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence.
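
With the auxiliary language modeling term weighted by λ, the combined fine-tuning objective in the paper is:

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
```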
Overall, the only extra parameters we require during fine-tuning are Wy, and embeddings for delimiter tokens.
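
The fine-tuning head and the combined objective can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: `label_log_prob` and `fine_tuning_objective` are hypothetical names, the activation and weights are plain lists, and the default λ = 0.5 follows the value reported in the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_log_prob(h_last, w_y, label):
    """log P(y | x) for one example, using softmax(h_l^m W_y).

    h_last: final-block activation at the last token, a list of length d.
    w_y:    the added linear layer, a d x num_classes matrix (list of rows).
    label:  index of the correct class.
    """
    logits = [sum(h * w for h, w in zip(h_last, column)) for column in zip(*w_y)]
    return math.log(softmax(logits)[label])

def fine_tuning_objective(l2, l1, lam=0.5):
    """Combined objective L3 = L2 + lambda * L1 (to be maximized).

    l2: supervised log-likelihood on the labeled data.
    l1: language modeling log-likelihood on the same token sequences.
    """
    return l2 + lam * l1
```

For example, with an all-zero `w_y` and three classes, `label_log_prob` returns log(1/3) for any label, since the untrained head is uniform.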

2. Task-Specific Input Transformations

**Left: Transformer (particularly T-DMCA) architecture and training objectives used in this work. Right: Input transformations for fine-tuning on different tasks.**

For some tasks, like text classification, we can directly fine-tune the model.
Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers.
A traversal-style approach [52] is used, where the structured inputs are converted into an ordered sequence that the pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks.
For entailment tasks, the premise p and hypothesis h token sequences are concatenated, with a delimiter token ($) in between, as shown above.
For similarity tasks, there is no inherent ordering of the two sentences being compared, so the input sequence is modified to contain both possible sentence orderings (with a delimiter in between). Each ordering is processed independently to produce two sequence representations h_l^m, which are added element-wise before being fed into the linear output layer.
For question answering and commonsense reasoning, we are given a context document z, a question q, and a set of possible answers {ak}. The document context and question are concatenated with each possible answer, with a delimiter token added in between, to get [z, q, $, ak].
Each of these sequences is processed independently by the model, and the resulting scores are normalized via a softmax layer to produce an output distribution over possible answers.
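
The transformations described above can be sketched as simple list operations on token sequences. This is an illustrative sketch only: the function names are hypothetical, and the `<s>` and `<e>` strings stand in for the start and extract tokens that the paper's figure places around every sequence (only the `$` delimiter is named in the text above).

```python
import math

# Special tokens; the exact string spellings are assumptions for illustration.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def entailment_input(premise, hypothesis):
    """Premise and hypothesis concatenated with a delimiter in between."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity_inputs(sentence_a, sentence_b):
    """Both sentence orderings; each is processed independently by the model."""
    return (
        [START, *sentence_a, DELIM, *sentence_b, EXTRACT],
        [START, *sentence_b, DELIM, *sentence_a, EXTRACT],
    )

def add_representations(h_first, h_second):
    """Element-wise sum of the two sequence representations (similarity tasks)."""
    return [x + y for x, y in zip(h_first, h_second)]

def multiple_choice_inputs(context, question, answers):
    """One sequence [z, q, $, a_k] per candidate answer."""
    return [[START, *context, *question, DELIM, *answer, EXTRACT] for answer in answers]

def answer_distribution(scores):
    """Softmax over the per-answer scores produced by the model."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Since every task is reduced to one or more flat token sequences, the same pre-trained model can be reused unchanged, with only the linear output layer differing across tasks.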

3. Experimental Results

3.1. SOTA Comparison

BooksCorpus dataset is used for pre-training. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.
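An alternative dataset, the 1B Word Benchmark (used by ELMo), is of similar size but is shuffled at the sentence level, which destroys long-range structure. On BooksCorpus, the proposed language model achieves a very low token-level perplexity of 18.4.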
The model is trained for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens.
The GLUE benchmark is used for evaluation.
Fine-tuning is quick: 3 epochs of training were sufficient for most cases.

**Experimental results on natural language inference tasks, comparing our model with current state-of-the-art methods. 5× indicates an ensemble of 5 models. All datasets use accuracy as the evaluation metric.**

The proposed method significantly outperforms the baselines on four of the five datasets, achieving absolute improvements of up to 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI and 0.6% on SNLI over the previous best results.

On RTE, one of the smaller datasets evaluated on (2490 examples), an accuracy of 56% is achieved, which is below the 61.7% reported by a multi-task biLSTM model.

**Results on question answering and commonsense reasoning, comparing our model with current state-of-the-art methods. 9× means an ensemble of 9 models.**

The RACE dataset [30] consists of English passages with associated questions from middle and high school exams.
The Story Cloze Test [40] involves selecting the correct ending to multi-sentence stories from two options.

The proposed model again outperforms the previous best results by significant margins — up to 8.9% on Story Cloze, and 5.7% overall on RACE. This demonstrates the ability of our model to handle long-range contexts effectively.

**Semantic similarity and classification results, comparing our model with current state-of-the-art methods. All task evaluations in this table were done using the GLUE benchmark.**

Semantic similarity (or paraphrase detection) tasks involve predicting whether two sentences are semantically equivalent or not.
The performance delta on QQP is significant, with a 4.2% absolute improvement over Single-task BiLSTM + ELMo + Attn.
For classification tasks, the proposed model also achieves 91.3% accuracy on SST-2, which is competitive with the state-of-the-art results.

The proposed model also achieves an overall score of 72.8 on the GLUE benchmark, which is significantly better than the previous best of 68.9.

3.2. Analysis

Left: Effect of transferring increasing number of layers from the pre-trained language model on RACE and MultiNLI. Right: Plot showing the evolution of zero-shot performance on different tasks as a function of LM pre-training updates.

Left: Each layer in the pre-trained model contains useful functionality for solving target tasks.
Right: Zero-shot performance, obtained with simple heuristics that apply the generative model to each task without fine-tuning, is stable and steadily increases over training, suggesting that generative pre-training supports the learning of a wide variety of task-relevant functionality.

3.3. Ablation Study

**Analysis of various model ablations on different tasks**

The auxiliary objective helps on the NLI tasks and QQP. Overall, the trend suggests that larger datasets benefit from the auxiliary objective but smaller datasets do not.
A 5.6 average score drop is observed when using the LSTM instead of the Transformer. The LSTM only outperforms the Transformer on one dataset — MRPC.
Finally, the lack of pre-training hurts performance across all the tasks, resulting in a 14.8% absolute decrease in average score compared to the full model.

Reference

[2018 OpenAI] [GPT]
Improving Language Understanding by Generative Pre-Training

Natural Language Processing (NLP)

Language/Sequence Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] 2017 [TagLM] [CoVe] [MoE] 2018 [GLUE] [T-DMCA] [GPT]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer] [MoE]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]

Written by Sik-Ho Tsang
