Review: GPT-2 (NLP)

GPT-2, Much Larger Model Than GPT-1, Trained on Much Larger Data

Sik-Ho Tsang
4 min read · Feb 19, 2022
GPT-2 (Image from Jay Alammar)

Language Models are Unsupervised Multitask Learners
GPT-2, by OpenAI
OpenAI Tech Report, Over 2400 Citations (Sik-Ho Tsang @ Medium)
Language Model, Natural Language Processing, NLP, Generative Pre-Training, GPT

  • GPT-2 makes no major architectural changes relative to GPT-1 but is a much larger model. Also, GPT-2 is trained on a new, larger dataset of millions of webpages called WebText.
  • SOTA performance is obtained with zero-shot task transfer.


  1. GPT-2 Models
  2. WebText Dataset
  3. Experimental Results

1. GPT-2 Models

GPT-2 Variants (Image from Jay Alammar)
  • GPT-2 uses the Transformer decoder as its model architecture, the same as GPT-1 except for changes in dimensionality, the number of decoder blocks, and some minor modifications.
  • Layer normalization is moved to the input of each sub-block, similar to a Pre-Activation ResNet, and an additional layer normalization is added after the final self-attention block.
  • The vocabulary is expanded to 50,257. The context size is also increased from 512 to 1024 tokens and a larger batch size of 512 is used.
  • The smallest model is equivalent to the original GPT-1.
  • The second smallest is equivalent to the largest model from BERT.
  • The largest model has over an order of magnitude more parameters than GPT-1.
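To make the size comparison above concrete, here is a rough parameter count for a GPT-2-style decoder. This is only a sketch under stated assumptions: the (layers, width) pairs are taken from Table 2 of the GPT-2 paper rather than from this review, and the formula assumes a 4x feed-forward expansion, biases on every linear layer, learned positional embeddings, and an output layer tied to the token embedding. The name `count_parameters` is purely illustrative.

```python
VOCAB_SIZE = 50257    # vocabulary size stated above
CONTEXT_SIZE = 1024   # context length stated above

# (number of decoder blocks, model width); assumed from Table 2 of the GPT-2 paper.
CONFIGS = {
    "smallest": (12, 768),
    "second smallest": (24, 1024),
    "second largest": (36, 1280),
    "largest": (48, 1600),
}


def count_parameters(n_layers, d_model, vocab=VOCAB_SIZE, context=CONTEXT_SIZE):
    """Approximate parameter count of a GPT-2-style decoder stack."""
    # Self-attention: fused Q/K/V projection plus output projection, with biases.
    attention = 4 * d_model * d_model + 4 * d_model
    # Feed-forward: d -> 4d -> d, with biases (4x expansion is an assumption).
    mlp = 8 * d_model * d_model + 5 * d_model
    # Two layer norms per block, each with a gain and a bias vector.
    layer_norms = 2 * 2 * d_model
    per_block = attention + mlp + layer_norms
    # Token and position embeddings; the output layer is assumed tied to the token embedding.
    embeddings = vocab * d_model + context * d_model
    # The additional layer norm after the final block.
    final_layer_norm = 2 * d_model
    return n_layers * per_block + embeddings + final_layer_norm


if __name__ == "__main__":
    for name, (n_layers, d_model) in CONFIGS.items():
        print(f"{name}: {count_parameters(n_layers, d_model) / 1e6:.0f}M parameters")
```

Under these assumptions the smallest configuration comes out at roughly 124M parameters and the largest at roughly 1.56B, a ratio of about 12.5x, which matches the "over an order of magnitude" statement. The exact totals may differ slightly from the figures reported in the paper.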

(In this paper, the authors do not say much about the model architecture since it is quite close to the one in GPT-1. But if you’re interested in the model architecture, I strongly recommend Jay Alammar’s article, which explains the GPT-2 model architecture in detail. It’s a very good article.)
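The layer-normalization change mentioned in Section 1 can be illustrated with a toy sketch. It contrasts the pre-LN ordering (normalize the input of each sub-block, then add the residual) with the post-LN ordering of the original Transformer. Everything here is a simplified assumption: vectors are plain Python lists, the learned gain and bias of layer norm are omitted, and `attn` and `mlp` are hypothetical stand-in functions rather than real attention or feed-forward layers.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [p + q for p, q in zip(a, b)]


def pre_ln_block(x, attn, mlp):
    """GPT-2-style ordering: layer norm sits at the input of each sub-block."""
    x = add(x, attn(layer_norm(x)))
    x = add(x, mlp(layer_norm(x)))
    return x


def post_ln_block(x, attn, mlp):
    """Original Transformer ordering: layer norm follows each residual sum."""
    x = layer_norm(add(x, attn(x)))
    x = layer_norm(add(x, mlp(x)))
    return x


def gpt2_style_stack(x, blocks):
    """Apply pre-LN blocks, then the additional layer norm after the final block."""
    for attn, mlp in blocks:
        x = pre_ln_block(x, attn, mlp)
    return layer_norm(x)
```

One consequence visible even in this toy version: in the pre-LN block the residual path is never normalized, so a sub-block that outputs zeros leaves its input unchanged, whereas the post-LN block always rescales it.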

2. WebText Dataset

  • A new web scrape is created which emphasizes document quality.
  • To do this, only web pages that have been curated/filtered by humans are scraped. Manually filtering a full web scrape would be exceptionally expensive, so as a starting point, all outbound links from Reddit, a social media platform, that received at least 3 karma are scraped.
  • The resulting dataset, WebText, contains the text subset of these 45 million links. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which, after de-duplication and some heuristic-based cleaning, contains slightly over 8 million documents for a total of 40 GB of text. All Wikipedia documents are removed from WebText since Wikipedia is a common data source for other datasets.
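The selection rules described above can be sketched as a small filter. This is a hypothetical illustration, not the authors' pipeline: the `RedditLink` record, the `select_links` function, and the exact cutoff day are assumptions, de-duplication is done on URLs here for simplicity, and the text extraction and heuristic cleaning steps are not modeled.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse

MIN_KARMA = 3                 # "at least 3 karma"
CUTOFF = date(2017, 12, 31)   # assumed reading of "not after Dec 2017"


@dataclass(frozen=True)
class RedditLink:
    """Hypothetical record for one outbound link found on Reddit."""
    url: str
    karma: int
    created: date


def is_wikipedia(url):
    """True if the URL points at wikipedia.org or one of its subdomains."""
    host = urlparse(url).netloc.lower()
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def select_links(links):
    """Keep links that pass the karma, date, and Wikipedia filters, dropping duplicates."""
    seen = set()
    kept = []
    for link in links:
        if link.karma < MIN_KARMA:
            continue  # not enough human endorsement
        if link.created > CUTOFF:
            continue  # outside the preliminary WebText snapshot
        if is_wikipedia(link.url):
            continue  # Wikipedia is removed to limit overlap with other datasets
        if link.url in seen:
            continue  # simplified de-duplication on the URL
        seen.add(link.url)
        kept.append(link.url)
    return kept
```

Using karma as the filter means the quality judgment is delegated to Reddit users, which is what makes the curation cheap compared with manual review.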

3. Experimental Results

3.1. Zero-Shot Results on Downstream Tasks

Zero-shot results on many datasets. No training or fine-tuning was performed.
  • GPT-2 is still significantly worse than prior work on the One Billion Word Benchmark (1BW). This is likely because 1BW is both the largest dataset and has some of the most destructive pre-processing: its sentence-level shuffling removes all long-range structure.
  • (The paper gives very detailed results and analyses for each downstream task. Since covering them all would take too many pages, I don’t go through them task by task here.)

In summary, WebText LMs transfer well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting.

3.2. WebText Perplexity

The performance of LMs trained on WebText as a function of model size.
  • As shown above, performance on the training and test sets of WebText is similar, and both improve together as model size increases.

This suggests GPT-2 is still underfitting on WebText in many ways.
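For readers unfamiliar with the metric in this subsection, perplexity is the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available from some language model; the function name `perplexity` and its input format are illustrative choices, not something defined in the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token behaves like a
    # uniform guess among 4 options.
    print(perplexity([math.log(0.25)] * 8))
```

Lower is better. When training and test perplexity are close and both keep falling as the model grows, the model has not yet memorized its training data, which is the basis for the underfitting remark above.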

3.3. Machine Translation

Zero-shot performance

On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model, achieving 11.5 BLEU.

  • This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019).
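As a rough illustration of how zero-shot translation can be prompted: the paper conditions the model on example pairs written as "sentence = translation" and then ends the context with the sentence to translate followed by "=". The helper below is a sketch of that prompt construction only. The function name, the newline separator between pairs, and the example sentences are assumptions, and no model call is shown.

```python
def build_translation_prompt(example_pairs, source_sentence):
    """Build a 'source = target' prompt that ends with the sentence to translate."""
    lines = [f"{source} = {target}" for source, target in example_pairs]
    # The final line is left open after '=' so the model continues with the translation.
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical French-to-English example pairs.
    pairs = [
        ("Bonjour.", "Hello."),
        ("Merci beaucoup.", "Thank you very much."),
    ]
    print(build_translation_prompt(pairs, "Où est la gare ?"))
```

The model is never told that the task is translation. The pattern in the context is the only task specification, which is what "zero-shot" means in this setting.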


[2019 OpenAI] [GPT-2]
Language Models are Unsupervised Multitask Learners

