Review — RoBERTa: A Robustly Optimized BERT Pretraining Approach

RoBERTa, Better Hyperparameters to pretrain BERT

Sik-Ho Tsang
Feb 13, 2022

RoBERTa: A Robustly Optimized BERT Pretraining Approach
RoBERTa, by University of Washington, and Facebook AI
2019 arXiv, Over 2700 Citations
Natural Language Processing, NLP, Language Model, BERT

  • A replication study of BERT pretraining is done to carefully measure the impact of many key hyperparameters and training data size.
  • The original BERT was significantly undertrained. With better pretraining, it can match or exceed the performance of every model published after it.


  1. RoBERTa Modification Summary
  2. Robustly optimized BERT approach (RoBERTa)
  3. Experimental Results

1. RoBERTa Modification Summary

  • The modifications are simple. They include:
  1. Training the model longer, with bigger batches, over more data.
  2. Removing the next sentence prediction objective.
  3. Training on longer sequences.
  4. Dynamically changing the masking pattern for the training data.
  5. A large new dataset (CC-NEWS) of comparable size to other privately used datasets is collected, to better control for training set size effects.

2. Robustly optimized BERT approach (RoBERTa)

2.1. Static vs. Dynamic Masking

  • The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask.
  • For dynamic masking, the masking pattern is regenerated every time a sequence is fed to the model.
Comparison between static and dynamic masking for BERT_BASE.
  • The reimplementation with static masking performs similarly to the original BERT model, and dynamic masking is comparable to or slightly better than static masking.
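The difference between the two schemes can be illustrated with a toy sketch. It is not the paper's implementation: the `mask_tokens` helper, the whitespace tokenizer, and the fixed seeds are assumptions made for illustration, and it always substitutes `[MASK]` rather than applying BERT's full replacement rule.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with about mask_prob of positions replaced by MASK.

    Hypothetical helper: real BERT-style masking also keeps or randomly
    replaces some of the selected tokens, which is omitted here.
    """
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog again and again".split()

# Static masking: mask once during preprocessing and reuse that pattern.
static_example = mask_tokens(tokens, random.Random(0))
static_epochs = [static_example for _ in range(4)]

# Dynamic masking: draw a fresh pattern each time the sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(4)]

for epoch, example in enumerate(dynamic_epochs):
    print(epoch, " ".join(example))
```

With static masking the model sees the same masked positions in every epoch, while dynamic masking exposes it to different positions of the same sentence over time.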

2.2. Model Input Format and Next Sentence Prediction

  • In the original BERT pretraining procedure, the model observes two concatenated document segments, which are either sampled contiguously from the same document (with p = 0.5) or from distinct documents.
  • The model is trained to predict whether the observed document segments come from the same or distinct documents via an auxiliary Next Sentence Prediction (NSP) loss.
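  • In the DOC-SENTENCES format, each input is packed with full sentences sampled contiguously from one document, and the NSP loss is removed.
  • DOC-SENTENCES obtains slightly better performance than FULL-SENTENCES, where inputs may cross document boundaries. However, because DOC-SENTENCES results in variable batch sizes, the paper uses FULL-SENTENCES (also without NSP) in the remaining experiments.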
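The DOC-SENTENCES packing described above can be sketched as a greedy loop over the sentences of a single document. This is only an illustration: the `pack_doc_sentences` name, the tiny `max_tokens` value, and the truncation of over-long sentences are assumptions, not details taken from the paper.

```python
def pack_doc_sentences(document, max_tokens):
    """Greedily pack contiguous sentences from ONE document into inputs of
    at most max_tokens tokens. Inputs never cross the document boundary.

    document: list of sentences, each sentence a list of tokens.
    """
    inputs, current = [], []
    for sentence in document:
        # Assumption of this sketch: truncate a sentence that alone exceeds the limit.
        sentence = sentence[:max_tokens]
        if current and len(current) + len(sentence) > max_tokens:
            inputs.append(current)
            current = []
        current = current + sentence
    if current:
        inputs.append(current)
    return inputs

doc = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"], ["j"]]
packed = pack_doc_sentences(doc, max_tokens=5)
print(packed)
```

Because the last input of each document may be shorter than the limit, the number of tokens per input varies with this format.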

2.3. Training with Large Batches

Perplexity on held-out training data (ppl) and development set accuracy for base models trained over BOOKCORPUS and WIKIPEDIA with varying batch sizes (bsz).
  • Originally, BERT_BASE was trained for 1M steps with a batch size of 256 sequences.
  • Training with large batches improves perplexity for the masked language modeling objective, as well as end-task accuracy.
  • Large batches are also easier to parallelize via distributed data parallel training, and RoBERTa trains with batches of 8K sequences.
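To compare batch sizes fairly, the number of training steps can be scaled so that the total number of sequences processed stays the same. The arithmetic below is a small sketch: `equivalent_steps` is a hypothetical helper, and it assumes "2K" and "8K" mean 2048 and 8192 sequences.

```python
def equivalent_steps(base_steps, base_bsz, new_bsz):
    """Number of steps at new_bsz that processes the same total number of
    sequences as base_steps at base_bsz."""
    return base_steps * base_bsz // new_bsz

# Baseline from the text: 1M steps at a batch size of 256 sequences.
for bsz in (256, 2048, 8192):
    print(bsz, equivalent_steps(1_000_000, 256, bsz))
```

Under these assumptions, a batch size of 2048 corresponds to 125,000 steps and a batch size of 8192 to 31,250 steps at roughly the same computational cost.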

2.4. Text Encoding

  • The original BERT uses a character-level BPE vocabulary of size 30K.
  • RoBERTa instead considers training BERT with a larger byte-level BPE vocabulary containing 50K subword units.
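The appeal of a byte-level vocabulary is that its base symbols are the 256 possible byte values, so any input text can be encoded without an "unknown" token. The sketch below shows that idea together with a single BPE merge step. It is a toy illustration, not the tokenizer RoBERTa uses, and all function names are hypothetical.

```python
from collections import Counter

def to_byte_symbols(text):
    """Represent text as a list of UTF-8 byte values (base vocabulary: 0..255)."""
    return list(text.encode("utf-8"))

def most_frequent_pair(symbols):
    """Return the most common adjacent pair, the candidate for the next BPE merge."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Non-ASCII characters simply become several bytes; nothing is out of vocabulary.
symbols = to_byte_symbols("naïve café")

# One merge step on a toy string: the new subword unit gets the next free id, 256.
toy = to_byte_symbols("abababc")
pair = most_frequent_pair(toy)
toy_merged = merge_pair(toy, pair, new_id=256)
print(pair, toy_merged)
```

Repeating the merge step grows the vocabulary from the 256 base bytes toward the target size, such as the 50K subword units mentioned above.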

2.5. CC-NEWS

  • CC-NEWS is collected from the English portion of the CommonCrawl News dataset (Nagel, 2016). The data contains 63 million English news articles crawled between September 2016 and February 2019 (76GB after filtering).

3. Experimental Results

Development set results for RoBERTa when pretraining over more data (16GB → 160GB of text) and pretraining for longer (100K → 300K → 500K steps).

RoBERTa provides a large improvement over the originally reported BERT_LARGE results.

  • Further improvements are observed in performance across all downstream tasks, validating the importance of data size and diversity in pretraining.
Results on GLUE. All results are based on a 24-layer architecture.
  • In the first setting (single-task, dev), RoBERTa achieves state-of-the-art results on all 9 of the GLUE task development sets.
  • In the second setting (ensembles, test), RoBERTa was submitted to the GLUE leaderboard and achieved state-of-the-art results on 4 out of 9 tasks, as well as the highest average score at that time. This is especially exciting because RoBERTa does not depend on multi-task finetuning.
Results on SQuAD
  • On the SQuAD v1.1 development set, RoBERTa matches the state-of-the-art set by XLNet.
  • On the SQuAD v2.0 development set, RoBERTa sets a new state-of-the-art, improving over XLNet by 0.4 points (EM) and 0.6 points (F1).
Results on the RACE test set
  • RoBERTa achieves state-of-the-art results on both middle-school and high-school settings.

Though RoBERTa improves on BERT with SOTA results, it was unfortunately rejected from ICLR 2020 because the reviewers considered most of the findings obvious (careful tuning helps, more data helps) and the novelty and technical contributions rather limited. (From OpenReview)


[2019 arXiv] [RoBERTa]
RoBERTa: A Robustly Optimized BERT Pretraining Approach

Natural Language Processing (NLP)

Language/Sequence Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] 2017 [TagLM] [CoVe] [MoE] 2018 [GLUE] [T-DMCA] [GPT] [ELMo] 2019 [T64] [Transformer-XL] [BERT] [RoBERTa]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer] [MoE] [GMNMT]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]
