Review — SpanBERT: Improving Pre-training by Representing and Predicting Spans

SpanBERT, Mask a Span of Tokens for Pretraining

Sik-Ho Tsang
4 min read · Nov 19, 2022

SpanBERT: Improving Pre-training by Representing and Predicting Spans, by University of Washington, Princeton University, Allen Institute of Artificial Intelligence, and Facebook AI Research
2020 TACL, Over 1100 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, LM, BERT

  • SpanBERT extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span.


  1. SpanBERT
  2. Results

1. SpanBERT

SpanBERT training. The span "an American football game" is masked. The SBO uses the output representations of the boundary tokens, x4 and x9 (in blue), to predict each token in the masked span. The equation shows the MLM and SBO loss terms for predicting the token "football" (in pink), which, as marked by the position embedding p3, is the third token from x4.

1.1. Span Masking

  • A span length (number of words) is first sampled from a geometric distribution l ∼ Geo(p), which is skewed towards shorter spans.
  • The paper sets p=0.2 and clips l at a maximum of 10, which yields a mean span length of mean(l)=3.8. The paper's span-length figure shows the resulting distribution of span mask lengths.
  • Similar to BERT, SpanBERT also masks 15% of the tokens in total: replacing 80% of the masked tokens with [MASK], 10% with random tokens, and 10% with the original tokens.
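
The span-length arithmetic above can be checked with a small sketch. It assumes that "clip at 10" means truncating the geometric distribution to lengths 1..10 and renormalizing, since that reading reproduces the quoted mean of about 3.8 (hard clipping with min(l, 10) would give roughly 4.46 instead). The function names are hypothetical, and the sampler works on token positions only, ignoring the whole-word handling and the 80/10/10 replacement step.

```python
import random

def span_length_pmf(p=0.2, max_len=10):
    """Geometric weights p*(1-p)^(k-1) for k = 1..max_len, renormalized to sum to 1.

    Assumption: "clip at max_len" is read as truncation plus renormalization.
    """
    weights = [p * (1 - p) ** (k - 1) for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def mean_span_length(pmf):
    """Expected span length under the given distribution over lengths 1..len(pmf)."""
    return sum(k * prob for k, prob in enumerate(pmf, start=1))

def sample_spans(num_tokens, mask_ratio=0.15, p=0.2, max_len=10, seed=0):
    """Pick non-overlapping spans until about mask_ratio of the positions are covered.

    Returns a sorted list of (start, end) pairs with inclusive 0-based positions.
    """
    rng = random.Random(seed)
    pmf = span_length_pmf(p, max_len)
    lengths = list(range(1, max_len + 1))
    budget = max(1, round(num_tokens * mask_ratio))
    masked = set()
    spans = []
    while len(masked) < budget:
        length = rng.choices(lengths, weights=pmf)[0]
        # Shorten the last span so the masking budget is met exactly.
        length = min(length, budget - len(masked))
        start = rng.randrange(num_tokens - length + 1)
        positions = range(start, start + length)
        if any(pos in masked for pos in positions):
            continue  # overlaps an earlier span, so resample
        masked.update(positions)
        spans.append((start, start + length - 1))
    return sorted(spans)
```

Under these assumptions, `mean_span_length(span_length_pmf())` evaluates to roughly 3.80, and `sample_spans(512)` covers 77 of the 512 positions (about 15%).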

1.2. Span Boundary Objective (SBO)

  • A span boundary objective (SBO) is introduced that involves predicting each token of a masked span using only the representations of the observed tokens at the boundaries.
  • Given a masked span of tokens (x_s, ..., x_e) ∈ Y, where (s, e) indicates its start and end positions, each token x_i in the span is predicted using only the output encodings of the external boundary tokens x_{s−1} and x_{e+1}, together with the position embedding of the target token p_{i−s+1}: y_i = f(x_{s−1}, x_{e+1}, p_{i−s+1}).
  • Cross-entropy loss is used, which is exactly like the MLM objective.
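  • SpanBERT sums the losses from the span boundary objective and the regular masked language model objective for each token x_i in the masked span: L(x_i) = L_MLM(x_i) + L_SBO(x_i). For the token "football" in the figure, this is L(football) = −log P(football | x7) − log P(football | x4, x9, p3).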
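
To make the SBO computation concrete, here is a minimal pure-Python sketch. It assumes, following the paper's description, that f is a 2-layer feed-forward network with GeLU activations and layer normalization applied to the concatenation of the two boundary encodings and the relative position embedding. The class `SBOHead`, its randomly initialized weights, and the separate output matrix are hypothetical stand-ins for illustration, not the authors' implementation.

```python
import math
import random

def gelu(v):
    return [0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in v]

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def linear(weights, v):
    # weights is a list of rows; bias terms are omitted to keep the sketch short.
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def log_softmax(logits):
    top = max(logits)
    lse = top + math.log(sum(math.exp(z - top) for z in logits))
    return [z - lse for z in logits]

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class SBOHead:
    """Hypothetical span boundary objective head with random (untrained) weights."""

    def __init__(self, hidden, vocab, max_span=10, seed=0):
        rng = random.Random(seed)
        self.pos = random_matrix(rng, max_span, hidden)   # rows are p_1 .. p_max_span
        self.w1 = random_matrix(rng, hidden, 3 * hidden)
        self.w2 = random_matrix(rng, hidden, hidden)
        self.out = random_matrix(rng, vocab, hidden)      # assumed output projection

    def log_probs(self, encoded, s, e, i):
        """Log-probabilities over the vocabulary for token i of the span s..e (inclusive)."""
        # List "+" is concatenation here: h0 = [x_{s-1}; x_{e+1}; p_{i-s+1}].
        h0 = encoded[s - 1] + encoded[e + 1] + self.pos[i - s]
        h1 = layer_norm(gelu(linear(self.w1, h0)))
        y_i = layer_norm(gelu(linear(self.w2, h1)))
        return log_softmax(linear(self.out, y_i))

    def sbo_loss(self, encoded, s, e, i, target):
        """Cross-entropy term L_SBO = -log P(target | y_i)."""
        return -self.log_probs(encoded, s, e, i)[target]

def span_token_loss(mlm_log_probs, sbo_log_probs, target):
    """L(x_i) = L_MLM(x_i) + L_SBO(x_i) for a single masked token."""
    return -mlm_log_probs[target] - sbo_log_probs[target]
```

For the figure's example, the span covers positions 5 to 8 and "football" sits at position 7, so the call would be `head.log_probs(encoded, s=5, e=8, i=7)`, which reads only `encoded[4]`, `encoded[9]`, and the third position embedding. Nothing inside the masked span is consulted.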

1.3. Single-Sequence Training

  • Instead of BERT's two-segment input with the next sentence prediction (NSP) objective, SpanBERT simply samples a single contiguous segment of up to n=512 tokens. Because the 512-token budget is no longer split across two segments, that single segment can be much longer than either of BERT's.
  • It is conjectured that single-sequence training is superior to bi-sequence training with NSP because (a) the model benefits from longer full-length contexts, or (b) conditioning on, often unrelated, context from another document adds noise to the masked language model.
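
The difference in input construction can be sketched as follows. Both helper names are hypothetical, and `sample_bi_sequence` is a deliberately simplified stand-in for BERT's procedure: the real one also picks the true next segment as segment B half of the time and does not force equal halves. The point is only that a two-segment input splits the n-token budget, while a single-sequence input spends it on one contiguous stretch.

```python
import random

def sample_single_segment(doc_tokens, n=512, rng=None):
    """Return one contiguous segment of up to n tokens from a single document."""
    rng = rng or random
    if len(doc_tokens) <= n:
        return list(doc_tokens)
    start = rng.randrange(len(doc_tokens) - n + 1)
    return list(doc_tokens[start:start + n])

def sample_bi_sequence(doc_a, doc_b, n=512, rng=None):
    """Simplified two-segment input: two shorter segments that share the n-token budget."""
    half = n // 2
    seg_a = sample_single_segment(doc_a, half, rng)
    seg_b = sample_single_segment(doc_b, n - half, rng)
    return seg_a, seg_b
```

With n=512, the single-sequence sampler yields up to 512 consecutive tokens of context, whereas each segment in this simplified bi-sequence version gets at most 256.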

2. Results

2.1. SQuAD

Test results on SQuAD 1.1 and SQuAD 2.0.

On SQuAD 1.1 and SQuAD 2.0, SpanBERT exceeds the authors' reimplemented BERT baseline by 2.0% and 2.8% F1, respectively (3.3% and 5.4% over Google BERT).

2.2. QA

Performance (F1) on the five MRQA extractive question answering tasks.

This trend goes beyond SQuAD and is consistent across every MRQA dataset. On average, SpanBERT improves F1 by 2.9% over the authors' reimplementation of BERT.

2.3. Coreference Resolution

Performance on the OntoNotes coreference resolution benchmark

SpanBERT improves considerably over the BERT baselines, achieving a new state of the art of 79.6% F1 (the previous best result was 73.0%).

2.4. Relation Extraction

Test performance on the TACRED relation extraction benchmark

SpanBERT exceeds the reimplementation of BERT by 3.3% F1 and achieves close to the current state of the art.

2.5. GLUE

Test set performance on GLUE tasks

The main gains from SpanBERT are in the SQuAD-based QNLI dataset (+1.3%) and in RTE (+6.9%), the latter accounting for most of the rise in SpanBERT’s GLUE average.

2.6. Ablation Study

The effect of replacing BERT’s original masking scheme (Subword Tokens) with different masking schemes.

Geometric spans outperform the other span variants.

The effects of different auxiliary objectives

Single-sequence training typically improves performance. Adding SBO further improves performance, with a substantial gain on coreference resolution (+2.7% F1) over span masking alone.


[2020 TACL] [SpanBERT]
SpanBERT: Improving Pre-training by Representing and Predicting Spans
