Review — SpanBERT: Improving Pre-training by Representing and Predicting Spans

SpanBERT, Mask a Span of Tokens for Pretraining

4 min read · Nov 19, 2022

SpanBERT: Improving Pre-training by Representing and Predicting Spans
SpanBERT, by University of Washington, Princeton University, Allen Institute for Artificial Intelligence, and Facebook AI Research
2020 TACL, Over 1100 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, LM, BERT

SpanBERT extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span.

Outline

1. SpanBERT
2. Results

1. SpanBERT

**SpanBERT training.** The **span** "an American football game" is **masked**. The **SBO** uses the output representations of the boundary tokens, **x4 and x9 (in blue), to predict each token in the masked span**. The equation shows the **MLM and SBO loss** terms for predicting the token "football" (in pink), which, as marked by the position embedding p3, is the third token from x4.

1.1. Span Masking

A span length (number of words) is first sampled from a geometric distribution l ∼ Geo(p), which is skewed towards shorter spans.

With p=0.2 and l clipped at a maximum of 10, the mean span length is 3.8. The above figure shows the distribution of span mask lengths.
As in BERT, SpanBERT masks 15% of the tokens in total, replacing 80% of the masked tokens with [MASK], 10% with random tokens, and 10% with the original tokens. This replacement is applied at the span level, so all the tokens in a span are replaced with [MASK] or with sampled tokens.
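
The span masking procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that "clipping" means truncating Geo(p) at 10 and renormalising (the reading that reproduces the 3.8 mean), that tokens are whole words, and that the 80/10/10 choice is drawn once per span. The names `span_length_probs`, `mask_spans`, and the `vocab` argument are hypothetical.

```python
import random

def span_length_probs(p=0.2, max_len=10):
    """Geo(p) truncated at max_len and renormalised so the probabilities sum to 1."""
    weights = [p * (1 - p) ** (length - 1) for length in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def mask_spans(tokens, vocab, mask_ratio=0.15, p=0.2, max_len=10, seed=0):
    """Return (corrupted_tokens, spans); spans are inclusive (start, end) pairs."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    lengths = list(range(1, max_len + 1))
    probs = span_length_probs(p, max_len)
    budget = max(1, round(mask_ratio * len(tokens)))  # about 15% of the tokens
    covered, spans = set(), []
    while len(covered) < budget:
        length = rng.choices(lengths, weights=probs)[0]
        length = min(length, budget - len(covered))   # do not exceed the budget
        start = rng.randrange(len(tokens) - length + 1)
        positions = range(start, start + length)
        if covered.intersection(positions):
            continue                                  # overlaps an earlier span; draw again
        covered.update(positions)
        spans.append((start, start + length - 1))
    corrupted = list(tokens)
    for start, end in spans:
        roll = rng.random()                           # one decision per span (assumption)
        for i in range(start, end + 1):
            if roll < 0.8:
                corrupted[i] = "[MASK]"
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)      # random replacement token
            # otherwise keep the original token
    return corrupted, sorted(spans)

# Mean span length under the truncated geometric distribution.
mean_length = sum(l * q for l, q in zip(range(1, 11), span_length_probs()))
print(round(mean_length, 1))  # 3.8

tokens = [f"w{i}" for i in range(100)]
corrupted, spans = mask_spans(tokens, vocab=["cat", "dog", "tree"])
print(spans)
```

Simply capping samples with min(l, 10) instead of renormalising would give a mean of about 4.5, which is why the sketch uses the truncated form.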

1.2. Span Boundary Objective (SBO)

A span boundary objective (SBO) is introduced that involves predicting each token of a masked span using only the representations of the observed tokens at the boundaries.
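Given a masked span of tokens (x_s, …, x_e) ∈ Y, where (s, e) indicates its start and end positions, each token x_i in the span is predicted using only the external boundary tokens x_{s−1} and x_{e+1}, together with the position embedding of the target token, p_{i−s+1}:

$$y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1})$$

where f is a 2-layer feed-forward network with GELU activations and layer normalization:

$$h_0 = [x_{s-1}; x_{e+1}; p_{i-s+1}]$$
$$h_1 = \mathrm{LayerNorm}(\mathrm{GeLU}(W_1 h_0))$$
$$y_i = \mathrm{LayerNorm}(\mathrm{GeLU}(W_2 h_1))$$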

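The vector representation y_i is then used to predict the token x_i with a cross-entropy loss, exactly as in the MLM objective.
SpanBERT sums the losses from the span boundary objective and the regular masked language model objective for each token x_i in the masked span:

$$\mathcal{L}(x_i) = \mathcal{L}_{\mathrm{MLM}}(x_i) + \mathcal{L}_{\mathrm{SBO}}(x_i) = -\log P(x_i \mid \mathbf{x}_i) - \log P(x_i \mid \mathbf{y}_i)$$

Here the bold x_i is the encoder output at position i and the bold y_i is the SBO representation of the same position.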
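
To make the SBO concrete, here is a small pure-Python sketch of the boundary head and the summed loss for one masked token. It is an illustration under stated assumptions, not the paper's code: the weights are random, bias terms and the learned LayerNorm gain and bias are omitted, and a single hypothetical output matrix `E` scores the vocabulary for both the MLM and the SBO terms. All function and variable names are made up for this example.

```python
import math
import random

def gelu(v):
    """Exact GELU applied element-wise."""
    return [0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))) for x in v]

def layer_norm(v, eps=1e-5):
    """Layer normalization without learned gain and bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sbo_representation(x_left, x_right, p_rel, W1, W2):
    """y_i from the two boundary vectors and the relative position embedding."""
    h0 = x_left + x_right + p_rel              # concatenation of three d-dim lists
    h1 = layer_norm(gelu(matvec(W1, h0)))
    return layer_norm(gelu(matvec(W2, h1)))

def token_nll(rep, E, target_id):
    """Cross-entropy of the target token under softmax(E @ rep)."""
    logits = matvec(E, rep)
    top = max(logits)
    log_z = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_z - logits[target_id]

# Toy dimensions and random parameters (hypothetical values).
rng = random.Random(0)
d, vocab_size = 4, 6

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

W1, W2, E = rand_matrix(d, 3 * d), rand_matrix(d, d), rand_matrix(vocab_size, d)
x_left, x_right, p_rel, x_i = (rand_matrix(1, d)[0] for _ in range(4))
target_id = 2                                   # vocabulary id of the masked token

y_i = sbo_representation(x_left, x_right, p_rel, W1, W2)
loss = token_nll(x_i, E, target_id) + token_nll(y_i, E, target_id)  # MLM + SBO
print(len(y_i), loss)
```

Note that `y_i` never sees the encoder output at the masked position itself, only the two boundary vectors and the relative position, which is what pushes span information into the boundary representations.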

1.3. Single-Sequence Training

BERT's NSP objective requires two segments per training example. SpanBERT drops NSP and instead samples a single contiguous segment of up to n=512 tokens, rather than two half-segments that together sum to n tokens. Each example therefore provides a longer contiguous context than in BERT, for the same n.
It is conjectured that single-sequence training is superior to bi-sequence training with NSP because (a) the model benefits from longer full-length contexts, or (b) conditioning on context from another document, which is often unrelated, adds noise to the masked language model.
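
The difference in context length can be shown with a toy sampler. This is only a sketch: the function names are hypothetical, special tokens such as [CLS] and [SEP] are ignored, and the bi-sequence case is simplified to two equal halves.

```python
import random

def sample_single_segment(doc_tokens, max_len=512, seed=0):
    """One contiguous segment of up to max_len tokens from a single document."""
    rng = random.Random(seed)
    if len(doc_tokens) <= max_len:
        return list(doc_tokens)
    start = rng.randrange(len(doc_tokens) - max_len + 1)
    return list(doc_tokens[start:start + max_len])

def sample_two_half_segments(doc_a, doc_b, max_len=512, seed=0):
    """Bi-sequence style pair: two segments that together fit in max_len tokens."""
    half = max_len // 2   # equal halves are a simplification of this sketch
    return (sample_single_segment(doc_a, half, seed),
            sample_single_segment(doc_b, half, seed + 1))

doc = [f"tok{i}" for i in range(2000)]
other = [f"other{i}" for i in range(2000)]

single = sample_single_segment(doc)
seg_a, seg_b = sample_two_half_segments(doc, other)
print(len(single), len(seg_a), len(seg_b))  # 512 256 256
```

With the same budget of 512 tokens, the single-sequence example covers 512 contiguous tokens of one document, while each half of the pair covers only 256.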

2. Results

2.1. SQuAD

**Test results on SQuAD 1.1 and SQuAD 2.0.**

On SQuAD 1.1 and SQuAD 2.0, SpanBERT exceeds the authors' BERT baseline by 2.0% and 2.8% F1, respectively (3.3% and 5.4% over Google BERT).

2.2. QA

**Performance (F1) on the five MRQA extractive question answering tasks.**

This trend goes beyond SQuAD and is consistent across every MRQA dataset. On average, SpanBERT improves F1 by 2.9% over the authors' reimplementation of BERT.

2.3. Coreference Resolution

**Performance on the OntoNotes coreference resolution benchmark**

SpanBERT improves considerably over the BERT baselines, achieving a new state of the art of 79.6% F1 (the previous best result was 73.0%).

2.4. Relation Extraction

**Test performance on the TACRED relation extraction benchmark**

SpanBERT exceeds the authors' reimplementation of BERT by 3.3% F1 and comes close to the state of the art at the time.

2.5. GLUE

**Test set performance on** **GLUE** **tasks**

The main gains from SpanBERT are in the SQuAD-based QNLI dataset (+1.3%) and in RTE (+6.9%), the latter accounting for most of the rise in SpanBERT’s GLUE average.

2.6. Ablation Study

**The effect of replacing** **BERT’s original masking scheme (Subword Tokens) with different masking schemes.**

Masking geometric spans outperforms the other span variants.

**The effects of different auxiliary objectives**

Single-sequence training typically improves performance. Adding SBO further improves performance, with a substantial gain on coreference resolution (+2.7% F1) over span masking alone.

Reference

[2020 TACL] [SpanBERT]
SpanBERT: Improving Pre-training by Representing and Predicting Spans

2.1. Language Model / Sequence Model

(Some are not related to NLP, but I just group them here)

1991 … 2020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT] [BART] [Longformer] [ELECTRA] [Megatron-LM] [SpanBERT]

Written by Sik-Ho Tsang
