Review — SpanBERT: Improving Pre-training by Representing and Predicting Spans
SpanBERT, Masking Spans of Tokens for Pre-training
SpanBERT: Improving Pre-training by Representing and Predicting Spans
SpanBERT, by University of Washington, Princeton University, Allen Institute for Artificial Intelligence, and Facebook AI Research
2020 TACL, Over 1100 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Language Model, LM, BERT
- SpanBERT extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span.
Outline
- SpanBERT
- Results
1. SpanBERT
1.1. Span Masking
- A span length (number of words) is first sampled from a geometric distribution l ∼ Geo(p), which is skewed towards shorter spans.
- p = 0.2 is used and l is clipped at l_max = 10, which yields a mean span length of mean(l) = 3.8. The figure above shows the distribution of masked span lengths.
- Similar to BERT, SpanBERT masks 15% of the tokens in total: 80% of the masked tokens are replaced with [MASK], 10% with random tokens, and 10% are kept as the original tokens. This replacement is applied at the span level, so all tokens in a span are treated the same way (see the sketch below).
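To make the procedure concrete, below is a minimal Python sketch of this masking scheme under the settings above (p = 0.2, span length clipped at 10, 15% masking budget, span-level 80/10/10 replacement). The function names and the mask_id / vocab_size arguments are illustrative assumptions, not the authors' code.

```python
import random

def sample_span_length(p=0.2, max_len=10):
    """Sample a span length l ~ Geo(p), clipped at max_len (mean length ≈ 3.8)."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def span_mask(tokens, mask_id, vocab_size, mask_ratio=0.15, max_tries=1000):
    """Mask ~mask_ratio of the tokens in contiguous spans (illustrative sketch)."""
    tokens = list(tokens)
    budget = int(round(mask_ratio * len(tokens)))
    masked = set()
    tries = 0
    while len(masked) < budget and tries < max_tries:
        tries += 1
        length = sample_span_length()
        start = random.randrange(0, max(1, len(tokens) - length + 1))
        span = range(start, min(start + length, len(tokens)))
        if any(i in masked for i in span):
            continue  # keep masked spans disjoint
        masked.update(span)
        # Span-level replacement: every token in the span gets the same treatment.
        r = random.random()
        for i in span:
            if r < 0.8:
                tokens[i] = mask_id                       # 80%: [MASK]
            elif r < 0.9:
                tokens[i] = random.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
    return tokens, sorted(masked)
```

For a 512-token segment, this masks roughly 77 tokens (15%) in spans of about 3.8 tokens on average.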
1.2. Span Boundary Objective (SBO)
- A span boundary objective (SBO) is introduced that involves predicting each token of a masked span using only the representations of the observed tokens at the boundaries.
- Given a masked span of tokens (x_s, ..., x_e) ∈ Y, where (s, e) indicates its start and end positions, each token x_i in the span is predicted using only the external boundary tokens x_{s−1} and x_{e+1}, together with the position embedding p_{i−s+1} of the target token:
y_i = f(x_{s−1}, x_{e+1}, p_{i−s+1})
- where f(·) is a 2-layer feed-forward network with GELU activations and layer normalization:
h_0 = [x_{s−1}; x_{e+1}; p_{i−s+1}]
h_1 = LayerNorm(GELU(W_1 h_0))
y_i = LayerNorm(GELU(W_2 h_1))
- Cross-entropy loss is used, which is exactly like the MLM objective.
- SpanBERT sums the loss from both the span boundary and the regular masked language model objectives for each token x_i in the masked span (a sketch of the SBO head is given below):
L(x_i) = L_MLM(x_i) + L_SBO(x_i) = −log P(x_i | h_i) − log P(x_i | y_i), where h_i is the transformer output at position i.
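As a concrete illustration, a minimal PyTorch sketch of such an SBO head is shown below, following the description above: the two boundary representations and a relative position embedding are concatenated and passed through a 2-layer feed-forward network with GELU and LayerNorm. The module names and the separate vocabulary decoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SBOHead(nn.Module):
    """Illustrative span boundary objective head (not the official SpanBERT code)."""

    def __init__(self, hidden_size, vocab_size, max_span_len=10):
        super().__init__()
        # Position embedding p_{i-s+1} of the target token relative to the span start.
        self.pos_emb = nn.Embedding(max_span_len, hidden_size)
        # f(.): 2-layer feed-forward network with GELU activations and LayerNorm.
        self.ffn = nn.Sequential(
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        # Projects y_i to vocabulary scores so x_i can be predicted with cross-entropy.
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, x_left, x_right, rel_pos):
        """x_left / x_right: encodings of the boundary tokens x_{s-1} and x_{e+1};
        rel_pos: zero-based position of the target token within the span (i - s)."""
        h0 = torch.cat([x_left, x_right, self.pos_emb(rel_pos)], dim=-1)
        y = self.ffn(h0)
        return self.decoder(y)  # logits over the vocabulary for the masked token x_i
```

During pre-training, the cross-entropy computed from these logits is added to the regular MLM loss for each masked token, as in the summed objective above.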
1.3. Single-Sequence Training
- Instead of BERT's next sentence prediction (NSP), which pre-trains on two concatenated segments, SpanBERT simply samples a single contiguous segment of up to n = 512 tokens (a sketch of this sampling follows this list). The segment can be much longer than in BERT, where the 512-token budget is split across two segments.
- It is conjectured that single-sequence training is superior to bi-sequence training with NSP because (a) the model benefits from longer, full-length contexts, and (b) conditioning on often-unrelated context from another document adds noise to the masked language model.
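For illustration, a minimal sketch of single-sequence sampling could look as follows; the helper name and the simple chunking strategy are assumptions, not the paper's actual data pipeline.

```python
def single_sequence_segments(document_tokens, max_len=512):
    """Yield contiguous segments of up to max_len tokens from one document,
    instead of the two (possibly cross-document) segments used for BERT's NSP."""
    for start in range(0, len(document_tokens), max_len):
        yield document_tokens[start:start + max_len]
```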
2. Results
2.1. SQuAD
On SQuAD 1.1 and 2.0, SpanBERT exceeds the reimplemented BERT baseline by 2.0% and 2.8% F1 respectively (3.3% and 5.4% F1 over Google's BERT).
2.2. QA
This trend goes beyond SQuAD and is consistent across every MRQA dataset, with an average improvement of 2.9% F1 over the reimplemented BERT.
2.3. Coreference Resolution
SpanBERT improves considerably over the BERT baselines, achieving a new state of the art of 79.6% F1 on the OntoNotes coreference benchmark (previous best result: 73.0%).
2.4. Relation Extraction
On TACRED, SpanBERT exceeds the reimplemented BERT by 3.3% F1 and comes close to the then-current state of the art.
2.5. GLUE
The main gains from SpanBERT are in the SQuAD-based QNLI dataset (+1.3%) and in RTE (+6.9%), the latter accounting for most of the rise in SpanBERT’s GLUE average.
2.6. Ablation Study
Geometric (random) span masking outperforms the other masking schemes (subword tokens, whole words, named entities, noun phrases).
Single-sequence training typically improves performance, and adding SBO improves it further, with a substantial gain on coreference resolution (+2.7% F1) over span masking alone.
Reference
[2020 TACL] [SpanBERT]
SpanBERT: Improving Pre-training by Representing and Predicting Spans