Review — LinkBERT: Pretraining Language Models with Document Links

LinkBERT & BioLinkBERT Are Proposed

Sik-Ho Tsang
Jan 16, 2024
Document links (e.g. hyperlinks) can provide salient multi-hop knowledge.

LinkBERT: Pretraining Language Models with Document Links, by Stanford University
2022 ACL, Over 160 Citations (Sik-Ho Tsang @ Medium)


  • LinkBERT is proposed: an LM pretraining method that incorporates document links, such as hyperlinks.
  • The pretraining corpus is viewed as a graph of documents, and LM inputs are created by placing linked documents in the same context.
  • A biomedical variant, BioLinkBERT, is also proposed for BioNLP tasks.


  1. LinkBERT
  2. Biomedical LinkBERT (BioLinkBERT)
  3. LinkBERT Results
  4. BioLinkBERT Results

1. LinkBERT

LinkBERT Overview

1.1. Graph & Directed Edges

  • Instead of viewing the pretraining corpus as a set of documents X={X(i)}, LinkBERT views it as a graph of documents, G=(X, E), where E={(X(i), X(j))} denotes links between documents.

A directed edge (X(i), X(j)) is created if there is a hyperlink from document X(i) to document X(j).

  • Alternatively, edges can be built from lexical similarity: for each document X(i), the common TF-IDF cosine similarity metric (Chen et al., 2017; Yasunaga et al., 2017) is used to obtain the top-k most similar documents X(j) and make edges (X(i), X(j)). In this paper, k=5.
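The lexical-similarity option can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`tfidf_vectors`, `cosine`, `tfidf_edges`), the whitespace tokenizer, the smoothed IDF formula, and the choice to drop zero-similarity neighbours are all assumptions made for the example. Hyperlink edges need no computation; they are read directly from the corpus as (source, target) pairs.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption; any standard TF-IDF variant would do).
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_edges(docs, k=5):
    """Directed edges (i, j) from each document i to its top-k most similar documents j."""
    vecs = tfidf_vectors(docs)
    edges = set()
    for i, u in enumerate(vecs):
        scored = [(cosine(u, v), j) for j, v in enumerate(vecs) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
        # Skipping zero-similarity neighbours is a choice made for this sketch.
        edges.update((i, j) for score, j in scored[:k] if score > 0.0)
    return edges
```

On a real corpus the pairwise loop is quadratic in the number of documents, so an approximate nearest-neighbour index would be needed in practice.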

1.2. Pretraining Inputs & Tasks

Language Model (LM) inputs are created by placing linked documents in the same context window, in addition to the existing options of a single document or random documents.

  • Specifically, an anchor text segment is first sampled from the corpus (Segment A; XA ⊆ X(i)).
  • For the next segment (Segment B; XB), LinkBERT either (1) uses the contiguous segment from the same document (XB ⊆ X(i)), (2) samples a segment from a random document (XB ⊆ X(j) where j ≠ i), or (3) samples a segment from one of the documents linked from Segment A (XB ⊆ X(j) where (X(i), X(j)) ∈ E).
  • The two segments are then joined via special tokens to form an input instance: [CLS] XA [SEP] XB [SEP].
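The three options above can be sketched as a small sampling routine. This is an illustrative sketch only: `make_instance`, the `segments` and `edges` data layout, and picking among the available options uniformly at random are assumptions for the example, and it assumes every document has at least two segments and the corpus has at least two documents.

```python
import random


def make_instance(segments, edges, i, rng):
    """Build one '[CLS] XA [SEP] XB [SEP]' input anchored in document i.

    segments: list of documents, each a list of text segments (hypothetical layout).
    edges:    dict mapping a document index to the indices of documents it links to.
    Returns (input_text, relation), where relation names the option that produced XB.
    """
    doc = segments[i]
    a_idx = rng.randrange(len(doc) - 1)  # leave room for a contiguous next segment
    seg_a = doc[a_idx]

    options = ["contiguous", "random"]
    if edges.get(i):  # the "linked" option needs at least one outgoing edge
        options.append("linked")
    relation = rng.choice(options)  # uniform choice is an assumption of this sketch

    if relation == "contiguous":
        seg_b = doc[a_idx + 1]
    elif relation == "random":
        j = rng.choice([d for d in range(len(segments)) if d != i])
        seg_b = rng.choice(segments[j])
    else:  # "linked"
        j = rng.choice(edges[i])
        seg_b = rng.choice(segments[j])

    return f"[CLS] {seg_a} [SEP] {seg_b} [SEP]", relation
```

A call such as `make_instance(segments, edges, 0, random.Random(0))` returns the joined text together with the option that was used, which is convenient to keep as a label for each instance.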

Besides the Masked Language Modeling (MLM) objective used in BERT, a Document Relation Prediction (DRP) objective is also proposed, which classifies the relation r of segment XB to segment XA as contiguous, random, or linked.

By distinguishing linked from contiguous and random, DRP encourages the LM to learn the relevance and existence of bridging concepts between documents.

  • To predict r, the representation of the [CLS] token is used, as in Next Sentence Prediction (NSP) in BERT. Taken together, the training objective optimizes the sum of the MLM and DRP losses.
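Written out, the joint objective can be sketched as follows. The symbols x_i (the masked tokens) and h_i (the LM's output representation at position i) are notation introduced here for the sketch, and the two losses are assumed to be summed with equal weight:

```latex
\mathcal{L}
  = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{DRP}}
  = -\sum_{i} \log p\left(x_i \mid h_i\right) \;-\; \log p\left(r \mid h_{\texttt{[CLS]}}\right)
```

The first term is the usual masked-token cross-entropy, and the second is a three-way classification loss over r ∈ {contiguous, random, linked}.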

1.3. Strategy to Obtain Linked Documents

  • Three key axes are considered when obtaining useful linked documents: relevance, salience, and diversity.

1.3.1. Relevance

Relevance can be achieved by using hyperlinks or lexical similarity metrics, and both methods yield substantially better performance than using random links.

1.3.2. Salience

  • Besides relevance, another factor to consider (salience) is whether the linked document can offer new, useful knowledge that may not be obvious to the current LM.

Again, it is found that using hyperlinks yields a more performant LM.

1.3.3. Diversity

  • In the document graph, some documents may have a very high in-degree (e.g., many incoming hyperlinks, like the “United States” page of Wikipedia), and others a low in-degree. Documents of high in-degree can be sampled too often in the overall training data, losing diversity.

LinkBERT samples a linked document with probability inversely proportional to its in-degree, which yields better LM performance.
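The inverse in-degree sampling can be illustrated with the standard library alone. This is a hedged sketch rather than the paper's code: `in_degrees` and `sample_linked` are hypothetical helper names, and edges are assumed to be stored as a collection of (source, target) index pairs.

```python
import random
from collections import Counter


def in_degrees(edges):
    """Count incoming edges for every target document."""
    return Counter(j for _, j in edges)


def sample_linked(i, edges, rng):
    """Sample a document linked from i with probability proportional to 1 / in-degree."""
    indeg = in_degrees(edges)
    targets = [j for src, j in sorted(edges) if src == i]
    weights = [1.0 / indeg[j] for j in targets]
    return rng.choices(targets, weights=weights, k=1)[0]
```

For example, if document 0 links to a hub with in-degree 3 and to a page with in-degree 1, the weights are 1/3 and 1, so the hub is picked about 25% of the time instead of 50% under uniform sampling.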

2. Biomedical LinkBERT (BioLinkBERT)

  • Biomedical LMs are typically trained on PubMed, which contains abstracts and citations of biomedical papers.

LinkBERT is pretrained on PubMed with citation links; the resulting model is termed BioLinkBERT.

  • BioLinkBERT is pretrained from scratch in -base (110M params) and -large (340M params) sizes.

3. LinkBERT Results

3.1. LinkBERT Results

  • LinkBERT is pretrained in 3 sizes, -tiny, -base and -large, following the configurations of BERTtiny (4.4M params), BERTbase (110M params), and BERTlarge (340M params).

Table 1: On MRQA, LinkBERT substantially outperforms BERT on all datasets.
Table 2: On GLUE, LinkBERT performs moderately better than BERT.

Multi-Hop Reasoning on HotpotQA
  • LinkBERT correctly predicts the answer in the second document (“Montreal”).

The intuition is that because LinkBERT is pretrained with pairs of linked documents rather than purely single documents, it better learns how to flow information (e.g., do attention) across tokens when multiple related documents are given in the context.

SQuAD & Few-Shot QA Performance

Table 3: The SQuAD dataset is modified such that 1–2 distracting documents are prepended or appended to the original document given to each question. While BERT incurs a large performance drop (-2.8%), LinkBERT is robust to distracting documents (-0.5%). The intuition is that the DRP objective helps the LM better recognize document relations.

Table 4: In the few-shot setting, LinkBERT attains larger gains over BERT than in the full-data setting. This result suggests that LinkBERT internalizes more knowledge than BERT during pretraining.

3.2. LinkBERT Ablation Studies


Table 5: Using hyperlinks performs better than using lexical similarity links. The intuition is that hyperlinks can provide more salient knowledge that may not be obvious via lexical similarity alone. Nevertheless, using lexical similarity links is still substantially better than BERT (+2.3%).

Table 6: Removing DRP from pretraining hurts downstream QA performance. DRP helps LMs learn document relations.

4. BioLinkBERT Results

BLURB, MedQA-USMLE, MMLU-Professional Medicine

Table 7: BioLinkBERTbase outperforms PubMedBERTbase on all task categories, attaining a performance boost of +2% absolute on average. Moreover, BioLinkBERTlarge provides a further boost of +1%. In total, BioLinkBERT outperforms the previous best by +3% absolute, establishing new state-of-the-art on the BLURB leaderboard.

Table 8: BioLinkBERTbase obtains a 2% accuracy boost over PubMedBERTbase, and BioLinkBERTlarge provides an additional +5% boost. In total, BioLinkBERT outperforms the previous best by +7% absolute, attaining new state-of-the-art.

Table 9: BioLinkBERTlarge achieves 50% accuracy on the MMLU-Professional Medicine QA task, significantly outperforming much larger general-domain LMs and QA models such as GPT-3 with 175B params (39% accuracy) and UnifiedQA with 11B params (43% accuracy).

Multi-hop reasoning on MedQA-USMLE
  • Existing PubMedBERT tends to simply predict a choice that contains a word appearing in the question (“blood” for choice D).

BioLinkBERT correctly predicts the answer (B). The intuition is that citation links bring relevant documents and concepts together in the same context in pretraining.


