Brief Review — SciBERT: A Pretrained Language Model for Scientific Text

SciBERT, Pretraining BERT on Scientific Text

Sik-Ho Tsang
3 min read · Nov 17, 2023
SciBERT (Image from YouTube)

SciBERT: A Pretrained Language Model for Scientific Text
SciBERT, by Allen Institute for Artificial Intelligence
2019 EMNLP-IJCNLP, Over 2500 Citations (Sik-Ho Tsang @ Medium)

Language Model
2007 … 2022 [GLM] [Switch Transformers] [WideNet] [MoEBERT] [X-MoE] 2023 [ERNIE-Code]

Medical NLP/LLM
2017 … 2019 [MedicationQA] [G-BERT] [PubMedQA] 2020 [BioBERT] [BEHRT] 2021 [MedGPT] [Med-BERT] [MedQA] 2023 [Med-PaLM]
==== My Other Paper Readings Are Also Over Here ====

  • SciBERT leverages unsupervised pretraining of BERT on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks.


  1. SciBERT
  2. Results

1. SciBERT

1.1. Vocabulary

  • The original BERT uses BaseVocab, a WordPiece vocabulary built on general-domain text.

SciVocab, a new WordPiece vocabulary, is constructed on the proposed scientific corpus using the SentencePiece library.

  • Both cased and uncased vocabularies are produced, and the vocabulary size is set to 30K to match the size of BaseVocab. The resulting token overlap between BaseVocab and SciVocab is only 42%, illustrating a substantial difference between the words frequently used in scientific and in general-domain text.
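As a toy illustration of how such an overlap figure is computed (hypothetical mini-vocabularies, not the real 30K WordPiece files):

```python
# Toy illustration of vocabulary overlap (hypothetical mini-vocabularies,
# not the actual BaseVocab/SciVocab WordPiece files).
base_vocab = {"the", "cell", "##ing", "network", "train", "##ed"}
sci_vocab  = {"the", "cell", "##ing", "protein", "genome", "##ase"}

shared = base_vocab & sci_vocab
overlap = len(shared) / len(base_vocab)  # both vocabularies have the same size

print(f"{overlap:.0%}")  # → 50%
```

With the real 30K vocabularies the same ratio comes out to 42%.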

1.2. Corpus

SciBERT is trained on a random sample of 1.14M papers from Semantic Scholar.

This corpus consists of 18% papers from the computer science (CS) domain and 82% from the broad biomedical (Bio) domain.

  • The full text of the papers is used, not just the abstracts.
  • The average paper length is 154 sentences (2,769 tokens) resulting in a corpus size of 3.17B tokens, similar to the 3.3B tokens on which BERT was trained. Sentences are split using ScispaCy.
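As a sanity check on these numbers, 1.14M papers at an average of 2,769 tokens each works out to roughly the reported corpus size:

```python
papers = 1_140_000        # random sample from Semantic Scholar
tokens_per_paper = 2_769  # average paper length (154 sentences)

corpus_tokens = papers * tokens_per_paper
print(f"{corpus_tokens / 1e9:.2f}B tokens")  # → 3.16B tokens
```

This matches the reported 3.17B tokens up to rounding of the per-paper average.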

2. Results

2.1. Dataset

  • SciBERT is evaluated on five core downstream NLP tasks:
  1. Named Entity Recognition (NER)
  2. PICO Extraction (PICO)
  3. Text Classification (CLS)
  4. Relation Classification (REL)
  5. Dependency Parsing (DEP)
  • In all settings, dropout of 0.1 and a cross-entropy loss optimized with Adam are applied. Fine-tuning uses 2 to 5 epochs with a batch size of 32.
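These hyperparameters can be collected into a small config sketch (values from the text; the field names are illustrative, not from the paper's code):

```python
# Fine-tuning configuration reported in the paper (field names illustrative)
finetune_config = {
    "dropout": 0.1,
    "loss": "cross_entropy",
    "optimizer": "adam",
    "epochs": (2, 5),   # 2 to 5 epochs depending on the task
    "batch_size": 32,
}
print(finetune_config["batch_size"])  # → 32
```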

2.2. Fine-Tuning Setting

  • For text classification (i.e. CLS and REL), the final BERT vector for the [CLS] token is fed into a linear classification layer.
  • For sequence labeling (i.e. NER and PICO), the final BERT vector for each token is fed into a linear classification layer with softmax output.
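A minimal sketch of such a head in plain Python (toy dimensions and weights; in practice this is a learned linear layer over BERT's 768-d output vector):

```python
import math

def linear_softmax_head(vector, weights, bias):
    """Linear classification layer with softmax output, as applied to the
    final BERT vector of the [CLS] token or of each token (toy dimensions)."""
    logits = [sum(w * x for w, x in zip(row, vector)) + b
              for row, b in zip(weights, bias)]
    # Numerically stable softmax over the logits
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-d "BERT vector", 2 classes (hypothetical weights)
probs = linear_softmax_head([0.5, -1.0, 2.0],
                            weights=[[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]],
                            bias=[0.0, 0.0])
print([round(p, 3) for p in probs])  # → [0.971, 0.029]
```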

2.3. Frozen Setting

  • For text classification, the sequence of BERT vectors for each sentence is fed into a 2-layer BiLSTM of size 200, and a multilayer perceptron (hidden size 200) is applied to the concatenated first and last BiLSTM vectors.
  • For sequence labeling, the same BiLSTM layers are used, with a conditional random field (CRF) on top to guarantee well-formed predictions.
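A PyTorch sketch of the frozen text-classification head (sizes from the text; the class name and toy inputs are illustrative assumptions, and the CRF used for sequence labeling is omitted):

```python
import torch
import torch.nn as nn

class FrozenBiLSTMHead(nn.Module):
    """Frozen-embedding classification head: a 2-layer BiLSTM (hidden size
    200) over the fixed BERT vectors, then an MLP (hidden size 200) on the
    concatenated first and last BiLSTM outputs. Sketch only."""
    def __init__(self, bert_dim=768, hidden=200, num_classes=3):
        super().__init__()
        self.bilstm = nn.LSTM(bert_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden, hidden),  # first + last BiLSTM vectors
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, bert_vectors):        # (batch, seq_len, bert_dim)
        out, _ = self.bilstm(bert_vectors)  # (batch, seq_len, 2 * hidden)
        first_last = torch.cat([out[:, 0], out[:, -1]], dim=-1)
        return self.mlp(first_last)         # (batch, num_classes)

head = FrozenBiLSTMHead()
logits = head(torch.randn(2, 10, 768))  # 2 sentences, 10 tokens each
print(logits.shape)  # → torch.Size([2, 3])
```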

2.4. Results

Performance on Biomedical (Bio), Computer Science (CS), and Multi-Domain Tasks

SciBERT outperforms BERT-Base on scientific tasks (+2.11 F1 with fine-tuning and +2.43 F1 without).

  • SciBERT performs slightly worse than SOTA on 3 datasets.

SciBERT outperforms BioBERT results on BC5CDR and ChemProt, and performs similarly on JNLPBA.


