Brief Review — BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Pretraining BERT With PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC) as BioBERT
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
BioBERT, by Korea University, Naver Corp
2020 Oxford Academics Bioinformatics, Over 4100 Citations (Sik-Ho Tsang @ Medium)
Medical Large Language Model (LLM)
- BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is introduced, which is a domain-specific language representation model pre-trained on large-scale biomedical corpora.
- The overall process of pre-training and fine-tuning BioBERT is shown above.
- First, BioBERT is initialized with weights from BERT, which was pre-trained on general-domain corpora (English Wikipedia and BooksCorpus).
- Then, BioBERT is pre-trained on biomedical domain corpora (PubMed abstracts and PMC full-text articles).
- Finally, BioBERT is fine-tuned and evaluated on three popular biomedical text mining tasks (NER, RE and QA).
BioBERT is the first domain-specific BERT-based model pre-trained on biomedical corpora. Pre-training took 23 days on eight NVIDIA V100 GPUs.
1.2. Pretraining Corpora
Table 1: BioBERT is pre-trained on PubMed abstracts (PubMed) and PubMed Central full-text articles (PMC).
- Table 2: Different combinations of corpora are tried.
- BioBERT v1.0 is trained for 200K, 270K or 470K steps based on combinations of corpora and BioBERT v1.1 is trained for 1M steps.
1.3. Fine-Tuning BioBERT
- Named entity recognition (NER) is one of the most fundamental biomedical text mining tasks, which involves recognizing numerous domain-specific proper nouns in a biomedical corpus. Entity level precision, recall and F1 score are used as metrics.
- Relation extraction (RE) is a task of classifying relations of named entities in a biomedical corpus. The precision, recall and F1 scores on the RE task are reported.
- Question answering (QA) is a task of answering questions posed in natural language given related passages. Strict accuracy, lenient accuracy and mean reciprocal rank (MRR) are reported.
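To make the entity-level metrics concrete, here is a minimal sketch of how precision, recall and F1 can be computed over whole entities rather than individual tokens. It assumes BIO-style tags (for example `B-Disease`, `I-Disease`, `O`); the tag names, the helper names `extract_spans` and `entity_prf`, and the toy sentence are all illustrative assumptions, not the evaluation script used in the paper.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A sentinel "O" at the end closes any entity still open.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1: a predicted entity counts
    as correct only if both its boundaries and its type match exactly."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the prediction truncates the two-token disease mention,
# so only the chemical entity is counted as a match.
gold_tags = ["B-Disease", "I-Disease", "O", "B-Chemical"]
pred_tags = ["B-Disease", "O", "O", "B-Chemical"]
print(entity_prf(gold_tags, pred_tags))
```

The toy example shows why exact boundaries matter: a partially recognized entity earns no credit under entity-level scoring.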
On NER, BioBERT achieves higher scores than BERT on all the datasets and outperforms the state-of-the-art models on 6 out of 9 datasets.
BioBERT v1.1 (+PubMed) outperforms the state-of-the-art models by 0.62 in terms of micro-averaged F1 score.
On RE, BioBERT v1.0 (+PubMed) obtains a micro-averaged F1 score 2.80 higher than the state-of-the-art models. BioBERT also achieves the highest F1 scores on 2 out of 3 biomedical datasets.
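Since the gains above are stated as micro-averaged F1, a short sketch of how micro and macro averaging differ may help. This review does not spell out how the averaging was carried out, so the code below assumes the common definition in which true positive, false positive and false negative counts are pooled across datasets before F1 is computed. The dataset names and counts are made up for illustration.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one dataset
Counts = Tuple[int, int, int]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_dataset: Dict[str, Counts]) -> float:
    """Pool the counts over all datasets, then compute a single F1."""
    tp = sum(c[0] for c in per_dataset.values())
    fp = sum(c[1] for c in per_dataset.values())
    fn = sum(c[2] for c in per_dataset.values())
    return f1_from_counts(tp, fp, fn)


def macro_f1(per_dataset: Dict[str, Counts]) -> float:
    """Compute F1 per dataset, then take the unweighted mean."""
    scores = [f1_from_counts(*c) for c in per_dataset.values()]
    return sum(scores) / len(scores)


# Hypothetical counts: one large, easy dataset and one small, hard one.
counts = {
    "dataset_a": (90, 10, 10),
    "dataset_b": (10, 10, 10),
}
print(f"micro F1 = {micro_f1(counts):.4f}")
print(f"macro F1 = {macro_f1(counts):.4f}")
```

Under this definition, larger datasets weigh more in the micro average, whereas the macro average treats every dataset equally.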
- On QA, all versions of BioBERT significantly outperformed BERT and the state-of-the-art models. In particular, BioBERT v1.1 (+PubMed) obtained a strict accuracy of 38.77, a lenient accuracy of 53.81 and a mean reciprocal rank score of 44.77, all of which were micro-averaged.
- On all the biomedical QA datasets, BioBERT achieved new state-of-the-art performance in terms of MRR.
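The three QA metrics reported above can be sketched as follows. The sketch assumes each question comes with a ranked list of candidate answers, that strict accuracy checks only the top-ranked answer, and that lenient accuracy checks the top `k` candidates (with `k = 5`, as is usual for factoid-style evaluation). The function names, the case-insensitive string matching and the toy questions are illustrative assumptions rather than the official evaluation code.

```python
from typing import List, Set, Tuple


def first_correct_rank(ranked: List[str], gold: Set[str], k: int = 5) -> int:
    """Return the 1-based rank of the first correct answer among the
    top-k candidates, or 0 if none of them is correct."""
    normalized_gold = {g.lower() for g in gold}
    for rank, answer in enumerate(ranked[:k], start=1):
        if answer.lower() in normalized_gold:
            return rank
    return 0


def qa_metrics(
    predictions: List[List[str]], golds: List[Set[str]], k: int = 5
) -> Tuple[float, float, float]:
    """Strict accuracy, lenient accuracy and mean reciprocal rank."""
    ranks = [first_correct_rank(p, g, k) for p, g in zip(predictions, golds)]
    n = len(ranks)
    strict = sum(r == 1 for r in ranks) / n       # correct at rank 1
    lenient = sum(r > 0 for r in ranks) / n       # correct anywhere in top k
    mrr = sum(1.0 / r for r in ranks if r > 0) / n  # misses contribute 0
    return strict, lenient, mrr


# Toy questions: correct at rank 1, correct at rank 3, and never correct.
predictions = [
    ["BRCA1", "TP53"],
    ["aspirin", "ibuprofen", "warfarin"],
    ["insulin", "glucagon"],
]
golds = [{"BRCA1"}, {"Warfarin"}, {"leptin"}]
strict, lenient, mrr = qa_metrics(predictions, golds)
print(f"strict={strict:.3f} lenient={lenient:.3f} MRR={mrr:.3f}")
```

By construction, strict accuracy can never exceed MRR, and MRR can never exceed lenient accuracy, which matches the ordering of the reported scores (38.77, 44.77, 53.81).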
2.4. Ablation Studies
(a): Pre-training on 1 billion words is quite effective, and the performance on each dataset mostly improves until 4.5 billion words.
(b): The performance on each dataset improves as the number of pre-training steps increases.
(c): The absolute performance improvements of BioBERT v1.0 (+PubMed+PMC) over BERT on all 15 datasets.
2.5. Prediction Samples
BioBERT can recognize biomedical named entities that BERT cannot and can find the exact boundaries of named entities.
While BERT often gives incorrect answers to simple biomedical questions, BioBERT provides correct answers to such questions.
Also, BioBERT can provide longer named entities as answers.