Brief Review — COBERT: COVID-19 Question Answering System Using BERT

COBERT, Answers Generated by DistilBERT

Sik-Ho Tsang
4 min read · Jul 15, 2024
COBERT Pipeline

COBERT: COVID-19 Question Answering System Using BERT
COBERT, by Al-Balqa Applied University, Bharati Vidyapeeth’s College of Engineering, SRM Institute of Science and Technology, and Chandigarh University
2021 Arab. J. Sci. Eng., Over 60 Citations (Sik-Ho Tsang @ Medium)

Question Answering (QA)
2016 [SQuAD 1.0/1.1] 2017 [Dynamic Coattention Network (DCN)] 2018 [SQuAD 2.0]
My Healthcare and Medical Related Paper Readings
==== My Other Paper Readings Are Also Over Here ====

  • COBERT is proposed, a retriever-reader dual algorithmic system that answers complex queries by searching a corpus of 59K coronavirus-related articles made accessible through the Coronavirus Open Research Dataset Challenge (CORD-19).
  • The retriever is composed of a TF-IDF vectorizer that captures the top 500 documents with the highest scores.
  • The reader is a Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned on the SQuAD 1.1 dataset.

Outline

  1. COBERT
  2. Results

1. COBERT

1.1. Open-Domain QA vs Closed Domain QA

Open-Domain QA vs Closed Domain QA
  • A closed-domain system deals with questions under one exact domain, e.g., medicine or automotive maintenance.
  • An open-domain system, in contrast, must answer questions on nearly any topic by searching a large, general corpus.

1.2. COBERT Retriever

COBERT Retriever
  • Basically, it converts the documents and the input query into TF-IDF vectors and evaluates the cosine similarity between each document and the query.
  • TF-IDF features are based on unigrams and bigrams. The TF-IDF retriever is built from two terms, TF (Term Frequency) and IDF (Inverse Document Frequency): tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t)), tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t.
  • Further, this schema calculates the cosine similarity between the question sentence and every document of the database: cos(q, d) = (q · d) / (‖q‖ ‖d‖), where q and d are the TF-IDF vectors of the question and the document, respectively.
  • For the closed domain, the 59,000 articles from CORD-19 form the corpus. Since the full text of the paragraphs is used, the data is divided into chunks of 10,000 during pre-processing to avoid system crashes.
  • The top articles are identified based on the cosine similarity between the question string and the abstract text (a minimal retriever sketch is given after this list).
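Below is a minimal sketch of such a TF-IDF retriever using scikit-learn. The corpus placeholder, the retrieve helper, and the top_k default are illustrative assumptions; the paper's implementation details may differ.

```python
# Minimal sketch of a TF-IDF retriever in the spirit of COBERT's first stage.
# Assumption (not from the paper): the documents are already loaded as a list
# of strings named `corpus`, and scikit-learn is used for vectorization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Abstract of CORD-19 article 1 ...",
    "Abstract of CORD-19 article 2 ...",
]  # in practice: ~59K CORD-19 abstracts / paragraph chunks

# Unigram + bigram TF-IDF features, as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
doc_vectors = vectorizer.fit_transform(corpus)

def retrieve(question: str, top_k: int = 500):
    """Return (index, score) pairs of the top_k most similar documents."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors).ravel()
    top = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]

print(retrieve("What is the incubation period of COVID-19?", top_k=2))
```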

1.3. COBERT Reader

COBERT Reader
  • The reader uses the most likely document, as ranked by the retriever. To answer the question, the reader splits the document into paragraphs.

DistilBERT fine-tuned on SQuAD 1.1 is used as the reader. The model has 6 layers, 768 hidden dimensions, and 12 heads, totaling 66M parameters.

  • The top 3 answers are selected based on a weighted score combining the retriever score and the reader score; the retriever score weight is set to 0.35 (a sketch of this stage follows below).
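A sketch of the reader stage is shown below, using the Hugging Face transformers question-answering pipeline with a DistilBERT checkpoint fine-tuned on SQuAD 1.1. The exact way the two scores are mixed here (0.35 retriever vs. 0.65 reader) is an illustrative assumption, not necessarily the paper's formula.

```python
# Sketch of the reader stage, assuming the Hugging Face `transformers`
# question-answering pipeline and a public DistilBERT-SQuAD checkpoint.
from transformers import pipeline

reader = pipeline(
    "question-answering",
    model="distilbert-base-uncased-distilled-squad",
)

RETRIEVER_WEIGHT = 0.35  # retriever score weight reported in the paper

def answer(question: str, paragraph: str, retriever_score: float):
    """Extract an answer span and combine reader/retriever scores.

    The linear combination below is an assumption for illustration only.
    """
    result = reader(question=question, context=paragraph)
    combined = (RETRIEVER_WEIGHT * retriever_score
                + (1.0 - RETRIEVER_WEIGHT) * result["score"])
    return result["answer"], combined

ans, score = answer(
    "What is the incubation period of COVID-19?",
    "Studies estimate the incubation period of COVID-19 to be around 5 days.",
    retriever_score=0.8,
)
print(ans, score)
```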

2. Results

2.1. Qualitative Results

  • The system outputs not only the answer but also the title of the document/article and the paragraph in which the answer was found.
  • For every query, the model outputs 3 answers. For the sake of simplicity, only a single answer per query is presented.

2.2. Quantitative Results

DistilBERT achieves 80.1% EM and 87.5% F1 while being much faster and lighter. After fitting the pipeline on the CORD-19 corpus, the model achieves 81.6% EM and an 87.3% F1-score.
