Brief Review — LiveQA: Overview of the Medical Question Answering Task at TREC 2017 LiveQA

LiveQA Medical QA Dataset

Sik-Ho Tsang
Oct 18, 2023
LiveQA Slides (From Author’s ResearchGate)

Overview of the Medical Question Answering Task at TREC 2017 LiveQA, by U.S. National Library of Medicine, Emory University, and Georgia Institute of Technology
2017 TREC, Over 40 Citations (Sik-Ho Tsang @ Medium)

Medical LLM
2020 [BioBERT] [BEHRT] 2021 [MedGPT] 2023 [Med-PaLM]

  • LiveQA is a medical question answering task organized at the TREC 2017 LiveQA track, to address the automatic answering of consumer health questions received by the U.S. National Library of Medicine (NLM).
  • Both training question-answer pairs, and test questions with reference answers, are provided. All questions were manually annotated with the main entities (foci) and question types.
  • (Med-PaLM is evaluated on this dataset.)


  1. LiveQA Dataset
  2. Results

1. LiveQA Dataset

1.1. Task Descriptions

The medical QA task was introduced in 2017 based on questions received by the U.S. National Library of Medicine (NLM).

The medical task at TREC 2017 LiveQA was organized in the scope of the Consumer Health Question Answering (CHQA) project, which addresses the classification of customers’ requests and the automatic answering of Consumer Health Questions (CHQs).

  • The question below presents a concrete example of a CHQ looking for treatments of “retinitis pigmentosa”:
1 CHQ Example
  • Two more examples of such questions are presented below. The first CHQ asks about a Problem (“abetalipoproteimemia”) and includes more than one subquestion (Diagnosis and Management). The second CHQ includes one subquestion asking about the Ingredients of a Drug (Kapvay).
2 More CHQ Examples

Consumer health questions may contain multiple foci and question types.

Users can also describe general and background information such as their medical history before asking their questions, which increases the number of potentially irrelevant medical entities mentioned in the question.
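To make the annotation scheme concrete, here is a minimal sketch of how a CHQ with its subquestions, foci and question types could be represented. The class and field names are hypothetical (the dataset ships its own annotation format, which is not reproduced here), and the `summary` strings are paraphrases of the two examples above, not the original question text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubQuestion:
    """One (focus, question type) pair inside a consumer health question."""
    focus: str            # main entity the subquestion is about
    focus_category: str   # category label of the focus, e.g. "Problem" or "Drug"
    question_type: str    # e.g. "Diagnosis", "Management", "Ingredients"


@dataclass
class ConsumerHealthQuestion:
    """Hypothetical container for one annotated CHQ."""
    summary: str  # paraphrased description, NOT the raw NLM question text
    subquestions: List[SubQuestion] = field(default_factory=list)

    def foci(self) -> List[str]:
        # Distinct foci mentioned across all subquestions, sorted for stability.
        return sorted({sq.focus for sq in self.subquestions})

    def question_types(self) -> List[str]:
        # Question types in annotation order (one per subquestion).
        return [sq.question_type for sq in self.subquestions]


# First example: one focus, two subquestions (spelling kept as in the review).
chq_problem = ConsumerHealthQuestion(
    summary="Asks about diagnosis and management of abetalipoproteimemia",
    subquestions=[
        SubQuestion("abetalipoproteimemia", "Problem", "Diagnosis"),
        SubQuestion("abetalipoproteimemia", "Problem", "Management"),
    ],
)

# Second example: one subquestion about the ingredients of a drug.
chq_drug = ConsumerHealthQuestion(
    summary="Asks about the ingredients of Kapvay",
    subquestions=[SubQuestion("Kapvay", "Drug", "Ingredients")],
)

if __name__ == "__main__":
    for chq in (chq_problem, chq_drug):
        print(chq.foci(), chq.question_types())
```

Keeping the focus and the question type on each subquestion, rather than on the whole question, mirrors the point that a single CHQ may contain several foci and several question types.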

1.2. Training Datasets

Example of the first training dataset
Foci are highlighted in blue, question types and their triggers in red and keywords in green
  • Two training sets with 634 pairs of medical questions and answers are provided.
  • Additional annotations are provided for the Question Focus and the Question Type used to define each subquestion.
  • Training questions cover 4 categories of foci (Disease, Drug, Treatment and Exam) and 23 question types (e.g. Treatment, Cause, Indication, Dosage).

The first training dataset consists of 388 (sub)question-answer pairs corresponding to 200 NLM questions. QA pairs were constructed from FAQs on trusted websites of the U.S. National Institutes of Health (NIH).

The second training dataset consists of 246 question-answer pairs corresponding to 246 NLM questions. Answers were retrieved manually by librarians using PubMed and web search engines.

1.3. Test Dataset

The test set consists of 104 NLM questions. The subquestion, focus and type annotations were not provided to the participants.

The test set covers a wide range of question types (26) and has a slightly different distribution than the training questions, in order to evaluate the scalability of the proposed systems.

  • Some statistics of the test dataset are shown below.
Question Types in Test Dataset
Foci in Test Dataset
Categories in Test Dataset

2. Results

2.1. Metrics Used in Main TREC LiveQA Challenge

  • avgScore [0–3 range]: the average score over all questions, transferring 1–4 level grades to 0–3 scores. This is the main score used to rank LiveQA runs.
  • succ@i+: the number of questions with score i or above (i ∈ {2, 3, 4}) divided by the total number of questions.
  • prec@i+: the number of questions with score i or above (i ∈ {2, 3, 4}) divided by the number of questions answered by the system.
Average Score, Success & Precision
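As a rough illustration of the three metrics, the sketch below computes them from a list of per-question judge grades. The function name `liveqa_metrics` and the input format are hypothetical; it assumes thresholds i = 2, 3, 4 on the 1–4 grade scale, and it assumes that an unanswered question (represented as `None`) contributes a score of 0 to avgScore. It is not the official TREC evaluation script.

```python
from typing import Dict, Optional, Sequence


def liveqa_metrics(grades: Sequence[Optional[int]]) -> Dict[str, float]:
    """Compute avgScore, succ@i+ and prec@i+ from per-question grades.

    grades: one entry per test question; an integer grade in 1..4,
            or None when the system returned no answer (assumption).
    """
    total = len(grades)
    if total == 0:
        raise ValueError("at least one question is required")

    answered = [g for g in grades if g is not None]

    # Transfer 1-4 grades to 0-3 scores and average over ALL questions.
    # Unanswered questions add nothing to the sum, i.e. they score 0 (assumption).
    metrics = {"avgScore": sum(g - 1 for g in answered) / total}

    for i in (2, 3, 4):
        hits = sum(1 for g in answered if g >= i)
        # succ@i+ divides by every question in the test set.
        metrics[f"succ@{i}+"] = hits / total
        # prec@i+ divides only by the questions the system answered.
        metrics[f"prec@{i}+"] = hits / len(answered) if answered else 0.0

    return metrics


if __name__ == "__main__":
    # Toy grades for five questions; the third one was left unanswered.
    for name, value in liveqa_metrics([4, 2, None, 1, 3]).items():
        print(f"{name}: {value:.3f}")
```

With the toy input, succ@2+ is 3/5 while prec@2+ is 3/4, which shows how the two measures diverge as soon as a system skips questions.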

Table 2: CMU-OAQA [14] achieved the best Average Score of 0.637. They used an attentional encoder-decoder model for paraphrase identification and answer ranking. The Quora question-similarity dataset was used for training.

  • The PRNA system [6] achieved the second-best performance in the medical task with 0.49 avgScore (prna-r1). They used Wikipedia as the first answer source and Yahoo and Google searches as secondary answer sources. To extract the answer from the selected text passage, a bidirectional attention model trained on the SQuAD dataset was used.
  • The ECNU-ICA team [3] used another technique: learning question similarity via two long short-term memory (LSTM) networks that produce semantic representations of the questions.
  • The CMU-LiveMedQA team [16] obtained an avgScore of 0.353. They used a convolutional neural network (CNN) model to classify a question into a restricted set of 10 question types and crawled “relevant” online web pages to find the answers.

There is currently a gap in performance between the open-domain task and the medical task, which highlights the need for larger medical datasets.

  • (The above methods and models, which predate LLMs, obtain low scores.)


