Brief Review — DCN: Dynamic Coattention Networks For Question Answering
DCN for Question Answering (QA)
Dynamic Coattention Networks For Question Answering
DCN, by Salesforce Research
2017 ICLR, Over 670 Citations (Sik-Ho Tsang @ Medium)
Question Answering (QA): 2016 [SQuAD 1.0/1.1], 2018 [SQuAD 2.0]
- Dynamic Coattention Network (DCN) is proposed for question answering.
- The DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both. Then, a dynamic pointing decoder iterates over potential answer spans.
- This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers.
- (Before the advent of LLMs, question answering was a hard task.)
Outline
- Dynamic Coattention Network (DCN)
- Results
1. Dynamic Coattention Network (DCN)
1.1. Document and Question Encoders
Let (xQ1, xQ2, …, xQn) denote the sequence of word vectors corresponding to words in the question and (xD1, xD2, …, xDm) denote the same for words in the document. Using an LSTM, the document is encoded as:
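d_t = LSTM_enc(d_{t-1}, x_t^D)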
- The document encoding matrix is defined as D=[d1, …, dm, d∅] of size l×(m+1). A sentinel vector d∅ is added, which allows the model to not attend to any particular word.
The question embeddings are computed with the same LSTM to share representation power:
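q_t = LSTM_enc(q_{t-1}, x_t^Q)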
- This gives an intermediate question representation Q’=[q1, …, qn, q∅], again with a sentinel vector q∅.
To allow for variation between the question encoding space and the document encoding space, a non-linear projection layer is introduced on top of the question encoding. The final representation for the question becomes:
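Q = tanh(W^(Q) Q’ + b^(Q)), where W^(Q) and b^(Q) are learnable parameters.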
1.2. Coattentional Encoder
A coattention mechanism is proposed that attends to the question and document simultaneously, and finally fuses both attention contexts.
- First, the affinity matrix L = (D^T)×Q is computed, where ^T denotes transpose.
The affinity matrix is normalized row-wise to produce the attention weights A^Q across the document for each word in the question, and column-wise to produce the attention weights A^D across the question for each word in the document:
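A^Q = softmax(L) ∈ R^{(m+1)×(n+1)},   A^D = softmax(L^T) ∈ R^{(n+1)×(m+1)}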
- Next, the summaries, or attention contexts, of the document are computed in light of each word of the question:
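C^Q = D×(A^Q) ∈ R^{l×(n+1)}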
- The summaries Q×(A^D) of the question, as well as the summaries (C^Q)×(A^D) of the previous attention contexts, are also computed in light of each word of the document. One possible interpretation of (C^Q)×(A^D) is the mapping of the question encoding into the space of the document encodings.
- C^D, a co-dependent representation of the question and the document, is then defined as the coattention context, where [a; b] denotes concatenation:
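C^D = [Q; C^Q]×(A^D) ∈ R^{2l×(m+1)}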
- The last step is the fusion of temporal information to the coattention context via a bidirectional LSTM:
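u_t = Bi-LSTM(u_{t-1}, u_{t+1}, [d_t; c_t^D]) ∈ R^{2l}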
- U = [u1, …, um] is defined, which provides a foundation for selecting which span may be the best possible answer, as the coattention encoding.
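Putting Section 1.2 together, here is a minimal PyTorch-style sketch of the coattention encoder. It assumes a single (unbatched) example with hidden size l, and it omits the sentinel vectors, padding masks, and dropout used in practice; the function and variable names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coattention_encoder(D, Q, fusion_bilstm):
    """D: (l, m) document encoding; Q: (l, n) question encoding;
    fusion_bilstm: nn.LSTM(input_size=3*l, hidden_size=l,
                           bidirectional=True, batch_first=True)."""
    L = D.t() @ Q                           # affinity matrix, (m, n)
    A_Q = F.softmax(L, dim=0)               # attention over document words, (m, n)
    A_D = F.softmax(L.t(), dim=0)           # attention over question words, (n, m)

    C_Q = D @ A_Q                           # document summaries per question word, (l, n)
    C_D = torch.cat([Q, C_Q], dim=0) @ A_D  # coattention context, (2l, m)

    # Fuse temporal information with a bidirectional LSTM over [d_t ; c_t^D].
    fusion_in = torch.cat([D, C_D], dim=0).t().unsqueeze(0)  # (1, m, 3l)
    U, _ = fusion_bilstm(fusion_in)         # coattention encoding, (1, m, 2l)
    return U.squeeze(0)                     # U = [u_1, ..., u_m], each u_t in R^{2l}

# Example with the paper's sizes: l = 200, document length 600, question length 30.
l, m, n = 200, 600, 30
D, Q = torch.randn(l, m), torch.randn(l, n)
bilstm = nn.LSTM(3 * l, l, bidirectional=True, batch_first=True)
U = coattention_encoder(D, Q, bilstm)       # shape (600, 400)
```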
1.3. Dynamic Pointing Decoder
- Due to the nature of SQuAD, an intuitive method for producing the answer span is to predict the start and end points of the span. However, given a question-document pair, there may exist several intuitive answer spans within the document, each corresponding to a local maximum.
An iterative technique is proposed to select an answer span by alternating between predicting the start point and predicting the end point.
- Let hi, si, and ei denote the hidden state of the LSTM, the estimate of the start position, and the estimate of the end position during iteration i. The LSTM state update is:
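h_i = LSTM_dec(h_{i-1}, [u_{s_{i-1}}; u_{e_{i-1}}])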
- where usi-1 and uei-1 are the representations in the coattention encoding U corresponding to the previous estimates of the start and end positions.
- Given the current hidden state hi, previous start position usi-1, and previous end position uei-1, the current start position and end position are estimated:
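s_i = argmax_t(α_1, …, α_m),   e_i = argmax_t(β_1, …, β_m)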
- where αt and βt represent the start score and end score corresponding to the t-th word in the document, computed with separate neural networks.
- Based on the strong empirical performance of Maxout Networks and Highway Networks, a Highway Maxout Network (HMN) is proposed to compute αt as:
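α_t = HMN_start(u_t, h_i, u_{s_{i-1}}, u_{e_{i-1}})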
- Similarly, the end score, βt, is computed but using a separate HMNend.
- The HMN model, as shown above, is computed as follows:
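r = tanh(W^(D) [h_i; u_{s_{i-1}}; u_{e_{i-1}}])
m_t^(1) = max(W^(1) [u_t; r] + b^(1))
m_t^(2) = max(W^(2) m_t^(1) + b^(2))
HMN(u_t, h_i, u_{s_{i-1}}, u_{e_{i-1}}) = max(W^(3) [m_t^(1); m_t^(2)] + b^(3))
- Here, max is a maxout over a pool of p linear pieces, and the final layer takes both m_t^(1) and m_t^(2) as input, which is the highway connection.

As a concrete illustration, below is a minimal PyTorch-style sketch of such a Highway Maxout Network for the start score. The shapes follow the paper (hidden size l, maxout pool size p), but the module and its names are an assumption for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class HighwayMaxout(nn.Module):
    """Sketch of HMN_start; a separate instance would be used for HMN_end."""
    def __init__(self, l, p):
        super().__init__()
        self.l, self.p = l, p
        self.W_D = nn.Linear(5 * l, l, bias=False)  # r = tanh(W_D [h_i; u_s; u_e])
        self.W_1 = nn.Linear(3 * l, p * l)          # first maxout layer
        self.W_2 = nn.Linear(l, p * l)              # second maxout layer
        self.W_3 = nn.Linear(2 * l, p)              # final maxout layer -> one score

    def _maxout(self, x, out_dim):
        # (m, p*out_dim) -> (m, p, out_dim), then max over the pool dimension
        return x.view(-1, self.p, out_dim).max(dim=1).values

    def forward(self, U, h, u_s, u_e):
        """U: (m, 2l) coattention encoding; h: (l,) decoder LSTM state;
        u_s, u_e: (2l,) encodings at the previous start/end estimates."""
        m = U.size(0)
        r = torch.tanh(self.W_D(torch.cat([h, u_s, u_e])))            # (l,)
        r = r.unsqueeze(0).expand(m, -1)                               # broadcast to (m, l)
        m1 = self._maxout(self.W_1(torch.cat([U, r], dim=1)), self.l)  # (m, l)
        m2 = self._maxout(self.W_2(m1), self.l)                        # (m, l)
        # Highway connection: the output layer sees both m1 and m2.
        return self._maxout(self.W_3(torch.cat([m1, m2], dim=1)), 1).squeeze(1)
```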
2. Results
2.1. Some Details
- In practice, GloVe word vectors pretrained on the 840B Common Crawl corpus are used.
- A max sequence length of 600 is used during training and a hidden state size of 200 is used for all recurrent units, Maxout layers, and linear layers. All LSTMs have randomly initialized parameters and an initial state of zero.
- The maximum number of iterations is set to 4 and a Maxout pool size of 16 is used.
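Collected in one place, these settings correspond to a configuration roughly like the one below; the dictionary and its keys are only an illustrative summary, not a released config file.

```python
# Illustrative summary of the training setup described above (hypothetical names).
dcn_config = dict(
    word_vectors="GloVe, pretrained on 840B Common Crawl (fixed)",
    max_sequence_length=600,
    hidden_size=200,          # all recurrent units, maxout layers, linear layers
    decoder_max_iterations=4,
    maxout_pool_size=16,
    metrics=("EM", "F1"),
)
```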
- EM and F1 metrics are used.
2.2. SQuAD
- At the time of writing, the single-model DCN ranks first at 66.2% exact match and 75.9% F1 on the test data among single-model submissions.
- The ensemble DCN ranks first overall at 71.6% exact match and 80.4% F1 on the test data.
2.3. Visualizations
The DCN has the capability to estimate the start and end points of the answer span multiple times, each time conditioned on its previous estimates. The model is able to explore local maxima corresponding to multiple plausible answers.
- (Please feel free to read the paper directly for other results.)