# Brief Review — DCN: Dynamic Coattention Networks For Question Answering

## DCN for Question Answering (QA)

Dynamic Coattention Networks For Question Answering, by Salesforce Research

DCN, 2017 ICLR, Over 670 Citations (Sik-Ho Tsang @ Medium)

Question Answering (QA): 2016 [SQuAD 1.0/1.1], 2018 [SQuAD 2.0]

- **Dynamic Coattention Network (DCN)** is proposed for question answering.
- The DCN **first fuses co-dependent representations of the question and the document** in order to focus on relevant parts of both. Then, a **dynamic pointing decoder iterates over potential answer spans.**
- This iterative procedure **enables the model to recover from initial local maxima corresponding to incorrect answers.**
- (Before the invention of LLMs, question answering was a difficult task.)

# Outline

1. **Dynamic Coattention Network (DCN)**
2. **Results**

# 1. Dynamic Coattention Network (DCN)

## 1.1. Document and Question Encoders

Let (x^Q_1, x^Q_2, …, x^Q_n) denote the **sequence of word vectors** corresponding to words in the **question**, and (x^D_1, x^D_2, …, x^D_m) denote the same for words in the **document**. Using an LSTM, the document is encoded as:

d_t = LSTM_enc(d_(t−1), x^D_t)

- The **document encoding matrix** is defined as *D* = [*d*_1, …, *d_m*, *d*∅] of **size** *l*×(*m*+1). A sentinel vector *d*∅ is added, which allows the model to not attend to any particular word.

The **question embeddings** are computed with the same LSTM to share representation power:

q_t = LSTM_enc(q_(t−1), x^Q_t)

- An intermediate question representation becomes *Q′* = [*q*_1, …, *q_n*, *q*∅].

To allow for variation between the question encoding space and the document encoding space, a **non-linear projection layer** is introduced on top of the question encoding. The final representation for the question becomes:

Q = tanh(W^(Q) Q′ + b^(Q))
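The encoder steps above (shared LSTM, sentinel vectors, and the tanh projection) can be sketched as a NumPy shape check. Random matrices stand in for the trained LSTM outputs and weights, so this illustrates only the dimensions, not learned behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
l, m, n = 200, 30, 10  # hidden size, document length, question length

# Stand-ins for the shared LSTM_enc outputs over each sequence.
doc_states = rng.standard_normal((l, m))  # d_1 ... d_m
q_states = rng.standard_normal((l, n))    # q_1 ... q_n

# Append sentinel vectors d_phi / q_phi (trainable in the model, random here),
# which let the model attend to "no particular word".
D = np.concatenate([doc_states, rng.standard_normal((l, 1))], axis=1)        # l x (m+1)
Q_prime = np.concatenate([q_states, rng.standard_normal((l, 1))], axis=1)    # l x (n+1)

# Non-linear projection so the question space can differ from the document space:
# Q = tanh(W_Q Q' + b_Q)
W_Q = rng.standard_normal((l, l)) * 0.01
b_Q = np.zeros((l, 1))
Q = np.tanh(W_Q @ Q_prime + b_Q)

print(D.shape, Q.shape)  # (200, 31) (200, 11)
```

Note the `+1` in both shapes: every later attention matrix inherits the sentinel column.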

## 1.2. Coattentional Encoder

A coattention mechanism is proposed that **attends to the question and document simultaneously**, and finally **fuses** both attention contexts.

- First, the **affinity matrix** is computed: *L* = (*D*^T) × *Q*, where T denotes transpose.

The affinity matrix is **normalized row-wise to produce the attention weights** *A^Q* across the document for each word in the question, and **column-wise to produce the attention weights** *A^D* across the question for each word in the document:

A^Q = softmax(L), A^D = softmax(L^T)

- Next, **the summaries, or attention contexts, of the document are computed** in light of each word of the question: *C^Q* = *D* × *A^Q*.
- **The summaries** *Q* × *A^D* and (*C*^*Q*) × (*A*^*D*) of the question and of the previous attention contexts are also **computed** in light of each word of the document. This can be interpreted as **the mapping of the question encoding into the space of the document encodings.**

- *C^D* = [*Q*; *C^Q*] × *A^D*, **a co-dependent representation of the question and document**, is defined as the coattention context, where [a; b] denotes concatenation.
- The last step is the **fusion of temporal information into the coattention context via a bidirectional LSTM:** u_t = Bi-LSTM(u_(t−1), u_(t+1), [d_t; c^D_t]).
- *U* = [*u*_1, …, *u_m*] is defined as the coattention encoding, which provides **a foundation for selecting which span may be the best possible answer.**
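The full coattention computation reduces to a few matrix products and softmaxes. Below is a minimal NumPy sketch with random stand-ins for the encodings; the Bi-LSTM fusion is omitted, and only its per-position input [d_t; c^D_t] is formed:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
l, m, n = 200, 30, 10
D = rng.standard_normal((l, m + 1))  # document encoding (with sentinel)
Q = rng.standard_normal((l, n + 1))  # question encoding (with sentinel)

L_aff = D.T @ Q                      # affinity matrix, (m+1) x (n+1)
A_Q = softmax(L_aff, axis=0)         # per question word: weights over document positions
A_D = softmax(L_aff.T, axis=0)       # per document word: weights over question positions

C_Q = D @ A_Q                        # document summaries, l x (n+1)
C_D = np.concatenate([Q, C_Q], axis=0) @ A_D  # coattention context, 2l x (m+1)

# Per-position input to the fusion Bi-LSTM: [d_t ; c^D_t], size 3l.
fusion_inputs = np.concatenate([D, C_D], axis=0)
print(C_D.shape, fusion_inputs.shape)  # (400, 31) (600, 31)
```

Each column of `A_Q` sums to one, i.e. every question word induces a proper attention distribution over the document (and its sentinel).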

## 1.3. Dynamic Pointing Decoder

- Due to the nature of **SQuAD**, an intuitive method for producing the answer span is by **predicting the start and end points of the span.** However, given a question-document pair, there **may exist several intuitive answer spans** within the document, each corresponding to a local maximum.

An **iterative technique** is proposed to **select an answer span by alternating between predicting the start point and predicting the end point.**

- Let *h_i*, *s_i*, and *e_i* denote the **hidden state of the LSTM**, the estimate of the start position, and the estimate of the end position during iteration *i*. The **LSTM state update** is: h_i = LSTM_dec(h_(i−1), [u_(s_(i−1)); u_(e_(i−1))]).
- where u_(s_(i−1)) and u_(e_(i−1)) are **the representations corresponding to the previous estimates of the start and end positions** in the coattention encoding *U*.
- Given the current hidden state *h_i*, previous start position u_(s_(i−1)), and previous end position u_(e_(i−1)), **the current start position and end position are estimated:** s_i = argmax_t(α_t), e_i = argmax_t(β_t).

- where *α_t* and *β_t* represent the **start score** and **end score** corresponding to the *t*-th word in the document, computed with **separate neural networks.**

- Based on the strong empirical performance of Maxout Networks and Highway Networks, a **Highway Maxout Network (HMN)** is proposed to compute *α_t* as: α_t = HMN_start(u_t, h_i, u_(s_(i−1)), u_(e_(i−1))).
- Similarly, the **end score** *β_t* is computed, but using a separate HMN_end.
- The HMN is estimated as below:

r = tanh(W^(D) [h_i; u_(s_(i−1)); u_(e_(i−1))])

m_t^(1) = max(W^(1) [u_t; r] + b^(1))

m_t^(2) = max(W^(2) m_t^(1) + b^(2))

HMN(u_t, h_i, u_(s_(i−1)), u_(e_(i−1))) = max(W^(3) [m_t^(1); m_t^(2)] + b^(3))

- where max is taken over the maxout pool dimension, and the first maxout layer's output m_t^(1) is also fed to the last layer as a highway connection.
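One decoder iteration scores every document position with the HMN and takes an argmax. The NumPy sketch below uses random stand-ins for trained weights and shows only the start-score pass (the end score uses an identical, separately parameterized HMN):

```python
import numpy as np

rng = np.random.default_rng(0)
l, p, m = 200, 16, 30  # hidden size, maxout pool size, document length

def maxout(W, b, x):
    # W: (pool, out, in), b: (pool, out); elementwise max over the pool axis.
    return np.max(np.einsum('poi,i->po', W, x) + b, axis=0)

# Hypothetical random parameters standing in for trained weights.
W_D = rng.standard_normal((l, 5 * l)) * 0.01          # input: [h; u_s; u_e] = l + 2l + 2l
W_1, b_1 = rng.standard_normal((p, l, 3 * l)) * 0.01, np.zeros((p, l))
W_2, b_2 = rng.standard_normal((p, l, l)) * 0.01, np.zeros((p, l))
W_3, b_3 = rng.standard_normal((p, 1, 2 * l)) * 0.01, np.zeros((p, 1))

def hmn(u_t, h, u_s, u_e):
    r = np.tanh(W_D @ np.concatenate([h, u_s, u_e]))          # question/position summary
    m1 = maxout(W_1, b_1, np.concatenate([u_t, r]))           # first maxout layer
    m2 = maxout(W_2, b_2, m1)                                 # second maxout layer
    return maxout(W_3, b_3, np.concatenate([m1, m2]))[0]      # highway: reuse m1 at output

# One iteration: score every position, pick the new start estimate.
U = rng.standard_normal((2 * l, m))   # coattention encoding
h = rng.standard_normal(l)            # decoder LSTM state
s_prev, e_prev = 0, m - 1             # previous start/end estimates
alpha = np.array([hmn(U[:, t], h, U[:, s_prev], U[:, e_prev]) for t in range(m)])
s_new = int(np.argmax(alpha))
```

In the real model this loop alternates between start and end predictions for up to 4 iterations, with the LSTM state updated from the newly selected u_s and u_e each time.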

# 2. Results

## 2.1. Some Details

- In practice, **GloVe word vectors** pretrained on the 840B Common Crawl corpus are used.
- **A max sequence length of 600** is used during training, and a hidden state size of 200 is used for all recurrent units, Maxout layers, and linear layers. All LSTMs have randomly initialized parameters and an initial state of zero.
- The maximum number of iterations is set to 4 and a Maxout pool size of 16 is used.
- **EM and F1 metrics** are used.
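EM (exact match) and F1 compare a predicted answer string against the ground truth after normalization (lowercasing, stripping punctuation and articles), with F1 computed over token overlap. A minimal sketch of SQuAD-style scoring:

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase, strip punctuation and articles, collapse whitespace.
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, truth):
    # 1.0 iff the normalized strings are identical.
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    # Harmonic mean of token-level precision and recall.
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))       # 1.0
print(round(f1_score("in Paris, France", "Paris"), 2))        # 0.5
```

The official SQuAD evaluation additionally takes the maximum score over the several ground-truth answers provided per question.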

## 2.2. SQuAD

- At the time of writing, **the single-model DCN** ranks **first at 66.2% exact match and 75.9% F1** on the test data among single-model submissions.
- The **ensemble DCN** ranks first overall at **71.6% exact match** and **80.4% F1** on the test data.

## 2.3. Visualizations

The DCN has the capability to **estimate the start and end points of the answer span multiple times, each time conditioned on its previous estimates.** The model is able to explore local maxima corresponding to multiple plausible answers.

- (Please feel free to read the paper directly for other results.)