# Review — ELMo: Deep Contextualized Word Representations

## ELMo Blesses You All Happy New Year & Happy Deep Learning !!!

**Deep Contextualized Word Representations** (ELMo), by Allen Institute for Artificial Intelligence and University of Washington. 2018 NAACL, over 8000 citations (Sik-Ho Tsang @ Medium)

Language Model

- **Word vectors** are learned from **the internal states of a deep bidirectional language model (biLM)**, which is **pretrained on a large text corpus**.
- These representations can be **easily added to existing models** and **significantly improve the state of the art across six challenging NLP problems**, including *question answering, textual entailment,* and *sentiment analysis*.

# Outline

1. **Conventional Bidirectional Language Model**
2. **ELMo: Embeddings from Language Models**
3. **Using biLMs for Supervised NLP Tasks**
4. **Pre-trained Bidirectional Language Model Architecture**
5. **Experimental Results**

# 1. Conventional Bidirectional Language Model

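- Given a sequence of *N* tokens, $(t_1, t_2, \dots, t_N)$, a **forward language model** computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \dots, t_{k-1})$:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})$$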

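- **At each position $k$, each LSTM layer outputs a context-dependent representation** $\overrightarrow{\mathbf{h}}^{LM}_{k,j}$, where $j = 1, \dots, L$.
- **The top layer LSTM output**, $\overrightarrow{\mathbf{h}}^{LM}_{k,L}$, is used to predict the next token $t_{k+1}$ with a Softmax layer.
- Similarly, a **backward LM** runs over the sequence in reverse and predicts each token given its future context:

$$p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)$$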

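- A **biLM** combines both a forward and backward LM. The formulation **jointly maximizes the log likelihood of the forward and backward directions**:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

- The parameters of the token representation ($\Theta_x$) and of the Softmax layer ($\Theta_s$) are tied across the two directions, while **separate parameters are maintained for the LSTMs in each direction**.
- Overall, this formulation is **similar to** **TagLM**, with the exception that some weights are shared between directions instead of using completely independent parameters.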
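As a rough illustration of the joint objective, the sketch below sums the forward and backward log-probabilities over one toy sequence. The function name `bilm_log_likelihood` and the probability values are hypothetical. In the real model these per-token probabilities would come from the Softmax layer on top of each direction's LSTM.

```python
import math


def bilm_log_likelihood(forward_probs, backward_probs):
    """Sum of forward and backward log-likelihoods for one sequence.

    forward_probs[k]  stands in for p(t_k | t_1, ..., t_{k-1}).
    backward_probs[k] stands in for p(t_k | t_{k+1}, ..., t_N).
    Both lists are hypothetical inputs, not outputs of a trained model.
    """
    if len(forward_probs) != len(backward_probs):
        raise ValueError("both directions must cover the same tokens")
    return sum(
        math.log(pf) + math.log(pb)
        for pf, pb in zip(forward_probs, backward_probs)
    )


# Toy 3-token sequence with made-up probabilities.
fwd = [0.5, 0.25, 0.5]
bwd = [0.25, 0.5, 0.5]
ll = bilm_log_likelihood(fwd, bwd)
print(ll)
```

Training maximizes this quantity (summed over the corpus), so both directions contribute equally to the gradient of the shared token and Softmax parameters.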

# 2. ELMo: Embeddings from Language Models

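- ELMo is a task-specific **combination of the intermediate layer representations in the biLM**.
- For each token $t_k$, an $L$-layer biLM computes a set of $2L+1$ representations:

$$R_k = \{\mathbf{x}^{LM}_k, \overrightarrow{\mathbf{h}}^{LM}_{k,j}, \overleftarrow{\mathbf{h}}^{LM}_{k,j} \mid j = 1, \dots, L\} = \{\mathbf{h}^{LM}_{k,j} \mid j = 0, \dots, L\}$$

- where $\mathbf{h}^{LM}_{k,0}$ is the token layer and $\mathbf{h}^{LM}_{k,j} = [\overrightarrow{\mathbf{h}}^{LM}_{k,j}; \overleftarrow{\mathbf{h}}^{LM}_{k,j}]$ for each biLSTM layer.
- For inclusion in a downstream model, ELMo collapses all layers in $R$ into a single vector:

$$\mathbf{ELMo}_k = E(R_k; \Theta_e)$$

- More generally, a task-specific weighting of all biLM layers is computed:

$$\mathbf{ELMo}^{task}_k = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s^{task}_j \mathbf{h}^{LM}_{k,j}$$

- where $s^{task}$ are **softmax-normalized weights** and the **scalar parameter** $\gamma^{task}$ allows the task model to scale the entire ELMo vector. $\gamma$ is of practical importance to aid the optimization process.
- Considering that the activations of each biLM layer have a different distribution, in some cases layer normalization (Ba et al., 2016) is applied to each biLM layer before weighting.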
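The task-specific weighting can be sketched in a few lines of plain Python. This is a minimal sketch for a single token, assuming the layer vectors are already computed. The names `softmax` and `elmo_vector` and the toy values are hypothetical, and layer normalization is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_vector(layers, raw_weights, gamma):
    """Collapse the L+1 layer vectors of one token into one ELMo vector.

    layers:      list of L+1 equal-length vectors h_{k,0}, ..., h_{k,L}
    raw_weights: unnormalized scalars, one per layer (softmax gives s_j)
    gamma:       task-specific scale applied to the whole vector
    """
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
        for d in range(dim)
    ]


# Toy example: token layer + 2 biLSTM layers, 2 dimensions each.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_vector(layers, raw_weights=[0.0, 0.0, 0.0], gamma=3.0)
print(mixed)
```

With all raw weights equal, the softmax gives each layer the same weight, so the result is just the scaled layer average. In a task model, `raw_weights` and `gamma` would be learned alongside the task parameters.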

# 3. Using biLMs for Supervised NLP Tasks

- Given a sequence of tokens $(t_1, \dots, t_N)$, **it is standard to form a context-independent token representation** $\mathbf{x}_k$ for each token position using pre-trained word embeddings.
- To add ELMo to the supervised model, the weights of the biLM are frozen, the **ELMo vector** $\mathbf{ELMo}^{task}_k$ is concatenated with $\mathbf{x}_k$, and **this ELMo-enhanced representation** $[\mathbf{x}_k; \mathbf{ELMo}^{task}_k]$ is passed into the task RNN.
- For some tasks (e.g., SNLI, SQuAD), **further improvements** are observed by also **including ELMo at the output of the task RNN** by introducing another set of output-specific linear weights and **replacing** $\mathbf{h}_k$ with $[\mathbf{h}_k; \mathbf{ELMo}^{task}_k]$.
- It is also found that it is beneficial to add a moderate amount of dropout to ELMo, and in some cases to regularize the ELMo weights using weight decay.
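The input-side concatenation described above amounts to joining two vectors per token position. The sketch below assumes both sequences are already computed and stored as plain lists. The helper name `enhance_inputs` and the toy vectors are hypothetical.

```python
def enhance_inputs(token_embeddings, elmo_vectors):
    """Build [x_k; ELMo_k] for every position k.

    token_embeddings: list of context-independent vectors x_k
    elmo_vectors:     list of ELMo vectors, one per position
    Returns a new list; the inputs are not modified.
    """
    if len(token_embeddings) != len(elmo_vectors):
        raise ValueError("sequences must have the same length")
    # List concatenation places the ELMo vector after x_k.
    return [x + e for x, e in zip(token_embeddings, elmo_vectors)]


# Toy sequence of 2 tokens: 2-d word embeddings, 3-d ELMo vectors.
x = [[0.1, 0.2], [0.3, 0.4]]
elmo = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
enhanced = enhance_inputs(x, elmo)
print(enhanced)
```

The task RNN then only needs a wider input dimension. The same pattern applies on the output side, where the RNN state $\mathbf{h}_k$ takes the place of $\mathbf{x}_k$.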

# 4. Pre-trained Bidirectional Language Model Architecture

- **The pre-trained biLMs** in this paper are **similar to the architectures in** **Jozefowicz arXiv’16** and **LSTM-Char-CNN**, but modified to support joint training of both directions and **add a residual connection between LSTM layers**.
- All embedding and hidden dimensions are halved from the single best model CNN-BIG-LSTM in Jozefowicz arXiv’16.
- Specifically, the final model uses *L*=2 biLSTM layers with 4096 units and 512-dimension projections, and **a residual connection** from the first to second layer.
- The context-insensitive type representation uses **2048 character n-gram convolutional filters followed by two highway layers as proposed by** **Highway Network**, and **a linear projection down to a 512 representation**.

As a result, the biLM provides **three layers of representations** for each input token.

- After training for 10 epochs on the 1B Word Benchmark, **the average of the forward and backward perplexities is 39.7**, compared to 30.0 for the forward CNN-BIG-LSTM in Jozefowicz arXiv’16.
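For readers less familiar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, and the reported figure averages the two directions. The sketch below uses made-up per-token probabilities and is not meant to reproduce the paper's number.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token log-probabilities for each direction.
fwd_log_probs = [math.log(0.25)] * 4
bwd_log_probs = [math.log(0.5)] * 4

avg_ppl = (perplexity(fwd_log_probs) + perplexity(bwd_log_probs)) / 2
print(avg_ppl)
```

Lower is better, so the halved biLM trades some language-modeling quality for a smaller model that still covers both directions.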

Once pretrained, the biLM can compute representations for any task. In some cases, fine-tuning the biLM on domain-specific data further improves downstream task performance.

# 5. Experimental Results

## 5.1. SOTA Results

In every task considered, **simply adding ELMo** establishes a new state-of-the-art result, with **relative error reductions ranging from 6–20%** over strong base models.
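Relative error reduction compares the remaining error rather than the raw score, which is why a gain of a few points on a strong baseline can look large. A small sketch, with a hypothetical helper `relative_error_reduction` and made-up scores rather than numbers from the paper:

```python
def relative_error_reduction(baseline_score, new_score, max_score=100.0):
    """Fraction of the baseline's error removed by the new model.

    Scores are assumed to be on a 0..max_score scale where higher is
    better, so error is measured as max_score - score.
    """
    baseline_error = max_score - baseline_score
    new_error = max_score - new_score
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: a baseline at 90.0 improved to 92.0.
rer = relative_error_reduction(90.0, 92.0)
print(f"{rer:.0%}")
```

Here a 2-point absolute gain removes one fifth of the baseline's remaining error.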

## 5.2. Where to Include ELMo?

- All of the task architectures in this paper include word embeddings only as input to the lowest layer biRNN.

However, it is found that **including ELMo at the output of the biRNN in task-specific architectures improves overall results** for some tasks.

## 5.3. Word Sense Disambiguation (WSD)

**Overall, the biLM top layer representations have F1 of 69.0 and are better at WSD than the first layer.** This is competitive with a state-of-the-art WSD-specific supervised model using hand-crafted features.

The proposed biLM outperforms the CoVe biLSTM.

## 5.4. POS Tagging

- Unlike WSD, accuracies using the first biLM layer are higher than the top layer.

But just like for WSD, the biLM achieves higher accuracies than the CoVe encoder.

The biLM’s representations are more transferable to WSD and POS tagging than those in CoVe.

## 5.5. Sample efficiency

- **ELMo-enhanced models use smaller training sets more efficiently** than models without ELMo.
- In the SRL case, **the ELMo model with 1% of the training set has about the same F1 as the baseline model with 10% of the training set**.
- (There are a lot of results not yet presented. Please feel free to read the paper directly.)

## Reference

[2018 NAACL] [ELMo]

Deep Contextualized Word Representations

## Natural Language Processing (NLP)

**Language/Sequence Model:** **2007** [Bengio TNN’07] **2013** [Word2Vec] [NCE] [Negative Sampling] **2014** [GloVe] [GRU] [Doc2Vec] **2015** [Skip-Thought] **2016** [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] **2017** [TagLM] [CoVe] [MoE] **2018** [GLUE] [T-DMCA] [GPT] [ELMo]

**Machine Translation:** **2014** [Seq2Seq] [RNN Encoder-Decoder] **2015** [Attention Decoder/RNNSearch] **2016** [GNMT] [ByteNet] [Deep-ED & Deep-Att] **2017** [ConvS2S] [Transformer] [MoE]

**Image Captioning:** **2015** [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]