Review — ELMo: Deep Contextualized Word Representations

ELMo Blesses You All Happy New Year & Happy Deep Learning !!!

Sik-Ho Tsang
Jan 1, 2022
ELMo: Embeddings from Language Models (Image from here)

Deep Contextualized Word Representations
ELMo, by Allen Institute for Artificial Intelligence, and University of Washington
2018 NAACL, Over 8000 Citations (Sik-Ho Tsang @ Medium)
Language Model

  • Word vectors are learned from the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus.
  • These representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment, and sentiment analysis.


  1. Conventional Bidirectional Language Model
  2. ELMo: Embeddings from Language Models
  3. Using biLMs for Supervised NLP Tasks
  4. Pre-trained Bidirectional Language Model Architecture
  5. Experimental Results

1. Conventional Bidirectional Language Model

  • Given a sequence of N tokens, (t1, t2, …, tN), a forward language model computes the probability of the sequence by modeling the probability of token tk given the history (t1, …, tk-1):
      $$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$
  • At each position k, each LSTM layer outputs a context-dependent representation $\overrightarrow{h}^{LM}_{k,j}$, where j=1, …, L.
  • The top layer LSTM output, $\overrightarrow{h}^{LM}_{k,L}$, is used to predict the next token tk+1 with a Softmax layer.
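  • Similarly, a backward LM runs over the sequence in reverse, predicting each token given its future context:
      $$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$$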
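  • A biLM combines both a forward and backward LM. The formulation jointly maximizes the log likelihood of the forward and backward directions:
      $$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$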
  • The parameters of the token representation (Θx) and of the Softmax layer (Θs) are tied across the two directions, while separate parameters are maintained for the LSTMs in each direction.
  • Overall, this formulation is similar to TagLM, with the exception that some weights are shared between directions instead of using completely independent parameters.
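
As a rough illustration of the forward and backward factorizations and the joint objective described above, here is a minimal Python sketch. The function names and the `uniform_model` stand-in are hypothetical and not from the paper; a real biLM would replace `uniform_model` with LSTM-based conditional probabilities.

```python
import math

def forward_log_likelihood(tokens, cond_prob):
    """Sum over k of log p(t_k | t_1, ..., t_{k-1})."""
    return sum(math.log(cond_prob(tokens[k], tokens[:k]))
               for k in range(len(tokens)))

def backward_log_likelihood(tokens, cond_prob):
    """Sum over k of log p(t_k | t_{k+1}, ..., t_N)."""
    return sum(math.log(cond_prob(tokens[k], tokens[k + 1:]))
               for k in range(len(tokens)))

def uniform_model(token, context):
    # Hypothetical stand-in for a trained LM: ignores the context and
    # assigns probability 1/4 to every token (toy vocabulary of size 4).
    return 0.25

tokens = ["the", "cat", "sat"]

# The biLM objective adds the log likelihoods of both directions.
joint_log_likelihood = (forward_log_likelihood(tokens, uniform_model)
                        + backward_log_likelihood(tokens, uniform_model))
print(joint_log_likelihood)
```

With this toy model every one of the 2N terms equals log(0.25), so the joint value is simply 6 · log(0.25) for the three-token example.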

2. ELMo: Embeddings from Language Models

ELMo: Embeddings from Language Models (Image from BERT)
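  • ELMo is a task-specific combination of the intermediate layer representations in the biLM.
  • For each token tk, an L-layer biLM computes a set of 2L+1 representations:
      $$R_k = \{ x^{LM}_k, \overrightarrow{h}^{LM}_{k,j}, \overleftarrow{h}^{LM}_{k,j} \mid j = 1, \ldots, L \} = \{ h^{LM}_{k,j} \mid j = 0, \ldots, L \}$$
  • where $h^{LM}_{k,0}$ is the token layer and $h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}]$ for each biLSTM layer.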
  • For inclusion in a downstream model, ELMo collapses all layers in R into a single vector:
      $$ELMo_k = E(R_k; \Theta_e)$$
  • In the simplest case, ELMo just selects the top layer as in TagLM and CoVe:
      $$E(R_k) = h^{LM}_{k,L}$$
  • More generally, a task-specific weighting of all biLM layers is computed:
      $$ELMo^{task}_k = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s^{task}_j h^{LM}_{k,j}$$
  • where $s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector. γ is of practical importance to aid the optimization process.
  • Considering that the activations of each biLM layer have a different distribution, in some cases layer normalization (Ba et al., 2016) is applied to each biLM layer before weighting.
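
The task-specific weighting described above (softmax-normalized layer weights plus a scalar γ) can be sketched in a few lines of plain Python. This is only an illustrative sketch: `elmo_vector`, the toy 2-dimensional layers, and the raw weight values are assumptions made for the example, and optional layer normalization is left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_vector(layers, raw_weights, gamma):
    """Collapse L+1 layer vectors for one token into a single vector.

    layers:      list of L+1 vectors (token layer first), equal length
    raw_weights: L+1 unnormalized scalars, softmax-normalized here
    gamma:       scalar that rescales the whole combined vector
    """
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(s[j] * layers[j][d] for j in range(len(layers)))
            for d in range(dim)]

# Toy example: L = 2, so three layers, each of dimension 2 (made-up values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = elmo_vector(layers, raw_weights=[0.0, 0.0, 0.0], gamma=1.0)
print(vec)
```

Equal raw weights give each layer a weight of 1/3, so the result is the plain average of the three layers. Pushing one raw weight far above the others approaches the "top layer only" special case.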

3. Using biLMs for Supervised NLP Tasks

  • Given a sequence of tokens (t1, …, tN), it is standard to form a context-independent token representation xk for each token position using pre-trained word embeddings.
  • To add ELMo to the supervised model, the weights of the biLM are frozen, the ELMo vector $ELMo^{task}_k$ is concatenated with xk, and the ELMo-enhanced representation $[x_k; ELMo^{task}_k]$ is passed into the task RNN.
  • For some tasks (e.g., SNLI, SQuAD), further improvements are observed by also including ELMo at the output of the task RNN, by introducing another set of output-specific linear weights and replacing hk with $[h_k; ELMo^{task}_k]$.
  • It is also found that it is beneficial to add a moderate amount of dropout to ELMo, and in some cases to regularize the ELMo weights using weight decay.
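
To make the concatenation step concrete, here is a small sketch of how ELMo vectors could be joined with context-independent token embeddings position by position. The helper names and all numeric values are hypothetical; the same helper would apply at the task RNN output, using the hidden states hk and a separately weighted ELMo vector.

```python
def concat(a, b):
    """Concatenate two vectors represented as Python lists."""
    return list(a) + list(b)

def concat_per_position(base_vectors, elmo_vectors):
    """Build [base_k; ELMo_k] for every token position k."""
    if len(base_vectors) != len(elmo_vectors):
        raise ValueError("one ELMo vector is needed per token position")
    return [concat(b, e) for b, e in zip(base_vectors, elmo_vectors)]

# Toy values: two tokens, 2-dim word embeddings, 3-dim ELMo vectors.
x = [[0.1, 0.2], [0.3, 0.4]]
elmo_in = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]

# Input side: [x_k; ELMo_k] is what the task RNN would consume.
enhanced_inputs = concat_per_position(x, elmo_in)
print(enhanced_inputs)
```

Because the biLM weights are frozen, only the layer weights and γ behind each ELMo vector, plus the task model itself, would be trained in this setup.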

4. Pre-trained Bidirectional Language Model Architecture

  • The pre-trained biLMs in this paper are similar to the architectures in Jozefowicz arXiv’16, and LSTM-Char-CNN, but modified to support joint training of both directions and add a residual connection between LSTM layers.
  • All embedding and hidden dimensions are halved from the single best model CNN-BIG-LSTM in Jozefowicz arXiv’16.
  • Specifically, the final model uses L=2 biLSTM layers with 4096 units and 512-dimensional projections, and a residual connection from the first to the second layer.
  • The context-insensitive type representation uses 2048 character n-gram convolutional filters followed by two highway layers as proposed by Highway Network, and a linear projection down to a 512-dimensional representation.

As a result, the biLM provides three layers of representations for each input token.
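
The hyperparameters listed above can be collected into a small configuration object to check the layer counts. This is a hedged sketch: `BiLMConfig` and its property names are invented for illustration and do not correspond to any released implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BiLMConfig:
    """Hypothetical container for the hyperparameters quoted in the text."""
    num_lstm_layers: int = 2      # L
    lstm_units: int = 4096        # hidden units per LSTM direction
    projection_dim: int = 512     # projection size per direction
    char_cnn_filters: int = 2048  # character n-gram convolutional filters
    highway_layers: int = 2

    @property
    def num_raw_representations(self) -> int:
        # One token vector plus a forward and a backward vector per layer.
        return 2 * self.num_lstm_layers + 1

    @property
    def num_representation_layers(self) -> int:
        # Forward and backward states are concatenated per biLSTM layer,
        # which leaves the token layer plus L biLSTM layers.
        return self.num_lstm_layers + 1

    @property
    def bilstm_output_dim(self) -> int:
        # Concatenation of the two projected directions.
        return 2 * self.projection_dim

cfg = BiLMConfig()
print(cfg.num_raw_representations, cfg.num_representation_layers,
      cfg.bilstm_output_dim)
```

With L=2 this gives 2L+1 = 5 raw vectors that collapse into the three layers of representations mentioned above.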

  • After training for 10 epochs on the 1B Word Benchmark, the average of the forward and backward perplexities is 39.7, compared to 30.0 for the forward CNN-BIG-LSTM in Jozefowicz arXiv’16.
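
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log likelihood per token. The sketch below shows the calculation and a simple average over the two directions. The probabilities are made-up toy values, and taking the arithmetic mean of the two perplexities is an assumption about how the reported average is formed.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log probability assigned to each token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Toy per-token probabilities from a hypothetical forward and backward LM.
forward_probs = [0.25, 0.25, 0.25, 0.25]
backward_probs = [0.5, 0.5, 0.5, 0.5]

avg_perplexity = (perplexity(forward_probs) + perplexity(backward_probs)) / 2
print(avg_perplexity)
```

A model that always assigns probability 1/4 has perplexity 4, so lower values mean the model is less "surprised" by the text.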

Once pretrained, the biLM can compute representations for any task, and in some cases it can additionally be fine-tuned on domain-specific data.

5. Experimental Results

5.1. SOTA Results

Test set comparison of ELMo enhanced neural models with state-of-the-art single model baselines across six benchmark NLP tasks.

In every task considered, simply adding ELMo establishes a new state-of-the-art result, with relative error reductions ranging from 6–20% over strong base models.
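
Relative error reduction measures how much of the baseline's remaining error is removed, rather than the raw gain in score. A minimal sketch of the arithmetic follows; the scores are hypothetical and are not taken from the paper's table.

```python
def relative_error_reduction(baseline_score, new_score, max_score=100.0):
    """Fraction of the baseline error removed by the new model.

    Scores are on a 0..max_score scale (e.g. accuracy or F1 in percent).
    """
    baseline_error = max_score - baseline_score
    new_error = max_score - new_score
    return (baseline_error - new_error) / baseline_error

# Hypothetical numbers: a baseline at 90.0 improved to 91.0.
reduction = relative_error_reduction(90.0, 91.0)
print(f"{reduction:.1%}")
```

In this toy case a one-point absolute gain removes one tenth of the remaining error, which is why small absolute gains on strong baselines can still be large relative reductions.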

5.2. Where to Include ELMo?

Development set performance for SQuAD, SNLI and SRL when including ELMo at different locations in the supervised model.
  • All of the task architectures in this paper include word embeddings only as input to the lowest layer biRNN.

However, including ELMo at the output of the biRNN in task-specific architectures is found to improve overall results for some tasks.

5.3. Word Sense Disambiguation (WSD)

All-words fine grained WSD F1
  • Overall, the biLM top layer representations achieve an F1 of 69.0 and are better at WSD than the first layer. This is competitive with a state-of-the-art WSD-specific supervised model using hand-crafted features.

The proposed biLM outperforms the CoVe biLSTM.

5.4. POS Tagging

Test set POS tagging accuracies for PTB
  • Unlike WSD, accuracies using the first biLM layer are higher than the top layer.

But just like for WSD, the biLM achieves higher accuracies than the CoVe encoder.

The biLM’s representations are more transferable to WSD and POS tagging than those in CoVe.

5.5. Sample efficiency

Comparison of baseline vs. ELMo performance for SNLI and SRL as the training set size is varied from 0.1% to 100%.
  • ELMo-enhanced models use smaller training sets more efficiently than models without ELMo.
  • In the SRL case, the ELMo model with 1% of the training set has about the same F1 as the baseline model with 10% of the training set.
  • (There are a lot of results not yet presented. Please feel free to read the paper directly.)


[2018 NAACL] [ELMo]
Deep Contextualized Word Representations

Natural Language Processing (NLP)

Language/Sequence Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] 2017 [TagLM] [CoVe] [MoE] 2018 [GLUE] [T-DMCA] [GPT] [ELMo]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer] [MoE]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]
