Review: Semi-Supervised Sequence Tagging with Bidirectional Language Models (TagLM)

Sequence Tagging Using Bidirectional LSTM and Pretrained Language Model

BIO tags mark the Beginning, Inside, and Outside of entities
  • The Language-Model Augmented Sequence Tagger (TagLM) adds pretrained context embeddings from bidirectional language models to NLP systems, and is applied here to sequence labeling tasks.
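
To make the BIO scheme from the caption above concrete, here is a minimal sketch that turns a BIO tag sequence into entity spans. The sentence, the tag names (PER, LOC) and the helper name bio_to_spans are illustrative assumptions, not taken from the paper.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The current span continues only on "I-X" with the same label X.
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.append((label, start, i))
                start, label = None, None
            if tag != "O":
                # "B-X" opens a span; a stray "I-X" is treated as an opening tag too.
                start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example sentence and tags.
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
print(bio_to_spans(tags))  # [('PER', 0, 2), ('LOC', 3, 5)]
```

A sequence tagger predicts one such tag per token, so the entity boundaries fall out of the tag sequence itself.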


  1. TagLM Overview
  2. TagLM: Network Architecture
  3. Experimental Results

1. TagLM Overview

TagLM Overview

2. TagLM: Network Architecture

Overview of TagLM

2.1. Token Representation (Bottom-Left)

  • Given a sentence of tokens (t1, t2, …, tN), the model first forms a representation xk for each token by concatenating a character-based representation ck with a token embedding wk, i.e. xk = [ck; wk].
  • The token embeddings wk are obtained from a lookup table E(.), initialized with pre-trained word embeddings and fine-tuned during training.
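
The concatenation described above can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: the tiny embedding table, the 2-d character vectors derived from character codes, and the mean-pooling that stands in for the learned character-level model.

```python
# Hypothetical 3-d word embedding table; a real model would load pre-trained
# vectors and fine-tune them during training.
WORD_EMB = {
    "Paris": [0.2, -0.1, 0.5],
    "<unk>": [0.0, 0.0, 0.0],
}


def char_repr(token):
    """Stand-in for c_k: mean-pool a toy 2-d vector for each character."""
    vecs = [[(ord(ch) % 7) / 7.0, (ord(ch) % 3) / 3.0] for ch in token]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def token_embedding(token):
    """w_k: table lookup E(.) with an unknown-word fallback."""
    return WORD_EMB.get(token, WORD_EMB["<unk>"])


def token_repr(token):
    """x_k = [c_k; w_k]: concatenate the character-based and word-level vectors."""
    return char_repr(token) + token_embedding(token)


x = token_repr("Paris")
print(len(x))  # 2 character dims + 3 word dims = 5
```

The point is only the shape: each token ends up with one vector whose first part comes from its characters and whose second part comes from the lookup table.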

2.2. Sequence Representation (Top-Left)

  • To learn a context sensitive representation, multiple layers of bidirectional RNNs are used.
  • For each token position, k, the hidden state hk,i of RNN layer i is formed by concatenating the hidden states from the forward and backward RNNs. As a result, the bidirectional RNN is able to use both past and future information to make a prediction at token k.
  • In this paper, L = 2 layers of RNNs are used, either GRU or LSTM.
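  • More formally, the first RNN layer operates on xk and outputs hk,1 = [→hk,1; ←hk,1], where the forward state →hk,1 is computed from xk and →hk−1,1, and the backward state ←hk,1 from xk and ←hk+1,1.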
  • Finally, the output of the final RNN layer hk,L is used to predict a score for each possible tag using a single dense layer.
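
As a rough sketch of the stacked bidirectional layers described above, the code below uses a plain tanh recurrence in place of a GRU or LSTM cell, and omits the final dense layer. All weights and inputs are made-up numbers, and the function names are hypothetical.

```python
import math


def rnn_step(x, h_prev, w_x, w_h):
    """One simple recurrent step: h = tanh(W_x x + W_h h_prev)."""
    return [
        math.tanh(
            sum(a * b for a, b in zip(row_x, x))
            + sum(a * b for a, b in zip(row_h, h_prev))
        )
        for row_x, row_h in zip(w_x, w_h)
    ]


def run_rnn(xs, w_x, w_h):
    """Run the recurrence over xs in order and return every hidden state."""
    h = [0.0] * len(w_h)
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h)
        states.append(h)
    return states


def bidirectional_layer(xs, fwd, bwd):
    """h_k = [forward h_k; backward h_k] for every position k."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]  # re-align to the original order
    return [f + b for f, b in zip(forward, backward)]


# Toy sentence: 3 tokens, 2-d x_k, hidden size 2 per direction.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fwd = ([[0.5, -0.3], [0.1, 0.8]], [[0.2, 0.0], [0.0, 0.2]])
bwd = ([[-0.4, 0.6], [0.7, 0.1]], [[0.1, 0.0], [0.0, 0.1]])
layer1 = bidirectional_layer(xs, fwd, bwd)

# Second layer (L = 2) consumes the 4-d outputs of the first layer.
fwd2 = ([[0.1] * 4, [-0.2] * 4], [[0.3, 0.0], [0.0, 0.3]])
bwd2 = ([[0.2] * 4, [0.1] * 4], [[0.3, 0.0], [0.0, 0.3]])
layer2 = bidirectional_layer(layer1, fwd2, bwd2)
print(len(layer2), len(layer2[0]))  # 3 4
```

The forward half of each hk depends only on tokens up to k and the backward half only on tokens from k onward, which is why the concatenated state sees both past and future context.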
CRF
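
A CRF output layer scores whole tag sequences rather than each token independently, and the best sequence is typically found with the Viterbi algorithm. The sketch below assumes per-token tag scores (for example from the dense layer) and a tag-to-tag transition table. All numbers and the three-tag inventory are invented for illustration.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[k][t]   : score of tag t at position k
    transitions[s][t] : score of moving from tag s to tag t
    """
    n_tags = len(tags)
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for t in range(n_tags):
            # Best previous tag for ending at tag t in this position.
            best_prev = max(range(n_tags), key=lambda s: score[s] + transitions[s][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + emit[t])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final tag.
    best = max(range(n_tags), key=lambda t: score[t])
    path = [best]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return [tags[t] for t in reversed(path)]


tags = ["O", "B-PER", "I-PER"]
# Made-up scores: taken independently, the argmax would be O then I-PER.
emissions = [[1.0, 0.8, 0.0], [0.0, 0.2, 1.0]]
# Made-up transitions: moving from O directly to I-PER is heavily penalized.
transitions = [
    [0.0, 0.0, -10.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(viterbi(emissions, transitions, tags))  # ['B-PER', 'I-PER']
```

Joint decoding repairs the invalid "O, I-PER" sequence that independent per-token predictions would produce here.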

2.3. Bidirectional LM (Right)

  • The forward LM embedding of the token at position k is the output of the top LSTM layer of the forward language model. During pre-training, the language model predicts the probability of token tk+1 using a softmax layer over the words in the vocabulary.
  • The backward LM embedding is obtained in the same way from a language model that runs over the sentence in reverse, so it captures the future context.
  • Both LMs are pretrained on a large unlabeled corpus.
  • After pre-training the forward and backward LMs separately, the top layer softmax is removed and the forward and backward LM embeddings are concatenated.
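
The two roles of the LM states described above (next-token prediction during pre-training, then concatenation once the softmax is removed) can be sketched as follows. The hidden states, the three-word vocabulary and the softmax weights are all made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up top-layer LM states for a 3-token sentence (2-d per direction).
forward_states = [[0.1, 0.4], [0.3, -0.2], [0.5, 0.0]]
backward_states = [[-0.3, 0.2], [0.0, 0.6], [0.4, 0.1]]

# Pre-training: a softmax over the vocabulary maps the forward state at
# position k to a distribution over token k+1 (hypothetical vocab and weights).
vocab = ["the", "cat", "sat"]
softmax_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]


def next_token_probs(h):
    logits = [sum(w * x for w, x in zip(row, h)) for row in softmax_weights]
    return softmax(logits)


probs = next_token_probs(forward_states[0])

# After pre-training: drop the softmax and concatenate the two directions.
lm_embeddings = [f + b for f, b in zip(forward_states, backward_states)]
print(len(lm_embeddings), len(lm_embeddings[0]))  # 3 4
```

Only lm_embeddings is handed to the tagger; the softmax exists purely to give the LMs a training signal from unlabeled text.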

2.4. Combining LM with Sequence Model (Middle)

  • TagLM uses the LM embeddings as additional inputs to the sequence tagging model.
  • In particular, the LM embeddings hLM are concatenated with the output from one of the bidirectional RNN layers in the sequence model.
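
Written out, and taking the first RNN layer (i = 1) as the attachment point purely for illustration, the concatenation above reads as follows. The arrow notation for the forward and backward states is an assumption carried over from the usual bidirectional RNN convention.

```latex
h_{k,1} = \left[\overrightarrow{h}_{k,1};\ \overleftarrow{h}_{k,1};\ h_k^{LM}\right],
\qquad
h_k^{LM} = \left[\overrightarrow{h}_k^{LM};\ \overleftarrow{h}_k^{LM}\right]
```

The next RNN layer then takes this widened vector as its input, so the rest of the tagger is unchanged apart from the input dimension of that layer.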

3. Experimental Results

3.1. SOTA Comparison Without Additional Data

Test set F1 comparison on CoNLL 2003 NER task, using only CoNLL 2003 data and unlabeled text
Test set F1 comparison on CoNLL 2000 Chunking task using only CoNLL 2000 data and unlabeled text

3.2. SOTA Comparison With Additional Data

Improvements in test set F1 in CoNLL 2003 NER when including additional labeled data or task specific gazetteers
Improvements in test set F1 in CoNLL 2000 Chunking when including additional labeled data
  • TagLM outperforms previous state-of-the-art results in both tasks when external resources (labeled data or task-specific gazetteers) are available.
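
The F1 numbers reported for these tasks are computed over whole spans: a predicted entity or chunk counts as correct only if both its boundaries and its label match the gold annotation. A minimal sketch of that metric is below. It is not the official evaluation script, and the spans are invented.

```python
def span_f1(gold_spans, pred_spans):
    """Exact-match span F1 over (label, start, end) tuples."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted spans; the LOC prediction has a wrong end.
gold = [("PER", 0, 2), ("LOC", 3, 5)]
pred = [("PER", 0, 2), ("LOC", 3, 4)]
print(span_f1(gold, pred))  # 0.5
```

A boundary error counts against both precision and recall, which is why small F1 gains on these benchmarks are hard to come by.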


[2017 ACL] [TagLM]
Semi-Supervised Sequence Tagging with Bidirectional Language Models

