Review: Exploring the Limits of Language Modeling

CNN Input & CNN Softmax

In this story, Exploring the Limits of Language Modeling, by Google Brain, is briefly reviewed. Language modeling faces two key challenges: large corpora and vocabulary sizes, and the complex, long-term structure of language. In this paper:

  • A CNN Softmax loss is designed. It is based on character-level CNNs, is efficient to train, and is as precise as a full Softmax, which has orders of magnitude more parameters.
  • An exhaustive study is done on techniques such as character Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) on the One Billion Word Benchmark.

This is a 2016 arXiv paper with over 1000 citations. (Sik-Ho Tsang @ Medium)


  1. Full Softmax
  2. CNN Input & CNN Softmax
  3. Experimental Results

1. Full Softmax

Standard LSTM Language Model (LM)
  • The standard LSTM Language Model (LM) uses a full softmax to predict the next word. It needs to estimate the probability vector over the whole vocabulary V, which takes a lot of time:
$$p(w) = \frac{\exp(z_w)}{\sum_{w' \in V} \exp(z_{w'})}$$
  • where z_w is the logit corresponding to a word w.
  • Hierarchical softmax, NCE, or Importance Sampling can be used to approximate the full softmax.
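
To make the cost concrete, here is a minimal sketch of a full softmax in plain Python. The function name and the toy logits are hypothetical and only for illustration. The point is that the normalizing sum visits every word in the vocabulary.

```python
import math

def full_softmax(logits):
    """Turn one logit per vocabulary word into a probability per word."""
    # Subtracting the max logit avoids overflow and does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)  # this sum runs over all |V| words
    return [e / total for e in exps]

# Toy vocabulary of 4 words with made-up logits.
probs = full_softmax([2.0, 1.0, 0.1, -1.0])
print(probs)
```

With a vocabulary of hundreds of thousands of words, this sum has to be computed for every predicted position, which is why the approximations listed above are attractive.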

2. CNN Input & CNN Softmax

  • Character-level features allow for a smoother and more compact parametrization of the word embeddings.
  • For a character-level LM that consumes characters as inputs or as targets, each word is fed to the model as a sequence of character IDs.
  • The total number of distinct characters is very limited.
  • The full Softmax computes a logit z_w as:
$$z_w = h^T e_w$$
  • where h is a context vector and e_w is the word embedding.
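
As a rough sketch, the logits of a full Softmax can be computed as one dot product per vocabulary word. The helper name, the toy vectors, and the choice of 1024 as the embedding size (taken from the "1024" in the BIG LSTM name) are assumptions for illustration, not details confirmed by this review.

```python
def logits_from_embeddings(h, embedding_rows):
    """Compute one logit per word as the dot product of h with that word's embedding."""
    return [sum(hi * ei for hi, ei in zip(h, e_w)) for e_w in embedding_rows]

# Toy example: 3-dimensional context vector, 4-word vocabulary.
h = [0.5, -1.0, 2.0]
E = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
z = logits_from_embeddings(h, E)
print(z)  # [0.5, -1.0, 2.0, 1.5]

# Size of a full embedding matrix for the 1B word benchmark vocabulary,
# assuming (hypothetically) a 1024-dimensional embedding per word.
vocab_size = 793471
embedding_size = 1024
print(vocab_size * embedding_size)  # 812514304
```

Under that assumption, the word embedding matrix alone would hold roughly 0.8B parameters, which is the kind of cost a character-based parametrization avoids.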

2.1. Character CNN as Input and Output

LM where both input and Softmax embeddings have been replaced by a character CNN
  • Instead of building a matrix of |V|×|h| (whose rows correspond to e_w), CNN Softmax produces e_w with a CNN over the characters of w as:
$$e_w = \mathrm{CNN}(\mathit{chars}_w)$$
  • The input embedding is likewise replaced by a CNN Input:
CNN Input (Figure from 2016 AAAI Character-Aware Neural Language Models)
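
The idea of producing a word embedding from characters can be sketched as follows. This is a heavily simplified illustration, not the paper's architecture: it uses a single filter width, random untrained weights, and no further layers on top of the pooling. All names, sizes, and the boundary symbols "<" and ">" are assumptions.

```python
import math
import random

def char_cnn_embedding(word, char_emb, filters, width=3):
    """Embed a word: character embeddings -> 1D convolution -> tanh -> max over time."""
    # Add boundary symbols and pad so that short words still yield one window.
    chars = ["<"] + list(word) + [">"]
    while len(chars) < width:
        chars.append(">")
    vectors = [char_emb[c] for c in chars]
    features = []
    for f in filters:  # each filter holds `width` weight vectors, one per position
        responses = []
        for start in range(len(vectors) - width + 1):
            window = vectors[start:start + width]
            act = sum(w * x for fw, v in zip(f, window) for w, x in zip(fw, v))
            responses.append(math.tanh(act))
        features.append(max(responses))  # max-over-time pooling
    return features

rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz<>"
char_dim, num_filters = 4, 8
char_emb = {c: [rng.uniform(-1, 1) for _ in range(char_dim)] for c in alphabet}
filters = [[[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(3)]
           for _ in range(num_filters)]

e_cat = char_cnn_embedding("cat", char_emb, filters)
e_cats = char_cnn_embedding("cats", char_emb, filters)
print(len(e_cat), len(e_cats))  # 8 8
```

Note that the number of parameters here depends on the alphabet size and the number of filters, not on the vocabulary size, and that any word spelled with known characters gets an embedding.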

2.2. Character CNN as Input, Next character prediction LSTM as Output

LM that replaces the Softmax by a next character prediction LSTM network
  • In this LM, the word and character-level models are combined by feeding a word-level LSTM hidden state h into a small LSTM that predicts the target word one character at a time.
  • The standard LSTM model is trained until convergence, and then its weights are frozen. The standard word-level Softmax layer is then replaced with the aforementioned character-level LSTM.
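
Predicting a word one character at a time means the word probability is the product of the per-character probabilities, including an end-of-word symbol. The sketch below illustrates only that chain rule. A fixed lookup table stands in for the small character-level LSTM conditioned on h, and the names `word_log_prob`, `toy_next_char_probs`, and the `</w>` symbol are hypothetical.

```python
import math

def word_log_prob(word, next_char_probs, end="</w>"):
    """Sum the log-probabilities of each character of `word`, then of the end symbol."""
    total, prefix = 0.0, ""
    for c in list(word) + [end]:
        total += math.log(next_char_probs(prefix)[c])
        prefix += c
    return total

def toy_next_char_probs(prefix):
    # Hypothetical stand-in for the character LSTM: a next-character
    # distribution for each prefix generated so far.
    table = {
        "": {"a": 0.5, "b": 0.5},
        "a": {"t": 0.8, "</w>": 0.2},
        "at": {"</w>": 1.0},
        "b": {"</w>": 1.0},
    }
    return table[prefix]

p_at = math.exp(word_log_prob("at", toy_next_char_probs))
p_a = math.exp(word_log_prob("a", toy_next_char_probs))
p_b = math.exp(word_log_prob("b", toy_next_char_probs))
print(p_at, p_a, p_b)
```

In this toy table the three possible words "at", "a", and "b" receive probabilities that sum to one, without any softmax over a word vocabulary.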

3. Experimental Results

Best results of single models on the 1B word benchmark
  • The 1B word benchmark dataset contains about 0.8B words with a vocabulary of 793471 words.

2-LAYER LSTM-8192–1024 (BIG LSTM) obtains 30.6 Test Perplexity. This is a word-level LSTM model with 1.8B parameters.

BIG LSTM+CNN INPUTS obtains a slightly better Test Perplexity of 30.0 with only 1.04B parameters.

  • Using Char LSTM Predictions as in Section 2.2 gives worse performance.
  • (Please feel free to read other results if interested.)
  • Finally, 2-LAYER LSTM-8192–1024 (BIG LSTM) has been used or compared against in many later papers.
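
For readers less familiar with the metric, perplexity is the exponential of the average negative log-probability that the model assigns to each test word, so lower is better. A minimal sketch with made-up probabilities (the function name is hypothetical):

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability over the test words."""
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

# A model that assigns probability 1/30 to every test word has perplexity of about 30.
ppl = perplexity([1 / 30.0] * 5)
print(ppl)
```

Intuitively, a Test Perplexity of about 30 means the model is, on average, as uncertain as if it were choosing uniformly among about 30 words at each position.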


[2016 arXiv] [Jozefowicz arXiv’16]
Exploring the Limits of Language Modeling

Natural Language Processing (NLP)

Language/Sequence Model: 2007 [Bengio TNN’07] 2013 [Word2Vec] [NCE] [Negative Sampling] 2014 [GloVe] [GRU] [Doc2Vec] 2015 [Skip-Thought] 2016 [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] 2017 [TagLM]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder] 2015 [Attention Decoder/RNNSearch] 2016 [GNMT] [ByteNet] [Deep-ED & Deep-Att] 2017 [ConvS2S] [Transformer]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]
