Review: Character-Aware Neural Language Models

Character-Level Language Model Using CNN and Highway

(Figure from their presentation slides)
  • Instead of word-level language model, a CNN and a highway network are used over characters, whose output is given to a LSTM recurrent neural network language model (RNN-LM).
  • 60% fewer parameters is achieved.


  1. Notations Related to Language Model
  2. Proposed Model Architecture
  3. Experimental Results

1. Notations Related to Language Model

  • (I think this paper gives a very clear notations related to language model.)
  • Let V be the fixed size vocabulary of words.

1.1. Output

  • A recurrent neural network language model (RNN-LM) is used by applying an affine transformation to the hidden layer (ht) followed by a softmax:
  • where pj is the j-th column of P, which is referred to as the output embedding.

1.2. Input

  • For a conventional RNN-LM which usually takes words as inputs, if wt=k, then the input to the RNN-LM at t is the input embedding xk, the k-th column of the embedding matrix X.

1.3. Negative Log-Likelihood (NLL)

  • If we denote w1:T = [w1, …, wT] to be the sequence of words in the training corpus, training involves minimizing the negative log-likelihood (NLL) of the sequence:

1.4. Perplexity (PPL)

  • Perplexity (PPL) is used to evaluate the performance of our models. Perplexity of a model over a sequence [w1, …, wT] is given by:

2. Proposed Model Architecture

Architecture of the proposed language model applied to an example sentence

2.1. Input & CharCNN

  • The input at time t is an output from a character-level convolutional neural network (CharCNN).
  • Let C be the vocabulary of characters, d be the dimensionality of character embeddings.
  • Suppose that word k is made up of a sequence of characters [c1, …, cl], where l is the length of word k. Then the character-level representation of k is given by the matrix Ck.
  • CharCNN:
  • where <Ck, H> is the convolution with kernel H.
  • Then a bias is added and a nonlinearity tanh is applied to obtain a feature map fk.
  • CharCNN uses multiple filters of varying widths to obtain the feature vector for k..

2.2. Max Over Time

Max Over Time
CharCNN & Max Over Time (Figure from their presentation slides)
  • Finally, the max-over-time is used:
  • This is the feature corresponding to the filter H (when applied to word k). The idea is to capture the most important feature, which is the one with the highest value for a given filter.

2.3. Highway

  • In convention, one layer of an MLP can be applied:
  • where g is a nonlinearity.
  • In this paper, one layer of highway network is used:
  • where t is:
  • where t is transform gate, and (1-t) is carry gate.
  • (Please read Highway if interested.)

2.5. LSTM

  • Finally, LSTM is used with z as input, and output the next word.

2.4. Two Networks (LSTM-Char-Small and LSTM-Char-Large)

Architecture of the small and large model
  • One small model and one large model are designed, which are called LSTM-Char-Small and LSTM-Char-Large models respectively.
  • d = dimensionality of character embeddings; w = filter widths;
  • h = number of filter matrices, as a function of filter width (so the large model has filters of width [1; 2; 3; 4; 5; 6; 7] of size [50; 100; 150; 200; 200; 200; 200] for a total of 1100 filters);
  • f; g = nonlinearity functions; l = number of layers; m = number of hidden units.
  • Two datasets are used for training. Small one is DATA-L, big one is DATA-S.
  • Hierarchical Softmax is used for DATA-L, which is a common strategy for large dataset:
  • where V is randomly split into mutually exclusive and collectively exhaustive subsets V1, …, Vc. r is the cluster index.
  • The first term is simply the probability of picking cluster r, and the second term is the probability of picking word j given that cluster r is picked.

3. Experimental Results

3.1. English Penn Treebank

Performance of the proposed model versus other neural language models on the English Penn Treebank test set
  • The proposed small model significantly outperforms other NLMs of similar size.

3.2. Other Languages

Test set perplexities for DATA-S
Test set perplexities for DATA-L
  • Due to memory constraints, only the small models on DATA-L are trained.
  • Interestingly, no significant differences are observed going from word to morpheme LSTMs on Spanish, French, and English.
  • (Please feel free to read paper if interested in the ablation studies.)



PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store