Review: Character-Aware Neural Language Models

Character-Level Language Model Using CNN and Highway

(Figure from their presentation slides)
  • Comparable performance is achieved with 60% fewer parameters.


  1. Notations Related to Language Model
  2. Proposed Model Architecture
  3. Experimental Results

1. Notations Related to Language Model

  • (I think this paper gives very clear notation for language modeling.)
  • Let V be the fixed size vocabulary of words.

1.1. Output

  • A recurrent neural network language model (RNN-LM) is used by applying an affine transformation to the hidden layer (ht) followed by a softmax:
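The equation here was lost in extraction; it can be reconstructed from the standard RNN-LM formulation the bullet describes, where p^j is the j-th column of the output embedding matrix and q^j the j-th output bias:

```latex
\Pr(w_{t+1} = j \mid w_{1:t})
  = \frac{\exp\left(\mathbf{h}_t \cdot \mathbf{p}^{j} + q^{j}\right)}
         {\sum_{j' \in V} \exp\left(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'}\right)}
```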

1.2. Input

  • A conventional RNN-LM takes words as inputs: if wt = k, then the input to the RNN-LM at time t is the input embedding xk, the k-th column of the embedding matrix X.
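The word-lookup input can be sketched as a column selection from the embedding matrix; the sizes below are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes: |V| = 5 words, 4-dimensional embeddings.
vocab_size, embed_dim = 5, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((embed_dim, vocab_size))  # embedding matrix, one column per word

k = 2             # suppose w_t = k
x_k = X[:, k]     # input to the RNN-LM at time t: the k-th column of X
print(x_k.shape)  # (4,)
```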

1.3. Negative Log-Likelihood (NLL)

  • If we denote w1:T = [w1, …, wT] to be the sequence of words in the training corpus, training involves minimizing the negative log-likelihood (NLL) of the sequence:
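The NLL equation itself did not survive extraction; reconstructed from the definition in the bullet, it is the standard sequence objective:

```latex
NLL(w_{1:T}) = -\sum_{t=1}^{T} \log \Pr\left(w_t \mid w_{1:t-1}\right)
```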

1.4. Perplexity (PPL)

  • Perplexity (PPL) is used to evaluate the performance of the models. The perplexity of a model over a sequence [w1, …, wT] is given by:
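Since perplexity is PPL = exp(NLL / T), it can be computed directly from per-token probabilities; the probability values below are toy numbers for illustration:

```python
import numpy as np

# Toy per-token probabilities Pr(w_t | w_{1:t-1}) for a 4-token sequence.
probs = np.array([0.2, 0.5, 0.1, 0.4])
T = len(probs)

nll = -np.sum(np.log(probs))  # negative log-likelihood of the sequence
ppl = np.exp(nll / T)         # perplexity: exp(NLL / T)
print(round(ppl, 3))          # equals 250 ** 0.25, about 3.976
```

Equivalently, PPL is the geometric mean of the inverse token probabilities, which is why lower perplexity means the model assigns higher probability to the test sequence.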

2. Proposed Model Architecture

Architecture of the proposed language model applied to an example sentence

2.1. Input & CharCNN

  • Let C be the vocabulary of characters, d be the dimensionality of character embeddings.
  • Suppose that word k is made up of a sequence of characters [c1, …, cl], where l is the length of word k. Then the character-level representation of k is given by the matrix Ck of size d × l, whose i-th column is the character embedding of ci.
  • CharCNN: a narrow convolution is applied between Ck and a filter (kernel) H of width w.
  • Then a bias is added and a tanh nonlinearity is applied to obtain a feature map fk.
  • CharCNN uses multiple filters of varying widths to obtain the feature vector for k.
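The steps above can be sketched in NumPy. The dimensions and filter widths are illustrative, not the paper's hyperparameters; each filter H of width w slides over the character matrix Ck and yields a feature map of length l − w + 1:

```python
import numpy as np

rng = np.random.default_rng(0)

d, l = 4, 6                       # char-embedding dim, word length (hypothetical sizes)
Ck = rng.standard_normal((d, l))  # character-level representation of word k

def char_conv(Ck, H, b):
    """Narrow convolution of filter H (d x w) over Ck, plus bias and tanh."""
    d, l = Ck.shape
    w = H.shape[1]
    # f_k[i] = tanh(<Ck[:, i:i+w], H> + b) for each window position i
    return np.array([np.tanh(np.sum(Ck[:, i:i + w] * H) + b)
                     for i in range(l - w + 1)])

# Multiple filters of varying widths; one feature map per filter.
for w in (2, 3, 4):
    H = rng.standard_normal((d, w))
    fk = char_conv(Ck, H, b=0.1)
    print(w, fk.shape)  # feature map of length l - w + 1
```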

2.2. Max Over Time

CharCNN & Max Over Time (Figure from their presentation slides)
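Max-over-time pooling keeps only the strongest response of each filter, so every word maps to a fixed-length vector regardless of its length. A minimal sketch with made-up feature-map values:

```python
import numpy as np

# Hypothetical feature maps from three filters of different widths (lengths l - w + 1).
feature_maps = [np.array([0.3, -0.1, 0.8, 0.2]),
                np.array([-0.4, 0.9, 0.1]),
                np.array([0.5, 0.7])]

# Max-over-time pooling: one scalar per filter, concatenated into the
# word's feature vector y_k.
y_k = np.array([fm.max() for fm in feature_maps])
print(y_k)  # [0.8 0.9 0.7]
```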

2.3. Highway

  • In this paper, one layer of highway network is used:
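The highway layer computes z = t ⊙ g(W_H y + b_H) + (1 − t) ⊙ y, where t = σ(W_T y + b_T) is the transform gate and (1 − t) the carry gate. A minimal NumPy sketch (random parameters, dimensionality chosen for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(y, W_H, b_H, W_T, b_T, g=np.tanh):
    """One highway layer: z = t * g(W_H y + b_H) + (1 - t) * y,
    with transform gate t = sigmoid(W_T y + b_T)."""
    t = sigmoid(W_T @ y + b_T)
    return t * g(W_H @ y + b_H) + (1.0 - t) * y

n = 5  # input and output sizes must match so the carry path can add y directly
rng = np.random.default_rng(0)
y = rng.standard_normal(n)
z = highway(y, rng.standard_normal((n, n)), np.zeros(n),
            rng.standard_normal((n, n)), np.full(n, -2.0))  # negative b_T favors carry
print(z.shape)  # (5,)
```

The carry path is what lets the network pass some dimensions of the CharCNN output through unchanged while transforming others.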

2.4. LSTM

  • The output of the highway layer is fed into a multi-layer LSTM, whose hidden state ht is used to predict the next word as in Section 1.1.

2.5. Two Networks (LSTM-Char-Small and LSTM-Char-Large)

Architecture of the small and large model
  • d = dimensionality of character embeddings; w = filter widths;
  • h = number of filter matrices, as a function of filter width (so the large model has filters of width [1; 2; 3; 4; 5; 6; 7] of size [50; 100; 150; 200; 200; 200; 200] for a total of 1100 filters);
  • f, g = nonlinearity functions; l = number of layers; m = number of hidden units.
  • Two datasets are used for training: the small one is DATA-S, the large one is DATA-L.
  • Hierarchical softmax is used for DATA-L, a common strategy when the vocabulary is large:
  • The first term is simply the probability of picking cluster r, and the second term is the probability of picking word j given that cluster r is picked.
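The two-factor decomposition can be sketched as two softmaxes, one over clusters and one over the words inside the chosen cluster; all sizes and parameters below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n_clusters = 8, 3
cluster_sizes = [4, 5, 3]  # words per cluster (hypothetical partition of V)
h_t = rng.standard_normal(m)

# Cluster-level and within-cluster output embeddings (random, for illustration).
P_cluster = rng.standard_normal((n_clusters, m))
q_cluster = rng.standard_normal(n_clusters)
P_word = [rng.standard_normal((sz, m)) for sz in cluster_sizes]
q_word = [rng.standard_normal(sz) for sz in cluster_sizes]

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

r, j = 1, 2  # target word j inside cluster r
p_cluster = softmax(P_cluster @ h_t + q_cluster)[r]  # first term: Pr(cluster r)
p_word = softmax(P_word[r] @ h_t + q_word[r])[j]     # second term: Pr(word j | cluster r)
prob = p_cluster * p_word
print(0.0 < prob < 1.0)  # True
```

Each softmax is taken over at most a cluster's worth of words rather than the full vocabulary, which is the source of the speedup.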

3. Experimental Results

3.1. English Penn Treebank

Performance of the proposed model versus other neural language models on the English Penn Treebank test set

3.2. Other Languages

Test set perplexities for DATA-S
Test set perplexities for DATA-L
  • Interestingly, no significant differences are observed going from word to morpheme LSTMs on Spanish, French, and English.


