# Review: Character-Aware Neural Language Models

## Character-Level Language Model Using CNN and Highway

In this story, **Character-Aware Neural Language Models** (LSTM-Char-CNN), by Harvard University and New York University, is reviewed. In this paper:

- Instead of a word-level language model, **a CNN and a highway network are used over characters**, and their output is given to an LSTM recurrent neural network language model (RNN-LM). The resulting model has about **60% fewer parameters** than the existing state of the art while performing on par with it.

This is a paper in **2016 AAAI** with over **1700 citations**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Notations Related to Language Model**
2. **Proposed Model Architecture**
3. **Experimental Results**

# 1. Notations Related to Language Model

- (I think this paper gives very clear notation for language models.)
- Let $V$ be the **fixed-size vocabulary of words**.

A language model specifies a distribution over $w_{t+1}$ (whose support is $V$) given the historical sequence $w_{1:t} = [w_1, \dots, w_t]$.

## 1.1. Output

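- A recurrent neural network language model (RNN-LM) is obtained by applying an **affine transformation to the hidden layer** $h_t$, followed by a **softmax**:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp(h_t \cdot p^j + q^j)}{\sum_{j' \in V} \exp(h_t \cdot p^{j'} + q^{j'})}$$

- where $p^j$ is the $j$-th column of $P$, which is referred to as the **output embedding**, and $q^j$ is a bias term.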
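As a rough illustration of this output layer, here is a minimal pure-Python sketch of a softmax over the affine scores of the hidden state. The function name `next_word_distribution`, the bias vector `q` (implied by the affine transformation), and all toy numbers are assumptions made for this example; they are not taken from the paper's code.

```python
import math

def next_word_distribution(h, P, q):
    """Softmax over the affine scores h . p^j + q^j for every word j.

    h: hidden state, list of m floats
    P: output embeddings, m rows by |V| columns (column j is p^j)
    q: bias terms, list of |V| floats
    """
    vocab_size = len(q)
    scores = [
        sum(h[i] * P[i][j] for i in range(len(h))) + q[j]
        for j in range(vocab_size)
    ]
    # Subtract the max score for numerical stability; the result is unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example (hypothetical values): m = 2 hidden units, |V| = 3 words.
h = [1.0, -1.0]
P = [[0.5, 0.0, -0.5],
     [0.0, 0.5, 0.5]]
q = [0.0, 0.0, 0.0]
probs = next_word_distribution(h, P, q)
```

The returned list is a proper distribution over the vocabulary, and the word whose output embedding aligns best with the hidden state gets the highest probability.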

## 1.2. Input

- For a conventional RNN-LM, which usually takes words as inputs, if $w_t = k$, then the input to the RNN-LM at time $t$ is the **input embedding** $x^k$, the $k$-th column of the embedding matrix $X$.

## 1.3. Negative Log-Likelihood (NLL)

- If we denote $w_{1:T} = [w_1, \dots, w_T]$ to be the sequence of words in the training corpus, **training** involves **minimizing the negative log-likelihood (NLL) of the sequence**:

$$NLL = -\sum_{t=1}^{T} \log \Pr(w_t \mid w_{1:t-1})$$

## 1.4. Perplexity (PPL)

**Perplexity (PPL)** is used to evaluate the performance of the models. The perplexity of a model over a sequence $[w_1, \dots, w_T]$ is given by:

$$PPL = \exp\left(\frac{NLL}{T}\right)$$
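To make the relation between the two metrics concrete, here is a small sketch assuming the standard definitions: NLL is the negative sum of log-probabilities of the observed words, and PPL is exp(NLL / T). The function names and the toy probabilities are illustrative assumptions.

```python
import math

def negative_log_likelihood(word_probs):
    """NLL = -sum_t log Pr(w_t | w_1:t-1)."""
    return -sum(math.log(p) for p in word_probs)

def perplexity(word_probs):
    """PPL = exp(NLL / T), where T is the sequence length."""
    return math.exp(negative_log_likelihood(word_probs) / len(word_probs))

# Hypothetical probabilities a model assigned to each of T = 4 observed words.
word_probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity(word_probs)
```

A model that assigns probability 0.25 to every observed word has a perplexity of 4 (up to floating-point rounding), which matches the intuition that it is as uncertain as a uniform choice among 4 words. Lower perplexity is better.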

The proposed model simply replaces the input embeddings $X$ with the output from a character-level CNN, to be described below.

# 2. Proposed Model Architecture

## 2.1. Input & CharCNN

- The input at time $t$ is an output from a **character-level convolutional neural network (CharCNN)**.
- Let $C$ be the **vocabulary of characters** and $d$ be the **dimensionality of character embeddings**.
- Suppose that **word** $k$ is made up of **a sequence of characters** $[c_1, \dots, c_l]$, where $l$ is the length of word $k$. Then the **character-level representation** of $k$ is given by the **matrix** $C^k \in \mathbb{R}^{d \times l}$, whose $j$-th column is the embedding of character $c_j$.
- **CharCNN**:

$$f^k[i] = \tanh(\langle C^k[*, i:i+w-1], H \rangle + b)$$

- where $\langle C^k[*, i:i+w-1], H \rangle$ is the **convolution** of the $i$-th to $(i+w-1)$-th columns of $C^k$ with a **kernel** $H \in \mathbb{R}^{d \times w}$ of width $w$.
- Then **a bias** $b$ is added and **a nonlinearity tanh** is applied to **obtain a feature map** $f^k$.
- CharCNN uses **multiple filters of varying widths** to obtain the feature vector for $k$.
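The character convolution can be sketched in a few lines of pure Python. This is a minimal illustration for a single filter, assuming the convolution is a Frobenius inner product between each window of character embeddings and the kernel, followed by a bias and tanh. The embedding table `char_embeddings`, the word "cats", and all numeric values are made up for the example.

```python
import math

def feature_map(Ck, H, b):
    """f[i] = tanh(<window i of Ck, H> + b) for every window position i.

    Ck: d x l matrix, column j is the embedding of the j-th character
    H:  d x w filter (kernel) of width w
    b:  scalar bias
    """
    d, l, w = len(Ck), len(Ck[0]), len(H[0])
    out = []
    for i in range(l - w + 1):
        # Frobenius inner product between the window and the filter.
        inner = sum(Ck[r][i + c] * H[r][c] for r in range(d) for c in range(w))
        out.append(math.tanh(inner + b))
    return out

# Hypothetical character embeddings with d = 2 dimensions.
char_embeddings = {
    "c": [1.0, 0.0],
    "a": [0.0, 1.0],
    "t": [-1.0, 0.5],
    "s": [0.5, -1.0],
}
word = "cats"
d = 2
# Column j of Ck is the embedding of the j-th character of the word.
Ck = [[char_embeddings[ch][r] for ch in word] for r in range(d)]

H = [[1.0, -1.0],
     [0.5, 0.5]]  # one filter of width w = 2
f = feature_map(Ck, H, b=0.0)
```

For a word of length l and a filter of width w, the feature map has l - w + 1 entries, one per window position (3 entries here).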

## 2.2. Max Over Time

- Finally, **max-over-time** pooling is used:

$$y^k = \max_i f^k[i]$$

- This is the feature corresponding to the filter $H$ (when applied to word $k$). The idea is to **capture the most important feature**, which is the one with the highest value **for a given filter**.
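A tiny sketch of the pooling step, assuming one feature map per filter has already been computed: taking the maximum of each map gives one number per filter, so the word representation has a fixed length no matter how long the word is. The function name and the feature-map values are hypothetical.

```python
def max_over_time(feature_maps):
    """Return one max-pooled feature per filter."""
    return [max(f) for f in feature_maps]

# Hypothetical feature maps for one 4-character word from three filters.
# A filter of width w yields 4 - w + 1 positions.
feature_maps = [
    [0.1, -0.3, 0.7, 0.2],  # width 1
    [0.4, 0.9, -0.5],       # width 2
    [-0.2, -0.6],           # width 3
]
y = max_over_time(feature_maps)
```

With many filters of varying widths, stacking these per-filter maxima gives the feature vector for the word.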

## 2.3. Highway

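- Conventionally, one layer of an MLP could be applied to $y$ (the CharCNN output for the word):

$$z = g(Wy + b)$$

- where $g$ is a nonlinearity.
- In this paper, **one layer of highway network** is used:

$$z = t \odot g(W_H y + b_H) + (1 - t) \odot y$$

- where $t$ is:

$$t = \sigma(W_T y + b_T)$$

- where $t$ is called the **transform gate**, and $(1 - t)$ is called the **carry gate**.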

Similar to the memory cells in LSTM networks, highway layers allow for training of deep networks by adaptively carrying some dimensions of the input directly to the output.

- (Please read Highway if interested.)
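The gating behaviour of a highway layer can be sketched as below. This is a minimal pure-Python illustration, assuming a sigmoid transform gate and square weight matrices so that the input and output have the same size. The helper names, the choice of tanh for the nonlinearity `g`, and the toy weights are assumptions for this example only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, y, b):
    """Return W y + b for a matrix W (list of rows), vector y, and bias b."""
    return [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) + b_i
            for row, b_i in zip(W, b)]

def highway(y, W_H, b_H, W_T, b_T, g=math.tanh):
    """z = t * g(W_H y + b_H) + (1 - t) * y, with t = sigmoid(W_T y + b_T)."""
    t = [sigmoid(a) for a in affine(W_T, y, b_T)]  # transform gate
    h = [g(a) for a in affine(W_H, y, b_H)]        # transformed input
    # (1 - t) is the carry gate: it passes the input through unchanged.
    return [t_i * h_i + (1.0 - t_i) * y_i for t_i, h_i, y_i in zip(t, h, y)]

# Toy setup (hypothetical values).
y = [0.5, -1.0]
W_H = [[1.0, 0.0], [0.0, 1.0]]
b_H = [0.0, 0.0]
W_T = [[0.0, 0.0], [0.0, 0.0]]

# Very negative gate bias: t is close to 0, so the input is carried through.
carried = highway(y, W_H, b_H, W_T, [-20.0, -20.0])
# Very positive gate bias: t is close to 1, so the layer acts like an MLP layer.
transformed = highway(y, W_H, b_H, W_T, [20.0, 20.0])
```

The two extreme settings show the idea: when the transform gate is closed the output is almost exactly the input, and when it is open the layer behaves like a plain nonlinear layer. In practice the gate is learned and mixes both per dimension.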

## 2.4. LSTM

- Finally, an LSTM is used with $z$ as input to predict the next word.

## 2.5. Two Networks (LSTM-Char-Small and LSTM-Char-Large)

- One small model and one large model are designed, which are called the **LSTM-Char-Small** and **LSTM-Char-Large** models respectively.
- Hyperparameter notation: $d$ = dimensionality of character embeddings; $w$ = filter widths; $h$ = number of filter matrices, as a function of filter width (so the large model has filters of width [1; 2; 3; 4; 5; 6; 7] of size [50; 100; 150; 200; 200; 200; 200], for a total of 1100 filters); $f$, $g$ = nonlinearity functions; $l$ = number of layers; $m$ = number of hidden units.
- Two datasets are used for training. The small one is DATA-S and the big one is DATA-L.
- **Hierarchical softmax** is used for DATA-L, which is a common strategy for large datasets:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp(h_t \cdot s^r + t^r)}{\sum_{r'=1}^{c} \exp(h_t \cdot s^{r'} + t^{r'})} \times \frac{\exp(h_t \cdot p_r^j + q_r^j)}{\sum_{j' \in V_r} \exp(h_t \cdot p_r^{j'} + q_r^{j'})}$$

- where $V$ is randomly split into mutually exclusive and collectively exhaustive subsets $V_1, \dots, V_c$, and $r$ is the cluster index such that $j \in V_r$.
- The **first term** is simply **the probability of picking cluster** $r$, and the **second term** is **the probability of picking word** $j$ **given that cluster** $r$ **is picked**.
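The two-term factorization can be sketched as follows: one softmax over clusters, then one softmax over only the words inside the chosen cluster. This is a simplified illustration, not the paper's implementation. The function names, the cluster embeddings `S` with biases `t`, the word embeddings `P` with biases `q` (indexed by word id here for brevity), and all numbers are assumptions.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_prob(h, j, clusters, S, t, P, q):
    """Pr(word j) = Pr(cluster r) * Pr(word j | cluster r), where j is in V_r.

    clusters: list of word-id lists V_1..V_c (a partition of the vocabulary)
    S, t:     per-cluster embedding vectors and biases
    P, q:     per-word output embeddings and biases, indexed by word id
    """
    # Find the cluster index r that contains word j.
    r = next(i for i, members in enumerate(clusters) if j in members)
    # First term: softmax over the c clusters.
    cluster_probs = softmax([dot(h, S[i]) + t[i] for i in range(len(clusters))])
    # Second term: softmax over the words of cluster r only.
    members = clusters[r]
    word_probs = softmax([dot(h, P[w]) + q[w] for w in members])
    return cluster_probs[r] * word_probs[members.index(j)]

# Toy vocabulary of 4 words split into c = 2 clusters (hypothetical values).
h = [1.0, 0.5]
clusters = [[0, 1], [2, 3]]
S = [[0.2, 0.1], [-0.3, 0.4]]
t = [0.0, 0.1]
P = [[0.5, 0.0], [0.0, 0.5], [-0.5, 0.5], [0.3, 0.3]]
q = [0.0, 0.0, 0.0, 0.0]
total = sum(hierarchical_prob(h, j, clusters, S, t, P, q) for j in range(4))
```

Because the clusters partition the vocabulary, the probabilities over all words still sum to 1, while each prediction only normalizes over the clusters plus the words of one cluster instead of the full vocabulary.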

# 3. Experimental Results

## 3.1. English Penn Treebank

The proposed large model is on par with the existing state-of-the-art (Zaremba et al. 2014), despite having approximately 60% fewer parameters.

- The proposed small model significantly outperforms other NLMs of similar size.

## 3.2. Other Languages

The character-level models outperform their word-level counterparts.

- Due to memory constraints, only the small models on DATA-L are trained.
- Interestingly, no significant differences are observed going from word to morpheme LSTMs on Spanish, French, and English.

The character models again outperform the word/morpheme models.

- (Please feel free to read the paper if you are interested in the ablation studies.)

## References

[2016 AAAI] [LSTM-Char-CNN]

Character-Aware Neural Language Models

Their Slides: https://people.csail.mit.edu/dsontag/papers/kim_etal_AAAI16_slides.pdf

## Natural Language Processing (NLP)

- **Language/Sequence Model:** **2007** [Bengio TNN’07] **2013** [Word2Vec] [NCE] [Negative Sampling] **2014** [GloVe] [GRU] [Doc2Vec] **2015** [Skip-Thought] **2016** [GCNN/GLU] [context2vec] [Jozefowicz arXiv’16] [LSTM-Char-CNN] **2017** [TagLM]
- **Machine Translation:** **2014** [Seq2Seq] [RNN Encoder-Decoder] **2015** [Attention Decoder/RNNSearch] **2016** [GNMT] [ByteNet] [Deep-ED & Deep-Att] **2017** [ConvS2S] [Transformer]
- **Image Captioning:** **2015** [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC] [Show, Attend and Tell]