Review — Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Feedforward Neural Network for Word Prediction

Oct 19, 2021

This story reviews Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model, by Université de Montréal. This is a paper by Prof. Yoshua Bengio. In this paper:

A feedforward neural network is trained to approximate probabilities over sequences of words.
Adaptive importance sampling is designed to accelerate the training.

This paper was published in 2007 TNN and has over 200 citations. TNN became TNNLS in 2011, and TNNLS has a high impact factor of 10.451. (Sik-Ho Tsang @ Medium) Although the paper mainly targets next-word prediction, word prediction is the foundation for building a language model.

Outline

Neural Language Model Architecture
Adaptive Importance Sampling
Experimental Results

1. Neural Language Model Architecture

The network looks simple compared with current SOTA approaches, yet it was impressive at the time.

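The context words $w_{t-1}$ to $w_{t-n+1}$ are mapped to feature vectors $z_i$ using the shared weight matrix $C$: $z_i = C_{w_{t-i}}$ for $i = 1, \dots, n-1$.
The next word to be predicted, $w_t$, is mapped to $z_0$ by a separate matrix $D$: $z_0 = D_{w_t}$.
Then, a hidden layer with weights $W$, bias $d$, and tanh activation transforms the concatenation $z = (z_0, z_1, \dots, z_{n-1})$ to $a$:

$$a = \tanh(d + W z)$$

Finally, the output is a scalar energy function:

$$E(w_t, h_t) = b_{w_t} + V_{w_t} \cdot a$$

where $b_{w_t}$ is a bias, $V_{w_t}$ is the weight vector from the hidden layer to the output for word $w_t$, and $h_t = (w_{t-1}, \dots, w_{t-n+1})$ denotes the context. To obtain the probability:

$$P(w_t \mid h_t) = \frac{e^{-E(w_t, h_t)}}{Z(h_t)}$$

where:

$$Z(h_t) = \sum_{w'} e^{-E(w', h_t)}$$

i.e. the softmax operation over the negated energies of all vocabulary words.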
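
As a rough illustration of the forward computation described above, the following pure-Python sketch computes the energy of every candidate word and normalizes with a softmax. The parameter shapes, the random initialization, the helper names (`energy`, `next_word_probs`, `init_params`), and the toy sizes are assumptions made for this example, not values from the paper.

```python
import math
import random

def energy(word, context, C, D, W, d, V, b):
    """Scalar energy b[word] + V[word] . a for one candidate next word."""
    # z concatenates the candidate's features (from D) with the
    # context words' features (from the shared matrix C).
    z = list(D[word])
    for w in context:
        z.extend(C[w])
    # Hidden layer: a = tanh(d + W z)
    a = [math.tanh(d_k + sum(w_kj * z_j for w_kj, z_j in zip(row, z)))
         for row, d_k in zip(W, d)]
    return b[word] + sum(v * a_k for v, a_k in zip(V[word], a))

def next_word_probs(context, vocab_size, params):
    """Softmax over negated energies: P(w | h) = exp(-E(w, h)) / Z(h)."""
    energies = [energy(w, context, *params) for w in range(vocab_size)]
    m = min(energies)  # shift energies for numerical stability
    unnorm = [math.exp(-(e - m)) for e in energies]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

def init_params(vocab_size, emb_dim, n_context, hidden, seed=0):
    """Hypothetical small random initialization, for illustration only."""
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
    C = mat(vocab_size, emb_dim)                # shared context embeddings
    D = mat(vocab_size, emb_dim)                # next-word embeddings
    W = mat(hidden, emb_dim * (n_context + 1))  # hidden-layer weights
    d = [0.0] * hidden                          # hidden-layer bias
    V = mat(vocab_size, hidden)                 # hidden-to-output weights
    b = [0.0] * vocab_size                      # output bias
    return C, D, W, d, V, b

params = init_params(vocab_size=6, emb_dim=3, n_context=2, hidden=4)
probs = next_word_probs(context=[1, 4], vocab_size=6, params=params)
```

Note that the loop over all candidate words in `next_word_probs` is exactly the cost that grows with the vocabulary size, which is what the sampling scheme in the next section tries to avoid.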

2. Adaptive Importance Sampling

2.1. Classical Monte Carlo

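Writing $\theta$ for the network parameters and $h_t$ for the context words, the softmax makes the gradient of the log-likelihood split into a term for the observed word and an expectation over the whole vocabulary:

$$\frac{\partial \log P(w_t \mid h_t)}{\partial \theta} = -\frac{\partial E(w_t, h_t)}{\partial \theta} + \sum_{w'} P(w' \mid h_t)\, \frac{\partial E(w', h_t)}{\partial \theta}$$

Classical Monte Carlo estimates the second term by averaging over $N$ words $w'_1, \dots, w'_N$ sampled from $P(\cdot \mid h_t)$:

$$\sum_{w'} P(w' \mid h_t)\, \frac{\partial E(w', h_t)}{\partial \theta} \approx \frac{1}{N} \sum_{j=1}^{N} \frac{\partial E(w'_j, h_t)}{\partial \theta}$$

Sampling from $P$ itself, however, requires the probabilities of all vocabulary words, which is the very cost to be avoided.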
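
A toy sketch of classical Monte Carlo may help here: it estimates an expectation under P by averaging over words sampled from P, and compares the result with the exact sum. The energies and the per-word values `g` (a scalar stand-in for the gradient of the energy) are random placeholders assumed only for illustration.

```python
import math
import random

def softmax_neg(energies):
    """P(w) proportional to exp(-E(w)), computed over the full vocabulary."""
    m = min(energies)
    unnorm = [math.exp(-(e - m)) for e in energies]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]

def exact_expectation(probs, g):
    """Exact sum over the vocabulary of P(w) * g(w)."""
    return sum(p * g_w for p, g_w in zip(probs, g))

def classical_mc(probs, g, n_samples, rng):
    """Average g over words drawn from P; P must already be fully computed."""
    words = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return sum(g[w] for w in words) / n_samples

rng = random.Random(0)
energies = [rng.uniform(0.0, 3.0) for _ in range(50)]  # toy energies
g = [rng.uniform(-1.0, 1.0) for _ in range(50)]        # toy scalar "gradients"
probs = softmax_neg(energies)
exact = exact_expectation(probs, g)
estimate = classical_mc(probs, g, n_samples=20000, rng=rng)
```

The estimate approaches the exact value as the number of samples grows, but `softmax_neg` still touches every word, so nothing is saved in this form.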

2.2. Biased Importance Sampling

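In this paper, biased importance sampling is used instead. $N$ words $w'_1, \dots, w'_N$ are drawn from a cheaper proposal distribution $Q$ (an adaptive n-gram model in the paper), and the expectation over the vocabulary is estimated with a self-normalized weighted average:

$$\frac{1}{W} \sum_{j=1}^{N} r_j\, \frac{\partial E(w'_j, h_t)}{\partial \theta}, \qquad r_j = \frac{e^{-E(w'_j, h_t)}}{Q(w'_j \mid h_t)}, \qquad W = \sum_{j=1}^{N} r_j$$

where $E(w, h_t)$ is the energy of word $w$ in context $h_t$ and $\theta$ denotes the parameters. The target distribution only needs to be known up to a multiplicative constant: the partition function cancels in the ratio $r_j / W$, so it is never computed.
Thus, the gradient used for the update is a weighted sum over the sampled words only:

$$-\frac{\partial E(w_t, h_t)}{\partial \theta} + \frac{1}{W} \sum_{j=1}^{N} r_j\, \frac{\partial E(w'_j, h_t)}{\partial \theta}$$

which is then applied in a way similar to the weight update procedure used nowadays.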
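
To make the idea concrete, here is a minimal sketch of a self-normalized (biased) importance sampling estimate on a toy vocabulary. The uniform proposal `q` is a hypothetical stand-in for the paper's proposal distribution, and the energies and per-word values `g` are random placeholders; only the structure of the estimator is the point.

```python
import math
import random

def biased_is_estimate(energies, g, q, n_samples, rng):
    """Self-normalized importance sampling estimate of sum_w P(w) g(w),
    where P(w) is proportional to exp(-energies[w]) and q is the proposal."""
    words = rng.choices(range(len(q)), weights=q, k=n_samples)
    # Unnormalized importance weights exp(-E) / Q: no partition function needed.
    r = [math.exp(-energies[w]) / q[w] for w in words]
    W = sum(r)
    return sum(r_j * g[w] for r_j, w in zip(r, words)) / W

rng = random.Random(0)
vocab_size = 50
energies = [rng.uniform(0.0, 3.0) for _ in range(vocab_size)]  # toy energies
g = [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]        # toy scalar "gradients"
q = [1.0 / vocab_size] * vocab_size  # hypothetical proposal (uniform)

# Exact value for comparison, feasible only because the toy vocabulary is tiny.
Z = sum(math.exp(-e) for e in energies)
exact = sum(math.exp(-e) / Z * g_w for e, g_w in zip(energies, g))
estimate = biased_is_estimate(energies, g, q, n_samples=20000, rng=rng)
```

Only the sampled words need an energy evaluation here. The estimator is biased for finite sample sizes, but the bias shrinks as more words are sampled.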

2.3. Effective Sample Size (ESS)

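The effective sample size (ESS) plays a role similar to the minibatch size nowadays, but it adapts to the importance weights $r_j = e^{-E(w'_j, h_t)} / Q(w'_j \mid h_t)$ of the $N$ sampled words:

$$S = \frac{\left(\sum_{j=1}^{N} r_j\right)^2}{\sum_{j=1}^{N} r_j^2}$$

$S$ equals $N$ when all weights are equal and approaches 1 when a single weight dominates, so a low $S$ signals that more samples are needed.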
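
The effective sample size of a set of importance weights is commonly computed as the squared sum of the weights divided by the sum of their squares. The sketch below assumes that standard definition. The block-wise loop `sample_until_ess` and its parameters are a hypothetical illustration of how such a criterion could drive the number of samples, not the paper's exact procedure.

```python
import math
import random

def effective_sample_size(weights):
    """ESS = (sum r)^2 / sum r^2 for positive importance weights r."""
    total = sum(weights)
    return total * total / sum(r * r for r in weights)

def sample_until_ess(draw_weight, target_ess, block_size, max_samples):
    """Hypothetical helper: draw weights block by block until ESS >= target."""
    weights = []
    while len(weights) < max_samples:
        weights.extend(draw_weight() for _ in range(block_size))
        if effective_sample_size(weights) >= target_ess:
            break
    return weights

# Equal weights give ESS equal to the number of samples.
ess_equal = effective_sample_size([0.5] * 8)
# One dominant weight gives ESS close to 1.
ess_peaked = effective_sample_size([1.0] + [1e-9] * 7)

rng = random.Random(0)
weights = sample_until_ess(lambda: math.exp(-rng.uniform(0.0, 3.0)),
                           target_ess=50, block_size=10, max_samples=1000)
```

Because the ESS depends only on ratios of weights, it is unchanged when all weights are multiplied by the same constant, which fits the unnormalized weights used in importance sampling.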

3. Experimental Results

The Brown corpus is used.
It consists of 1,181,041 words from various American English documents.
The corpus was divided into training (800,000 words), validation (200,000 words), and test (the remaining roughly 180,000 words) sets.
The vocabulary was truncated by mapping all “rare” words (words that appear three times or less in the corpus) into a single special word.
The resulting vocabulary contains 14,847 words.
A simple interpolated trigram, serving as baseline, achieves a perplexity of 253.8 on the test set.
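
The rare-word mapping described above can be sketched in a few lines. The token name `<rare>`, the function name, and the tiny example sentence are assumptions for illustration; the paper's threshold (three occurrences or fewer) is the default here.

```python
from collections import Counter

def truncate_vocabulary(tokens, max_rare_count=3, rare_token="<rare>"):
    """Replace every word seen max_rare_count times or fewer by one special token."""
    counts = Counter(tokens)
    return [t if counts[t] > max_rare_count else rare_token for t in tokens]

tokens = "the cat sat on the mat the cat saw the dog the cat ran".split()
# A lower threshold is used here only because the toy corpus is tiny.
mapped = truncate_vocabulary(tokens, max_rare_count=2)
vocabulary = sorted(set(mapped))
```

In this toy corpus only "the" and "cat" occur more than twice, so every other word collapses into the single special token.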

**Training error with respect to number of epochs**

**Validation and test errors with respect to CPU time**

The figure shows that the convergence of both networks is similar. The same holds for validation and test errors.
The network trained by sampling converges to an even lower perplexity than the ordinary one (trained with the exact gradient).

After 9 epochs (26h), its perplexity over the test set is equivalent to that of the one trained with exact gradient at its overfitting point (18 epochs, 113 days).

Surprisingly, if the sampling-trained model is left to converge, it starts to overfit at epoch 18, as with classical training, but with a lower test perplexity of 196.6, a 3.8% improvement.
Total improvement in test perplexity with respect to the trigram baseline is 29%.
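
The comparison with the trigram baseline can be checked from the two test perplexities quoted above. Which convention the 29% figure follows is an assumption here; the sketch simply computes two common ways of expressing the gap.

```python
baseline_ppl = 253.8  # interpolated trigram, test set
sampled_ppl = 196.6   # sampling-trained network, test set

# Relative reduction of perplexity with respect to the baseline.
relative_reduction = (baseline_ppl - sampled_ppl) / baseline_ppl

# How much higher the baseline perplexity is than the network's.
ratio_minus_one = baseline_ppl / sampled_ppl - 1.0

summary = {
    "relative_reduction": round(relative_reduction, 3),
    "ratio_minus_one": round(ratio_minus_one, 3),
}
```

The ratio-based figure comes out at about 29%, while the relative reduction is about 22.5%, so the quoted number appears to match the ratio of the two perplexities.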