Review — Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Feedforward Neural Network for Word Prediction

  • A feedforward neural network is trained to approximate probabilities over sequences of words.
  • Adaptive importance sampling is proposed to accelerate the training.

Outline

  1. Neural Language Model Architecture
  2. Adaptive Importance Sampling
  3. Experimental Results

1. Neural Language Model Architecture

  • Compared with today's SOTA approaches, the network may seem simple, yet it was remarkable at the time.
  • The context words wt-1 to wt-n+1 are transformed into vectors zi using a shared weight (embedding) matrix C.
  • The next word to be predicted, wt, is transformed into z0 using a separate matrix D.
  • A hidden layer with weights W and bias d, followed by a tanh activation, then transforms the concatenated z into a.
  • Finally, the output is a scalar energy function, where bwt is the output bias and Vwt is the weight vector from the hidden layer to the output for word wt.
  • The probability of the next word is obtained by normalizing e^{-E} over the vocabulary, i.e. the softmax operation (see the equations sketched after this list).
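
The corresponding equations, reconstructed from the description above (sign conventions and exact symbols may differ slightly from the paper), are roughly:

  z_i = C_{w_{t-i}}, \quad i = 1, \dots, n-1
  z_0 = D_{w_t}
  a = \tanh\big(d + W z\big), \quad z = (z_0, z_1, \dots, z_{n-1})
  E(w_t, h_t) = -b_{w_t} - V_{w_t} \cdot a, \quad h_t = (w_{t-1}, \dots, w_{t-n+1})
  P(w_t \mid h_t) = \frac{e^{-E(w_t, h_t)}}{\sum_{w'} e^{-E(w', h_t)}}

The last line is the softmax over energies; the denominator (the partition function) is what makes training expensive, since it sums over the whole vocabulary.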

2. Adaptive Importance Sampling

2.1. Classical Monte Carlo

  • At the time, classical Monte Carlo was conventionally used to estimate the gradient of the log-likelihood: the expensive sum over the whole vocabulary is approximated by averaging over words sampled from the model's own distribution (see the sketch below).
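
In the energy notation from Section 1, the estimator can be sketched as follows (a reconstruction, so the paper's exact symbols may differ). The gradient splits into a cheap positive term for the observed word and an expectation over the vocabulary, and the expectation is replaced by a sample average:

  \frac{\partial \log P(w_t \mid h_t)}{\partial \theta}
      = -\frac{\partial E(w_t, h_t)}{\partial \theta}
        + \sum_{w'} P(w' \mid h_t)\, \frac{\partial E(w', h_t)}{\partial \theta}

  \sum_{w'} P(w' \mid h_t)\, \frac{\partial E(w', h_t)}{\partial \theta}
      \approx \frac{1}{m} \sum_{i=1}^{m} \frac{\partial E(w'_i, h_t)}{\partial \theta},
      \quad w'_i \sim P(\cdot \mid h_t)

The catch is that sampling from P(· | h_t) itself requires the very normalization the method tries to avoid, which motivates sampling from a simpler proposal instead.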

2.2. Biased Importance Sampling

  • In this paper, biased importance sampling is proposed: samples are drawn from a cheap proposal distribution Q (an n-gram model in the paper) instead of from the network itself.
  • Each sample receives a multiplicative importance weight, and the weights are normalized by their sum, which is what makes the estimator biased.
  • Thus each sample's gradient contribution is scaled by its normalized weight.
  • This scaled update is similar to the weighted gradient update procedures used nowadays (see the sketch after this list).
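
A sketch of the biased (self-normalized) estimator, in the same reconstructed notation:

  \hat{w}_j \sim Q(\cdot \mid h_t), \quad
  r(\hat{w}_j) = \frac{e^{-E(\hat{w}_j, h_t)}}{Q(\hat{w}_j \mid h_t)}, \quad
  W = \sum_{j=1}^{m} r(\hat{w}_j)

  \sum_{w'} P(w' \mid h_t)\, \frac{\partial E(w', h_t)}{\partial \theta}
      \approx \frac{1}{W} \sum_{j=1}^{m} r(\hat{w}_j)\,
        \frac{\partial E(\hat{w}_j, h_t)}{\partial \theta}

Normalizing by W avoids computing the partition function entirely; the price is a bias that shrinks as the number of samples grows.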

2.3. Effective Sample Size (ESS)

  • The ESS plays a role similar to today's minibatch size, but it adapts to the importance weights: samples are drawn in blocks until the ESS reaches a target value (see the sketch below).
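
For importance weights r_1, ..., r_m, the effective sample size is

  \mathrm{ESS} = \frac{\big(\sum_{j=1}^{m} r_j\big)^2}{\sum_{j=1}^{m} r_j^2}

Below is a minimal Python sketch of the resulting sampling loop: draw blocks of samples from a proposal, accumulate importance weights, stop once the ESS reaches a target, and form the self-normalized average. The energy function, the near-uniform proposal, and all names here are toy stand-ins rather than the paper's neural LM and adaptive n-gram proposal.

    import numpy as np

    # Minimal sketch: biased (self-normalized) importance sampling with an
    # ESS-based stopping rule. energy() and q_probs are toy stand-ins.

    rng = np.random.default_rng(0)
    vocab_size = 1000

    def energy(word_ids):
        # Toy stand-in for the network's energy E(w, h_t) at a fixed context h_t.
        return 0.001 * word_ids

    # Toy proposal Q(. | h_t): a random but near-uniform categorical distribution.
    q_probs = rng.dirichlet(np.full(vocab_size, 5.0))

    def effective_sample_size(r):
        # ESS = (sum r)^2 / sum r^2 for importance weights r.
        return r.sum() ** 2 / (r ** 2).sum()

    target_ess = 50   # plays a role similar to an effective minibatch size
    block = 10        # samples drawn per round

    samples = np.empty(0, dtype=int)
    weights = np.empty(0)
    while weights.size == 0 or effective_sample_size(weights) < target_ess:
        new = rng.choice(vocab_size, size=block, p=q_probs)
        r = np.exp(-energy(new)) / q_probs[new]   # importance weights e^{-E} / Q
        samples = np.concatenate([samples, new])
        weights = np.concatenate([weights, r])

    # Self-normalized ("biased") weighted average; with the real model this would
    # average dE/dtheta over the sampled words instead of the toy energy values.
    w_norm = weights / weights.sum()
    negative_phase = np.sum(w_norm * energy(samples))
    print(f"{samples.size} samples drawn, ESS = {effective_sample_size(weights):.1f}")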

3. Experimental Results

  • The Brown corpus is used as the dataset.
  • The Brown corpus consists of 1,181,041 words from various American English documents.
  • The corpus was divided into training (800,000 words), validation (200,000 words), and test (the remaining 181,041 words) sets.
  • The vocabulary was truncated by mapping all “rare” words (words that appear three times or less in the corpus) into a single special word.
  • The resulting vocabulary contains 14,847 words.
  • A simple interpolated trigram, serving as baseline, achieves a perplexity of 253.8 on the test set.
Figure: training error with respect to the number of epochs.
Figure: validation and test errors with respect to CPU time.
  • The figures show that the convergence of the two networks is similar, and the same holds for the validation and test errors.
  • The network trained by sampling converges to an even lower perplexity than the ordinary one (trained with the exact gradient).
  • Surprisingly, when the sampling-trained model is allowed to keep converging, it starts to overfit at epoch 18, just as with classical training, but with a lower test perplexity of 196.6, a 3.8% improvement.
  • Total improvement in test perplexity with respect to the trigram baseline is 29%.
  • In contrast, with a non-adaptive unigram proposal, the required number of samples grew exponentially, which is why the adaptive proposal is needed.

Reference

[2007 TNN] [Bengio TNN’07]
Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Natural Language Processing (NLP)

Language Model: 2007 [Bengio TNN’07]
Machine Translation: 2014 [Seq2Seq] [RNN Encoder-Decoder], 2015 [Attention Decoder/RNNSearch]
Image Captioning: 2015 [m-RNN] [R-CNN+BRNN] [Show and Tell/NIC]

