Brief Review — Bidirectional LSTM

The First Paper Proposing Bidirectional LSTM

  • We often hear about bidirectional LSTM, but which paper first proposed it?
  • In 1) 2005 IJCNN, the authors state: “In this paper, we apply bidirectional training to a Long Short Term Memory (LSTM) network for the first time.”
  • In 2) 2005 ICANN, the same group of authors extends the work of 1) 2005 IJCNN.
  • 3) 2005 JNN is an invited paper on bidirectional LSTM, and it has the most citations of the three.
  • So these three papers, all from the same research group, are probably the first work on bidirectional LSTM.


  1. Bidirectional LSTM (2005 ICANN)
  2. Results in 2005 ICANN
  3. Results in 2005 JNN

1. Bidirectional LSTM (2005 ICANN)

LSTM Memory Block with One Cell (The block layout is drawn quite differently from the popular diagrams.)
  • (This assumes familiarity with LSTM: a memory cell for sequence modeling that propagates gradients through time better than a vanilla RNN.)
  • Four models are evaluated: Bidirectional LSTM (BLSTM), unidirectional LSTM (LSTM), bidirectional standard RNN (BRNN), and unidirectional RNN (RNN).
  • The LSTM (BLSTM) hidden layers contained 140 (93) blocks of one cell each, and the RNN (BRNN) hidden layers contained 275 (185) units. This gave approximately 100,000 weights for each network.
  • All LSTM blocks used logistic sigmoid activations: in the range [−2, 2] for the input and output squashing functions of the cell, and in the range [0, 1] for the gates.
  • The non-LSTM nets had logistic sigmoid activations in the range [0, 1] in the hidden layer.
  • As is standard for 1-of-K classification, the output layers had softmax activations, and the cross-entropy objective function was used for training. There were 61 output nodes, one for each phoneme.
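
The "approximately 100,000 weights" figure can be sanity-checked with a little arithmetic. The sketch below is an assumption-laden count, not the authors' own accounting. It assumes 26 inputs per frame (the feature vector size given in the next section), 61 outputs, one bias per unit, three peephole weights per LSTM block, full recurrent connections within each hidden layer, and, for the bidirectional nets, two separate hidden layers that both feed the output layer. All function names are hypothetical.

```python
# Rough weight counts for the four networks, under the assumptions stated above.
N_IN, N_OUT = 26, 61  # inputs per frame and phoneme classes, as given in this review


def lstm_layer_weights(n_in, n_blocks):
    """Weights in one LSTM hidden layer with one cell per block (assumed layout)."""
    # Each block has 4 units (cell input plus input/forget/output gates); each
    # unit sees the inputs, all block outputs of the same layer, and a bias.
    per_unit = n_in + n_blocks + 1
    peepholes = 3  # assumption: one peephole weight per gate
    return n_blocks * (4 * per_unit + peepholes)


def rnn_layer_weights(n_in, n_units):
    """Weights in one fully recurrent hidden layer of simple units."""
    return n_units * (n_in + n_units + 1)


def total_weights(layer_fn, n_hidden, bidirectional):
    """Hidden layer(s) plus an output layer with biases."""
    n_dirs = 2 if bidirectional else 1
    hidden = n_dirs * layer_fn(N_IN, n_hidden)
    output = N_OUT * (n_dirs * n_hidden + 1)
    return hidden + output


counts = {
    "LSTM": total_weights(lstm_layer_weights, 140, False),
    "BLSTM": total_weights(lstm_layer_weights, 93, True),
    "RNN": total_weights(rnn_layer_weights, 275, False),
    "BRNN": total_weights(rnn_layer_weights, 185, True),
}

for name, n in counts.items():
    print(f"{name:5s} {n:,d} weights")
```

Under these assumptions all four totals land within about 3% of 100,000, which is consistent with the hidden layer sizes having been chosen to give each network a comparable number of weights.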

2. Results in 2005 ICANN

  • All experiments were carried out on the TIMIT database. TIMIT contains sentences of prompted English speech, accompanied by full phonetic transcripts. It has a lexicon of 61 distinct phonemes.
  • The training and test sets contain 4620 and 1680 utterances respectively. In all experiments, 5% (184) of the training utterances were held out as a validation set and the rest were used for training.
  • All the audio data is preprocessed into frames using 12 Mel-Frequency Cepstrum Coefficients (MFCCs) from 26 filter-bank channels. The log-energy is added, along with the first-order derivatives of the log-energy and of the 12 coefficients, giving a vector of 26 coefficients per frame in total.
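
To make the 26-dimensional input concrete: 12 MFCCs plus log-energy give 13 static values per frame, and appending their first-order derivatives doubles that to 26. The sketch below only illustrates this bookkeeping. The central-difference delta formula and the `add_deltas` helper are assumptions (the review does not say which delta formula was used), and the static features are random placeholders, not real MFCCs.

```python
import random

N_MFCC = 12            # cepstral coefficients per frame, as stated above
N_STATIC = N_MFCC + 1  # plus log-energy


def add_deltas(frames):
    """Append first-order derivatives to each frame of static features.

    Assumption: a simple central difference, (x[t+1] - x[t-1]) / 2, with the
    first and last frames repeated at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev_f, next_f)]
        out.append(list(frames[t]) + delta)
    return out


# Placeholder static features (random numbers, NOT real MFCCs): 5 frames x 13 values.
random.seed(0)
static = [[random.gauss(0.0, 1.0) for _ in range(N_STATIC)] for _ in range(5)]

features = add_deltas(static)
print(len(features), "frames,", len(features[0]), "coefficients per frame")
```

In a real pipeline the static features would come from a speech front end rather than a random generator; only the final shape, 26 values per frame, is taken from the paper.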
Framewise Phoneme Classification
  • The LSTM nets were 8 to 10 times faster to train than the standard RNNs, as well as slightly more accurate.
Phoneme Recognition Accuracy for Traditional HMM and Hybrid LSTM/HMM

3. Results in 2005 JNN

Framewise phoneme classification on the TIMIT database: bidirectional LSTM
  • The table above contains the outcomes of 7 randomly initialized training runs with BLSTM. The standard deviation of the test set scores is 0.2%.
Framewise phoneme classification on the TIMIT database
  • BRNN took more than 8 times as long to converge as BLSTM.
  • The most accurate network (retrained BLSTM) needed only 17 epochs of training, which is remarkably fast: just a few hours on an ordinary desktop computer.
Learning curves for BLSTM, BRNN and MLP with no time window
  • The figure above shows the corresponding learning curves.


[2005 IJCNN] [Bidirectional LSTM (BLSTM)]
Framewise Phoneme Classification with Bidirectional LSTM Networks



