Tutorial: A Good Toy Dataset for LSTM Model

Introducing a Good Toy Dataset for LSTM Model

Weather and Climate in Delhi (Figure from tripsavvy.com)

In this story, I would like to introduce a toy dataset for practicing a simple LSTM model.

Recently, I have been looking for simple LSTM model source code that can be implemented in a short period of time. So, what I need is a good toy dataset, together with a good associated LSTM model that is easy for me to modify. (Meanwhile, I don’t need a feature extraction module such as DWT, HHT, or CNN.)


  1. Delhi Climate Dataset
  2. LSTM Model
  3. Results

1. Delhi Climate Dataset

Delhi Climate Dataset
  • By searching on Kaggle, I came across a dataset called the Delhi Climate Dataset:


I simply downloaded the dataset from Kaggle.

  • It contains two CSV files, one for training and one for testing. If a validation set is needed for hyperparameter tuning, the training set should be split further.
  • Timestamp: The date on the left-hand side of the dataset serves as the timestamp.
  • Input/Output Features: On the right-hand side, there are mean temperature, relative humidity, wind speed, and mean pressure. Three of them can serve as the input features, and the remaining one can be the output label that we want to predict.
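The feature/label split above can be sketched with pandas. The small frame below only mimics the structure of the Kaggle CSVs; the column names are my assumption, so check them against your own download:

```python
import pandas as pd

# Tiny stand-in frame with the same column layout as the dataset
# (column names are assumed; verify against the downloaded CSV).
df = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=5, freq="D"),
    "meantemp": [10.0, 7.4, 7.2, 8.7, 6.0],
    "humidity": [84.5, 92.0, 87.0, 71.3, 86.8],
    "wind_speed": [0.0, 2.98, 4.63, 1.23, 3.7],
    "meanpressure": [1015.7, 1017.8, 1018.7, 1017.2, 1016.5],
})

# Three columns as inputs, the remaining one as the prediction target.
features = df[["humidity", "wind_speed", "meanpressure"]]
label = df["meantemp"]

print(features.shape, label.shape)  # (5, 3) (5,)
```

The same two lines work on the real training CSV once it is loaded with `pd.read_csv`.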

2. LSTM Model

  • Another amazing thing is that many Kaggle contributors share their code. Since my target is a simple LSTM model, I came across the LSTM model in the link below:


I just followed the tutorial to load the dataset.
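Before a time series can be fed to an LSTM, it is usually sliced into fixed-length windows of past steps paired with next-step targets. Below is a minimal sketch of that preprocessing step; the tutorial's own helper may differ, and the window length of 30 is just an illustrative choice:

```python
import numpy as np

def make_windows(series, window):
    """Slice a 2-D array of shape (time, features) into
    (samples, window, features) inputs plus next-step targets."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # `window` past time steps
        y.append(series[i + window, 0])  # next value of column 0
    return np.array(X), np.array(y)

data = np.random.rand(100, 3)            # dummy series: 100 days, 3 features
X, y = make_windows(data, window=30)
print(X.shape, y.shape)                  # (70, 30, 3) (70,)
```

Each sample in `X` then holds 30 consecutive days of features, and `y` holds the value to be predicted for the following day.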

LSTM Model
  • A TensorFlow + Keras framework is used to implement an LSTM model, as shown above.
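A minimal Keras LSTM regressor along these lines can be sketched as below. The input shape assumes 30-day windows of 3 features, and the layer sizes are my own choice, not necessarily those of the Kaggle notebook:

```python
import tensorflow as tf

# Minimal LSTM regressor: 30 past time steps of 3 features -> 1 value.
# Layer sizes here are illustrative, not taken from the Kaggle notebook.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 3)),
    tf.keras.layers.LSTM(64),   # 64 hidden units
    tf.keras.layers.Dense(1),   # single regression output
])
model.compile(optimizer="adam", loss="mse")

print(model.output_shape)  # (None, 1)
```

Changing the number of input features, layers, or hidden neurons is then a one-line edit, which is exactly the kind of modification mentioned later in this story.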

3. Results


Running the code, we get an MSE of 2.439 and the curve of predicted temperatures.

  • (Note that the model here always uses the actual observed values to predict the next future value, i.e. one-step-ahead forecasting. That is why it obtains such a low MSE and such a good-looking predicted temperature curve. For genuine multi-step forecasting, autoregression, which feeds the model’s own predictions back in as inputs, should be used instead.)
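The autoregressive rollout mentioned above can be sketched in a few lines. Here `predict` stands in for any one-step model (a dummy mean-of-window predictor is used for illustration); the key point is that each prediction replaces the oldest input instead of an observed value:

```python
import numpy as np

def rollout(predict, history, steps):
    """Autoregressive forecasting sketch: feed each prediction back
    into the input window instead of the actual observed value."""
    window = list(history)
    preds = []
    for _ in range(steps):
        y_hat = predict(np.array(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]  # drop oldest, append prediction
    return preds

# Dummy one-step "model": predicts the mean of the window.
preds = rollout(lambda w: float(w.mean()), history=[1.0, 2.0, 3.0], steps=5)
print(len(preds))  # 5
```

Evaluating with this closed-loop scheme typically gives a noticeably higher MSE than the one-step-ahead number, because prediction errors compound over the horizon.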

If you have a similar dataset structure or problem, the above code is a good starting point for an LSTM model. Of course, some modifications may be needed, for example loading your own dataset and adjusting the model architecture (number of input features, number of layers, and number of hidden neurons).

That said, the performance on your own dataset may not be as good with the above code, and some fine-tuning may be needed. It also depends on many factors, e.g. the size of your dataset, whether the number of samples is large enough, and whether the samples are of good quality.

Nevertheless, with the above sample code, I don’t need to start from zero, e.g. implementing the dataset loading, the LSTM model, the figure visualization, and so on, which saves a lot of time.

Kaggle as a service!