Tutorial: Creating HDF5 Dataset

A Simple Tutorial to Create a Hierarchical Data Format (HDF5) Dataset, Using the CIFAR-10 Dataset as an Example

Sik-Ho Tsang
3 min read · Jul 18, 2020

In this story, a simple tutorial is described to create a Hierarchical Data Format (HDF5) dataset, using the CIFAR-10 dataset as an example. I study this because HDF5 can help with deep learning model training. (Sik-Ho Tsang @ Medium)

Hierarchical Data Format (HDF5) Dataset (From https://www.neonscience.org/about-hdf5)

Here are some advantages of HDF5:

  • The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data.
  • HDF5 uses a “file directory”-like structure that allows you to organize data within the file in many different structured ways, as you might do with files on your computer. The HDF5 format also allows embedding of metadata, making it self-describing.
  • And, of course, it is well supported in Python, via the h5py package.

Let’s get started!

Outline

  1. Include Package Library
  2. Download CIFAR-10 Dataset
  3. Create HDF5 Dataset
  4. Read HDF5 Dataset

1. Include Package Library

  • h5py is the package we need for the HDF5 dataset.
from keras.datasets import cifar10
import numpy as np
import h5py

2. Download CIFAR-10 Dataset

  • I think many people are already familiar with this step:
(x_img_train, y_label_train), (x_img_test, y_label_test) = cifar10.load_data()
  • If we get the shape of the CIFAR-10 dataset:
print('x_img_train:', x_img_train.shape)
print('y_label_train:', y_label_train.shape)
print('x_img_test:', x_img_test.shape)
print('y_label_test:', y_label_test.shape)
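
The expected output, i.e. the standard CIFAR-10 split of 50,000 training images and 10,000 test images:

x_img_train: (50000, 32, 32, 3)
y_label_train: (50000, 1)
x_img_test: (10000, 32, 32, 3)
y_label_test: (10000, 1)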

3. Create HDF5 Dataset

# create HDF5 file
with h5py.File('dataset_cifar10.hdf5', 'w') as hf:
    dset_x_train = hf.create_dataset('x_train', data=x_img_train, shape=(50000, 32, 32, 3), compression='gzip', chunks=True)
    dset_y_train = hf.create_dataset('y_train', data=y_label_train, shape=(50000, 1), compression='gzip', chunks=True)
    dset_x_test = hf.create_dataset('x_test', data=x_img_test, shape=(10000, 32, 32, 3), compression='gzip', chunks=True)
    dset_y_test = hf.create_dataset('y_test', data=y_label_test, shape=(10000, 1), compression='gzip', chunks=True)
  • dataset_cifar10.hdf5: The name of the HDF5 file.
  • ‘w’: write mode; the file is created, or overwritten if it already exists.

3.1. create_dataset

  • create_dataset: creates a dataset in the HDF5 file. As shown above, 4 datasets are created.
  • data: the array whose contents are written into the dataset.
  • shape: the shape of the dataset; when data is given, it must match the shape of data. It can also be omitted, as in the sketch below.
  • compression: the filter used to compress the data (gzip here).
  • chunks: with chunks=True, h5py picks a chunk shape automatically. Chunked storage is required for compression, and it enables efficient partial reads.
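
Since h5py can infer both shape and dtype from the data argument, a shorter, equivalent version (a minimal sketch) is:

# shape and dtype are inferred from the data argument
with h5py.File('dataset_cifar10.hdf5', 'w') as hf:
    hf.create_dataset('x_train', data=x_img_train, compression='gzip', chunks=True)
    hf.create_dataset('y_train', data=y_label_train, compression='gzip', chunks=True)
    hf.create_dataset('x_test', data=x_img_test, compression='gzip', chunks=True)
    hf.create_dataset('y_test', data=y_label_test, compression='gzip', chunks=True)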

You may ask, why use an HDF5 dataset at all? HDF5 can support a large amount of data, in which the dataset size is larger than the RAM size, because data is only read from disk when it is actually accessed.

4. Read HDF5 Dataset

# read HDF5 file
with h5py.File('dataset_cifar10.hdf5', 'r') as hf:
    dset_x_train = hf['x_train']
    dset_y_train = hf['y_train']
    dset_x_test = hf['x_test']
    dset_y_test = hf['y_test']

    # print inside the with block: the datasets are only valid while the file is open
    print(dset_x_train)
    print(dset_y_train)
    print(dset_x_test)
    print(dset_y_test)
  • ‘r’: read mode is used.
  • print: printing an HDF5 dataset shows its name, shape, and data type, as in the output below.
  • When we want to read the data, indexing works just like NumPy; see the slicing sketch below. You can try it.
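
The prints show the dataset metadata, something like this (type “|u1” is uint8):

<HDF5 dataset "x_train": shape (50000, 32, 32, 3), type "|u1">
<HDF5 dataset "y_train": shape (50000, 1), type "|u1">
<HDF5 dataset "x_test": shape (10000, 32, 32, 3), type "|u1">
<HDF5 dataset "y_test": shape (10000, 1), type "|u1">

And a minimal sketch of NumPy-style reading; slicing loads only the requested rows from disk:

with h5py.File('dataset_cifar10.hdf5', 'r') as hf:
    first_batch = hf['x_train'][0:128]  # numpy array, shape (128, 32, 32, 3)
    one_label = hf['y_train'][0]        # numpy array, shape (1,)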

I study HDF5 since it can support an amount of data larger than the RAM size, which is exactly what I need for training deep learning models.

With the use of HDF5, I can increase the batch size for training as well, since only one batch needs to sit in memory at a time.
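
As a minimal sketch of that idea (hdf5_batches is a hypothetical helper, not part of h5py), mini-batches can be drawn lazily from the file:

def hdf5_batches(path, batch_size=256):
    # hypothetical helper: yields (x, y) mini-batches lazily,
    # so only one batch is in RAM at a time
    with h5py.File(path, 'r') as hf:
        x, y = hf['x_train'], hf['y_train']
        for start in range(0, x.shape[0], batch_size):
            yield x[start:start + batch_size], y[start:start + batch_size]

for x_batch, y_batch in hdf5_batches('dataset_cifar10.hdf5'):
    pass  # train on the batch here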

Creating HDF5 datasets without specifying the data and data shape will be shown in the next tutorial. Recursively reading multiple files under multiple sub-directories into a dataset will also be covered. Please stay tuned.

This is the 14th story this month.

