! ! ! NO LONGER MAINTAINED ! ! ! Check https://github.com/zarr-developers/zarr-python.
v0.3
h5pySWMR is a drop-in replacement for the h5py library. h5pySWMR synchronizes read and write access to HDF5 files. It allows parallel reading, but writing is serialized. With h5pySWMR, you can read and write HDF5 files from parallel processes (with threads, there are some limitations, see below) without having to fear data corruption. Note that, with plain h5py, reading from and writing to the same file at the same time can result in data corruption.
Example:
import numpy as np
# replaces 'from h5py import File'
from h5pyswmr import File
f = File('test.h5', 'w')
# create a dataset containing a 500x700 random array
f.create_dataset(name='/mygroup/mydataset', data=np.random.random((500, 700)))
# read data back into memory
data = f['/mygroup/mydataset'][:]
# no need to explicitly close the file (files are opened/closed when accessed)
When you want to read and write hdf5 files at the same time, i.e., from parallel processes.
Yes. Read section 'Limitations', though.
Of course it is.
No.
Almost. There is a small overhead due to synchronization and because files must be opened/closed for every operation. This overhead is negligible, especially if you read/write large amounts of data.
HDF5 (Hierarchical Data Format 5) is a binary file format designed to store large amounts of numerical raster data, i.e., arrays. It also allows data to be organized in so-called groups (hence the name "Hierarchical"). h5py is a great library that provides Pythonic bindings to the HDF5 library.
Even though HDF5 (and h5py) does not allow parallel reading and writing, parallel reading is possible (with the restriction that files are opened only after processes are forked). This allows us, using appropriate synchronization techniques, to provide parallel reading and serialized writing, i.e., all other processes (reading or writing) are forced to wait while a file is being written to. This is sometimes called "single writer multiple reader" (SWMR). h5pySWMR implements a standard solution to the readers-writers problem, giving preference to writers. Check the 1971 paper by Courtois, Heymans, and Parnas if you're interested. A redis-server is used to implement inter-process locks and counters.
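To illustrate the idea, here is a minimal sketch of the writer-preference solution from the Courtois, Heymans, and Parnas paper. It is not h5pySWMR's actual code: it synchronizes threads inside one process using threading.Semaphore, whereas h5pySWMR keeps its locks and counters in redis so that separate processes can share them. The class name WriterPreferenceLock and its methods are made up for this example.

```python
import threading
from contextlib import contextmanager


class WriterPreferenceLock:
    """In-process readers-writers lock that gives preference to writers.

    Hypothetical illustration only; h5pySWMR uses redis-based locks
    and counters instead of threading primitives.
    """

    def __init__(self):
        self._readcount = 0   # number of readers currently reading
        self._writecount = 0  # number of writers writing or waiting
        self._mutex1 = threading.Semaphore(1)  # protects _readcount
        self._mutex2 = threading.Semaphore(1)  # protects _writecount
        self._mutex3 = threading.Semaphore(1)  # lets only one reader queue on _r
        self._r = threading.Semaphore(1)  # held by writers to keep new readers out
        self._w = threading.Semaphore(1)  # exclusive access to the resource

    @contextmanager
    def reading(self):
        with self._mutex3:
            with self._r:
                with self._mutex1:
                    self._readcount += 1
                    if self._readcount == 1:
                        # the first reader locks out writers
                        self._w.acquire()
        try:
            yield
        finally:
            with self._mutex1:
                self._readcount -= 1
                if self._readcount == 0:
                    # the last reader lets writers in again
                    self._w.release()

    @contextmanager
    def writing(self):
        with self._mutex2:
            self._writecount += 1
            if self._writecount == 1:
                # the first waiting writer blocks new readers
                self._r.acquire()
        self._w.acquire()
        try:
            yield
        finally:
            self._w.release()
            with self._mutex2:
                self._writecount -= 1
                if self._writecount == 0:
                    # no writer left: readers may enter again
                    self._r.release()


# usage: any number of readers may overlap, writers are exclusive
demo_lock = WriterPreferenceLock()
shared = []
with demo_lock.writing():
    shared.append(1)
with demo_lock.reading():
    snapshot = list(shared)
```

As soon as a writer is waiting, newly arriving readers are held back until all writers are done. That is what "giving preference to writers" means here.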
It will be, soon...
Please open an issue on github.
- True parallel reading can only be achieved with parallel processes. Thread concurrency is not supported. This is a limitation of h5py, which currently does not release the global interpreter lock (GIL) for I/O operations.
- After a crash (or if the process is killed by sending a SIGKILL signal), the redis-based synchronization algorithm may end up in an inconsistent state. This can result in deadlocks or data corruption. Proper process termination (SIGTERM or pressing Ctrl+C) is fine, though.
- Be careful when using h5pySWMR in a multithreaded environment. Signal handling does not work well with threads. Therefore, it is very likely that you end up with pending locks when you terminate threads during I/O operations.
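The difference between SIGTERM and SIGKILL can be sketched as follows. This is only an illustration of the general mechanism, not h5pySWMR's implementation: the functions install_sigterm_handler and locked_io are hypothetical, and a threading.Lock stands in for a redis-based lock. A SIGTERM handler can raise an exception so that cleanup code in finally blocks runs and the lock is released. SIGKILL cannot be caught, so no cleanup runs. Python executes signal handlers in the main thread only, which is why this approach does not help worker threads.

```python
import signal
import threading


def install_sigterm_handler():
    """Make SIGTERM raise SystemExit, much like Ctrl+C raises KeyboardInterrupt.

    Must be called from the main thread. Returns the previous handler.
    """
    def handler(signum, frame):
        raise SystemExit(128 + signum)

    return signal.signal(signal.SIGTERM, handler)


def locked_io(lock, operation):
    """Run operation() while holding lock, releasing it even on termination."""
    lock.acquire()
    try:
        return operation()
    finally:
        # runs on normal return, on Ctrl+C and on a handled SIGTERM,
        # but never after SIGKILL
        lock.release()
```

With the handler installed, a process that receives SIGTERM in the middle of operation() unwinds through the finally block before exiting, so the lock is not left pending.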
In general, you can simply replace "import h5py" with "import h5pyswmr as h5py" and everything should work as expected. There are a few differences and limitations, though:
- TODO
Using pip (globally or in a virtualenv):
$ pip install git+https://github.com/meteotest/h5pySWMR.git
Manually:
$ git clone https://github.com/meteotest/h5pySWMR.git
$ cd h5pySWMR
$ python setup.py install
To make sure everything works as expected, run the following:
import h5pyswmr
h5pyswmr.test()
It probably works with any recent version of Python, h5py, and redis. But I've only tested it with Python 2.7/3.4 and the following library versions:
- h5py 2.3.1 to 2.5
- redis 2.10.3
See http://www.h5py.org for h5py requirements (basically NumPy, Cython and the HDF5 C-library).
h5pyswmr also requires a running redis server (see below).
Note that h5pyswmr is expecting a running redis server on localhost:6379 (on Debian-based systems, "apt-get install redis-server" is all you need to do). These settings are hard-coded but can be modified at run time (a more elegant solution will be provided in future versions):
import redis
from h5pyswmr import locking
# overwrite redis connection object
locking.redis_conn = redis.StrictRedis(host='localhost', port=6666, db=0,
                                       decode_responses=True)
For performance reasons (after all, hdf5 is all about performance), you may want to keep the redis server on the same machine.